When we find an extent state that starts before our range's start we
split it and jump into the 'search_again' label with our start offset
remaining the same, making us then go to the 'again' label and search
again for an extent state that contains the 'start' offset, and this
time it finds the same extent state but with its start offset set to
our range's start offset (due to the split). This is because we have
consumed the preallocated extent state record and we may need to split
again, and by jumping to 'again' we release the spinlock, allocate a new
prealloc state and restart the search.
However we may not need to restart and allocate a new extent state in
case we don't find extent states that cross our end offset, therefore
no need for further extent state splits, or we may be able to do an
atomic allocation (which is quick even if it fails). In these cases
it's a waste to restart the search.
So change the behaviour to do the restart only if we need to reschedule,
otherwise fall through - if we need to allocate an extent state for split
operations, we will try an atomic allocation and if that fails we will do
the restart as before.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several variables are defined as integers but used as booleans, and the
'delete' variable can be made const since it's not changed after being
declared. So change them to proper booleans and simplify setting their
value.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a couple error branches where we have an error stored in the 'err'
variable and then jump to the 'out' label, however we don't return that
error, we just return 0. Normally this is not a problem since those error
branches call extent_io_tree_panic() which triggers a BUG() call, however
it's possible to have rather exotic kernel config with CONFIG_BUG disabled
in which case the BUG() call does nothing and we fallthrough. So make sure
to return the error, not just to fix that exotic case but also to make the
code less confusing. While at it also rename the 'err' variable to 'ret'
since this is the style we prefer and use more widely.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If split_state() returned an error we call extent_io_tree_panic() which
will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an
uncommon and exotic scenario, then we fallthrough and hit a use after free
when calling clear_state_bit() since the extent state record which the
local variable 'prealloc' points to was freed by split_state().
So jump to the label 'out' after calling extent_io_tree_panic() and set
the 'prealloc' pointer to NULL since split_state() has already freed it
when it hit an error.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to check if split_state() returned an error twice, instead
unify into a single if statement after setting 'prealloc' to NULL, because
on error split_state() frees the 'prealloc' extent state record.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is introduced by commit a512bbf855 ("Btrfs: superblock
duplication") at the beginning of btrfs.
It leaved a comment saying we'd need a special mount option to read all
super blocks, but it's never been implemented and there was not
need/request for it. The check/rescue tools are able to start from a
specific copy and use it as primary eventually.
This means btrfs_read_dev_super() is always reading the first super
block, making all the code finding the latest super block unnecessary.
Just remove that function and replace all call sites with
btrfs_read_disk_super(bdev, 0, false).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have two functions to read a super block from a block device:
- btrfs_read_dev_one_super()
Exported from disk-io.c
- btrfs_read_disk_super()
Local to volumes.c
And they have some minor differences:
- btrfs_read_dev_one_super() uses @copy_num
Meanwhile btrfs_read_disk_super() relies on the physical and expected
bytenr passed from the caller.
The parameter list of btrfs_read_dev_one_super() is more user
friendly.
- btrfs_read_disk_super() makes sure the label is NUL terminated
We do not need two different functions doing the same job, so merge the
behavior into btrfs_read_disk_super() by:
- Remove btrfs_read_dev_one_super()
- Export btrfs_read_disk_super()
The name pairs with btrfs_release_disk_super() perfectly.
- Change the parameter list of btrfs_read_disk_super() to mimic
btrfs_read_dev_one_super()
All existing callers are calculating the physical address and expect
bytenr before calling btrfs_read_disk_super() already.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The `free_eb` label is used only once. Simplify by moving the code inplace.
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we have this ugly back and forth with the btree writeback
where we find the folio, find the eb associated with that folio, and
then attempt to writeback. This results in two different paths for
subpage ebs and >= page size ebs.
Clean this up by adding our own infrastructure around looking up tagged
ebs and writing the ebs out directly. This allows us to unify the
subpage and >= pagesize IO paths, resulting in a much cleaner writeback
path for extent buffers.
I ran this through fsperf on a VM with 8 CPUs and 16GiB of RAM. I used
smallfiles100k, but reduced the files to 1k to make it run faster, the
results are as follows, with the statistically significant improvements
marked with *, there were no regressions. fsperf was run with -n 10 for
both runs, so the baseline is the average 10 runs and the test is the
average of 10 runs.
smallfiles100k results
metric baseline current stdev diff
================================================================================
avg_commit_ms 68.58 58.44 3.35 -14.79% *
commits 270.60 254.70 16.24 -5.88%
dev_read_iops 48 48 0 0.00%
dev_read_kbytes 1044 1044 0 0.00%
dev_write_iops 866117.90 850028.10 14292.20 -1.86%
dev_write_kbytes 10939976.40 10605701.20 351330.32 -3.06%
elapsed 49.30 33 1.64 -33.06% *
end_state_mount_ns 41251498.80 35773220.70 2531205.32 -13.28% *
end_state_umount_ns 1.90e+09 1.50e+09 14186226.85 -21.38% *
max_commit_ms 139 111.60 9.72 -19.71% *
sys_cpu 4.90 3.86 0.88 -21.29%
write_bw_bytes 42935768.20 64318451.10 1609415.05 49.80% *
write_clat_ns_mean 366431.69 243202.60 14161.98 -33.63% *
write_clat_ns_p50 49203.20 20992 264.40 -57.34% *
write_clat_ns_p99 827392 653721.60 65904.74 -20.99% *
write_io_kbytes 2035940 2035940 0 0.00%
write_iops 10482.37 15702.75 392.92 49.80% *
write_lat_ns_max 1.01e+08 90516129 3910102.06 -10.29% *
write_lat_ns_mean 366556.19 243308.48 14154.51 -33.62% *
As you can see we get about a 33% decrease runtime, with a 50%
throughput increase, which is pretty significant.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation for changing how we do writeout of extent buffers, start
tagging the extent buffer xarray with DIRTY and WRITEBACK to make it
easier to find extent buffers that are in either state.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to fully utilize xarray tagging to improve writeback we need to
convert the buffer_radix to a proper xarray. This conversion is
relatively straightforward as the radix code uses the xarray underneath.
Using xarray directly allows for quite a lot less code.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use the "btrfs-" prefix for our workqueues, the discard has
underscore instead of dash, so unify it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have only two chunk allocation policies right now and the
switch/cases don't handle an unknown one properly. The error is in the
impossible category (the policy is stored only in memory), we don't have
to BUG(), falling back to regular policy should be safe.
Signed-off-by: David Sterba <dsterba@suse.com>
Both the variable and the parameter are used as logical indicators so
convert them to bool.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Old code has a lot of int for bool return values, bool is recommended
and done in new code. Convert the trivial cases that do simple 0/false
and 1/true. Functions comment are updated if needed.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When btrfs subpage support (fs block < page size) was introduced, a
subpage filesystem will only reject tree blocks which cross page
boundaries.
This used to be a compromise to simplify the tree block handling and
still allowing subpage cases to read some old converted filesystems
which did not have proper chunk alignment.
But in practice, suppose we have the following unaligned tree block on a
64K page sized system:
0 32K 44K 60K 64K
| |///////////////| |
Although btrfs has no problem reading the tree block at [44K, 60K), if
extent allocator is allocating another tree block, it may choose the
range [60K, 74K), as extent allocator has no awareness if it's a subpage
metadata request or not.
Then we'd get -EINVAL from the following sequence:
btrfs_alloc_tree_block()
|- btrfs_reserve_extent()
| Which returned range [60K, 74K)
|- btrfs_init_new_buffer()
|- btrfs_find_create_tree_block()
|- alloc_extent_buffer()
|- check_eb_alignment()
Which returned -EINVAL, because the range crosses page
boundary.
This situation will not fix itself and should mostly mark the fs
read-only.
Thankfully we didn't really get such reports in the real world because:
- The original unaligned tree block is only caused by older
btrfs-convert
It's before the btrfs-convert rework was done in v4.6, where converted
btrfs filesystem can have metadata block groups which are not aligned
to nodesize nor stripe size (64K).
But after btrfs-progs v4.6, all chunks allocated will be stripe (64K)
aligned, thus no more such problem.
Considering how old the fix is (v4.6 was released almost 10 years ago),
subpage support for btrfs was introduced in v5.15, it should be safe to
reject those unaligned tree blocks.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is just a trivial change. The code looks a bit more readable this way, IMO.
Move initialization of existing_folio to the beginning of the retry loop
so it's set to NULL at one place.
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Trivial renames to unify the naming of blk_status_t variables/parameters.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_submit_chunk().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're using 'status' for the blk_status_t variables, rename 'ret' so we can
use it for generic errors.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to have a separate variable to read the bio status, 'ret'
works for that just fine so remove 'error'.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're using 'status' for the blk_status_t variables, rename 'ret' so we
can use it for proper return type.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_bio_csum().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_bio_csum().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_submit_chunk().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The unsigned type is a recommended practice (CWE-190, CWE-194) for bit
shifts to avoid problems with potential unwanted sign extensions.
Although there are no such cases in btrfs codebase, follow the
recommendation.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
First added (but not effectively used) in 02c372e1f0 ("btrfs: add
support for inserting raid stripe extents"). The structure is
initialized to zeros so the only use in btrfs_insert_one_raid_extent()
u64 length = bioc->stripes[i].length;
struct btrfs_raid_stride *raid_stride = &stripe_extent->strides[i];
if (length == 0)
length = bioc->size;
the 'if' always happens.
Last use in 4016358e85 ("btrfs: remove unused variable length in
btrfs_insert_one_raid_extent()") was an obvious cleanup. It seems to be
safe to remove, raid-stripe-tree works without using it since 6.6.
This was found by tool https://github.com/jirislaby/clang-struct .
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using the helper makes it a bit more clear that we're accessing the
first list entry.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The use of ASSERT(0) is maybe useful for some cases but more like a
notice for developers. Assertions can be compiled in independently so
convert it to a debugging helper.
The difference is that it's just a warning and will not end up in BUG().
The converted cases are in connection with proper error handling.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the conditional warning instead of typing the whole condition.
Optional message is printed where it seems clear what could be the
problem.
Conversion is left out in btree_csum_one_bio() because of the additional
condition.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add conditional WARN() wrapper that's enabled only in debug build. It
should be used for unexpected conditions that should be noisy. Use it
instead of ASSERT(0). As it will not lead to BUG() make sure that
continuing is still possible, e.g. the error is handled anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The file volumes.c has about 40 assertions and half of them are suitable
for ASSERT() with additional data.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently ASSERT() prints the stringified condition and without macro
expansions so simple constants like BTRFS_MAX_METADATA_BLOCKSIZE remain
readable in the output.
There are expressions where we'd like to see the exact values but all we
get is something like:
assertion failed: em->start <= start && start < extent_map_end(em), in fs/btrfs/extent_map.c:613
It would be nice to be able to print any additional information to help
understand the problem. With some preprocessor magic and compile-time
optimizations we can enhance ASSERT to work like that as well:
ASSERT(value > limit, "value=%llu limit=%llu", value, limit);
with free-form printk arguments that will be part of the assertion
message.
Pros:
- helps debugging and understanding reported problems
- the optional format is verified at compile-time
Cons:
- increases the .ko size
- writing the assertion code is repetitive (condition, format, values)
- format and variable type must match (extra lookup)
- needs gcc 8.x and newer, otherwise it's the short format
Recommended use is for non-trivial expressions, so basic ASSERT(value) can be
used for pointers or sometimes integers.
The format has been slightly updated to also print the result of the
evaluation of the condition, appended to the stringified condition as
"condition :: <value>".
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit b28b1f0ce4 ("btrfs: delayed-ref: Introduce better documented
delayed ref structures") introduced BTRFS_REF_LAST, which can be used
for sanity checking, e.g. in switch/case or for loops.
In btrfs_ref_type() there is an assertion
ASSERT(ref->type == BTRFS_REF_DATA || ref->type == BTRFS_REF_METADATA);
to validate the values so we don't need the ending enum.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This removes the last direct poke into bvec internals in btrfs.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of the old @page and @page_offset pair inside scrub, here we can
directly use the virtual address for a sector.
This has the following benefit:
- Simplified parameters
A single @kaddr will repair @page and @page_offset.
- No more unnecessary kmap/kunmap calls
Since all pages utilized by scrub is allocated by scrub, and no
highmem is allowed, we do not need to do any kmap/kunmap.
And add an ASSERT() inside the new scrub_stripe_get_kaddr() to
catch any unexpected highmem page.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using a @page + @pg_offset pair inside sector_ptr structure,
use a single physical address instead.
This allows us to grab both the page and offset from a single u64 value.
Although we still need an extra bool value, @has_paddr, to distinguish
if the sector is properly mapped (as the 0 physical address is totally
valid).
This change doesn't change the size of structure sector_ptr, but reduces
the parameters of several functions.
Note: the original idea and patch is from Christoph Hellwig
(https://lore.kernel.org/linux-btrfs/20250409111055.3640328-7-hch@lst.de/)
but the final implementation is different.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[ Use physical addresses instead to handle highmem. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Flatten the two loops by open coding bio_for_each_segment() and advancing
the iterator one sector at a time.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Fix a bug that @offset is not increased. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move kmapping the page out of btrfs_check_sector_csum().
This allows using bvec_kmap_local() where suitable and reduces the number
of kmap*() calls in the raid56 code.
This also means btrfs_check_sector_csum() will only accept a properly
kmapped address.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using physical address has the following advantages:
- All involved callers only need a single pointer
Instead of the old @folio + @offset pair.
- No complex poking into the bio_vec structure
As a bio_vec can be single or multiple paged, grabbing the real page
can be quite complex if the bio_vec is a multi-page one.
Instead bvec_phys() will always give a single physical address, and it
cab be easily converted to a page.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bio implementation is not something we should really mess around,
and we shouldn't recalculate the pos from the folio over and over.
Instead just track then end of the current bio in logical file offsets
in the btrfs_bio_ctrl, which is much simpler and easier to read.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
end_bbio_data_read() checks that each iterated folio fragment is aligned
and justifies that with block drivers advancing the bio. But block
driver only advance bi_iter, while end_bbio_data_read() uses
bio_for_each_folio_all() to iterate the immutable bi_io_vec array that
can't be changed by drivers at all.
Furthermore btrfs has already did the alignment check of the file
offset inside submit_one_sector(), and the size is fixed to fs block
size, there is no need to re-do the alignment check again inside the
endio function.
So just remove the unnecessary alignment check along with the incorrect
comment.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment mistakenly says the function is returning PTR_ERR instead of
ERR_PTR. Fix it and update it so it's more descriptive.
Signed-off-by: Charles Han <hanchunchao@inspur.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ Enhance the function comment. ]
Signed-off-by: David Sterba <dsterba@suse.com>
Make this simpler by returning directly when there's no other cleanup
needed.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Do not duplicate the cleanup after failed initialization
in btrfs_bioset_init() and reuse the exit function btrfs_bioset_exit().
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using 'i' for a parameter is confusing and conforming to current
preferences, so rename it to 'iter'.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we reject large folios for defrag gracefully, but the
implementation itself is already mostly large folios compatible.
There are several parts of defrag in btrfs:
- Extent map checking
Aka, defrag_collect_targets(), which prepares a list of target ranges
that should be defragged.
This part is completely folio unrelated, thus it doesn't care about
the folio size.
- Target folio preparation
Aka, defrag_prepare_one_folio(), which lock and read (if needed) the
target folio.
Since folio read and lock are already supporting large folios, this
part needs only minor changes.
- Redirty the target range of the folio
This is already done in a way supporting large folios.
So it's pretty straightforward to enable large folios for defrag:
- Do not reject large folios for experimental builds
This affects the large folio check inside defrag_prepare_one_folio().
- Wait for ordered extents of the whole folio in
defrag_prepare_one_folio()
- Lock the whole extent range for all involved folios in
defrag_one_range()
- Allow the folios[] array to be partially empty
Since we can have large folios, folios[] will not always be full.
This affects:
* How to allocate folios in defrag_one_range()
Now we cannot use page index, but use the end position of the folio
as an iterator.
* How to free the folios[] array
If we hit an empty slot, it means we have large folios and already
hit the end of the array.
* How to mark the range dirty
Instead of use page index directly, we have to go through each
folio, and check if the folio covers the defrag target inside
defrag_one_locked_target().
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All compression algorithms inside btrfs are not supporting large folios
due to the following points:
- btrfs_calc_input_length() is assuming page sized folio
- kmap_local_folio() usages are using offset_in_page()
Prepare them to support large data folios by:
- Add a folio parameter to btrfs_calc_input_length()
And use that folio parameter to calculate the correct length.
Since we're here, also add extra ASSERT()s to make sure the parameter
@cur is inside the folio range.
This affects only zlib and zstd. Lzo compresses at most one block at a
time, thus not affected.
- Use offset_in_folio() to calculate the kmap_local_folio() offset
This affects all 3 algorithms.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to have a double underscore prefix as there's no variant
of the function without it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to have a double underscore prefix as there's no variant
of the function without it anymore.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename all the exported functions from extent_map.h that don't have a
'btrfs_' prefix in their names, so that they are consistent with all the
other functions, to make it clear they are btrfs specific functions and
to avoid potential name collisions in the future with functions defined
elsewhere in the kernel.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Most of our tracepoints have the 'btrfs_' prefix in their names but a few
of them are missing, making it inconsistent. So add the prefix to the ones
that are missing it, creating consistency, making it clear for users these
are btrfs tracepoints and eventually avoid name collisions with other
tracepoints defined by other kernel subsystems.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function needs only to return true or false, so there's no need to
return an integer. Currently it returns 0 when a range with the given
bits is set and 1 when not found, which is a bit counter intuitive too.
So change the function to return a bool instead, returning true when a
range is found and false otherwise. Update the function's documentation
to mention the return value too.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that set_extent_bit() was renamed to btrfs_set_extent_bit(), there's
no need to have a __set_extent_bit() function, we can just remove the
double underscore prefix, which we try to avoid according to the coding
style conventions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename the remaning exported functions that don't have a 'btrfs_' prefix.
By convention exported functions should have such prefix to make it clear
they are btrfs specific and to avoid collisions with functions from
elsewhere in the kernel.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is an exported function so it should have a 'btrfs_' prefix by
convention, to make it clear it's btrfs specific and to avoid collisions
with functions from elsewhere in the kernel.
Rename the function to add 'btrfs_' prefix to it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.
So add a 'btrfs_' prefix to their names to make it clear they are from
btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.
So add a 'btrfs_' prefix to their name to make it clear they are from
btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've tested that we are dealing with io tree that is associated to an
inode (its owner is IO_TREE_INODE_IO), so there's no need to call
btrfs_extent_io_tree_to_inode() in a separate line and we just assign
tree->inode to the local inode variable when we declare it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.
So add a 'btrfs_' prefix to their name to make it clear they are from
btrfs. Also remove the 'const' suffix from extent_io_tree_to_inode_const()
since there's no non-const variant anymore and makes the naming consistent
with extent_io_tree_to_fs_info() (no 'const' suffix and returns a const
pointer).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.
So add a 'btrfs_' prefix to their name to make it clear they are from
btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is an exported function so it should have a 'btrfs_' prefix by
convention, to make it clear it's btrfs specific and to avoid collisions
with functions from elsewhere in the kernel.
So rename it to btrfs_set_extent_bit().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. One of them has a
double underscore prefix which is also discouraged.
So remove double underscore prefix where applicable and add a 'btrfs_'
prefix to their name to make it clear they are from btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. Their double
underscore prefix is also discouraged.
So remove their double underscore prefix, add a 'btrfs_' prefix to their
name to make it clear they are from btrfs and a '_bits' suffix to avoid
collision with btrfs_lock_extent() and btrfs_try_lock_extent().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a prefix to
their name.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a prefix to
their name.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These trace events don't have the 'btrfs_' prefix in their name, unlike
the other trace events from extent-io-tree.c. So add the prefix to make
them consistent and follow coding style conventions too.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions aren't used outside extent-io-tree.c, but yet one of them
(extent_io_tree_to_inode()) is unnecessarily exported in the header.
Furthermore their single use is in a pattern like this:
if (is_inode_io_tree(tree))
foo(extent_io_tree_to_inode(tree), ...);
So we're effectively unnecessarily adding more indirection, checking
twice if tree->owner == IO_TREE_INODE_IO before getting the inode and
doing a non-inline function call to get tree->inode.
Simplify this by removing these helper functions and instead doing
thing like this:
if (tree->owner == IO_TREE_INODE_IO)
foo(tree->inode, ...);
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add more unlikely annotations to branches that lead to EUCLEAN, overall
in the tree checker this helps to reorder instructions for the no-error
case.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use the following pattern to detect if the folio contains
the end of a file:
if (folio->index == end_index)
folio_zero_range();
But that only works if the folio is page sized.
For the following case, it will not work and leave the range beyond EOF
uninitialized:
The page size is 4K, and the fs block size is also 4K.
16K 20K 24K
| | | |
|
EOF at 22K
And we have a large folio sized 8K at file offset 16K.
In that case, the old "folio->index == end_index" will not work, thus
the range [22K, 24K) will not be zeroed out.
Fix the following call sites which use the above pattern:
- add_ra_bio_pages()
- extent_writepage()
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside functions unlock_delalloc_folio() and lock_delalloc_folios(), we
have the following early exits:
if (index == locked_folio->index && end_index == index)
return;
This allows us to exit early if the range is inside the same locked
folio.
However the current check relies on page sized folios, if we have a large
folio that contains @index but not at @index, then the early exit will
no longer trigger.
Furthermore without the above early check, the existing code can handle it
well, as both __process_folios_contig() and lock_delalloc_folios() will
skip any folio page lock/unlock if it's on the locked folio.
Here we remove the early exits and let the existing code handle the
same index case, to make the code a little simpler.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function itself is already taking large folios into consideration,
just remove the ASSERT(!folio_test_large()) line.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The subpage handling code has two locations not supporting large folios:
- btrfs_attach_subpage()
Which is doing a metadata specific ASSERT() check.
But for the future large data folios support, that check is too
generic. Since it's metadata specific, only check the ASSERT() for
metadata.
- btrfs_subpage_assert()
Just remove the "ASSERT(folio_order(folio) == 0)" check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is doing an ASSERT() checking the folio order, but all
later functions are handling large folios properly, thus we can safely
remove that ASSERT().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only blockage is the ASSERT() rejecting large folios, just remove
it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_page_mkwrite() has an explicit ASSERT() checking the
folio order.
To make it support large data folios, we need to:
- Remove the ASSERT(folio_order(folio) == 0)
- Use folio_contains() to check if the folio covers the last page
Otherwise the code is already supporting large folios well.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently put_file_data() can only accept a page sized folio. However
the function itself is not that complex, it's just copying data from
filemap folio into the send buffer.
Make it support large data folios:
- Change the loop to use file offset instead of page index
- Calculate @pg_offset and @cur_len after getting the folio
- Remove the "WARN_ON(folio_order(folio));" line
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The again label is here to retry to get the folio for the current index.
When triggering that label, there is no advance of the iterator.
So it can be replaced by a simple "continue" and remove the again label.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The most trivial pattern for the auto freeing when the variable is
declared with the macro and the final btrfs_free_path() is removed.
There are almost none goto -> return conversions and there's no other
function cleanup.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pointless to check if the current record's start offset is greater
than the end offset, as before we just tested if it was greater than the
start offset - and if it's not it means it's less than or equal to the
start offset, so it can not be greater than the end offset, as our start
offset is always smaller than the end offset.
So remove that check and also add an assertion to verify the start offset
is smaller then the end offset.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The overflow detection for the start offset of the next record is not
really necessary, we can just stop iterating if the current record ends at
or after out end offset. This removes the need to test if the current
record end offset is (u64)-1 and to check if adding 1 to the current
end offset results in 0.
By testing only if the current record ends at or after the end offset, we
also don't need anymore to test the new start offset at the head of the
while loop.
This makes both the source code and assembly code simpler, more efficient
and shorter (reducing the object text size).
Also remove the pointless initialization to NULL of the state variable, as
we don't use it before the first assignment to it. This may help avoid
some warnings with clang tools such as the one reported/fixed by commit
966de47ff0 ("btrfs: remove redundant initialization of variables in
log_new_ancestors").
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree_search() function always returns an entry that either contains
the search offset or the first entry in the tree that starts after the
offset. So checking at find_first_extent_bit_state() if the returned
entry ends at or after the search offset is pointless. Remove the check.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several things wrong with the documentation:
1) At the top it's only mentioned that we search for an entry containing
the given offset, but when such entry does not exists we search for
the first entry that starts and ends after that offset;
2) It mentions that @node_ret and @parent_ret aren't changed if the
returned entry contains the given offset - that is true only if the
returned entry starts exactly at @offset, otherwise those arguments
are changed;
3) It mentions that if no entry containing offset is found then we return
the first entry ending before the offset - that is not true, we return
the first entry that starts and ends after that offset;
4) It also mentions that NULL is never returned. This is false as in case
there's no entry containing offset or any entry that starts and ends
after offset, NULL is returned.
So fix the documentation.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of keeping track of the minimum start offset of the next record
and detecting overflow every time we update that offset to be the sum of
current record's end offset plus one, we can simply exit when the current
record ends at or beyond our end offset and forget about updating the
start offset on every iteration and testing for it at the top of the loop.
This makes both the source code and assembly code simpler, more efficient
and shorter (reducing the object text size).
Also remove the pointless initialization to NULL of the state variable, as
we don't use it before the first assignment to it. This may help avoid
some warnings with clang tools such as the one reported/fixed by commit
966de47ff0 ("btrfs: remove redundant initialization of variables in
log_new_ancestors").
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several places are using clear_extent_bit() and passing a NULL value for
the 'cached' argument, which is pointless as they can use instead
clear_extent_bits().
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using __clear_extent_bit() we can use clear_extent_bits() since
we pass a NULL value for the cached and changeset arguments.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using __clear_extent_bit() we can use clear_extent_bit() since
we pass a NULL value for the changeset argument.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG WITH EXPERIMENTAL LARGE FOLIOS]
When testing the experimental large data folio support with compression,
there are several ASSERT()s triggered from btrfs_decompress_buf2page()
when running fsstress with compress=zstd mount option:
- ASSERT(copy_len) from btrfs_decompress_buf2page()
- VM_BUG_ON(offset + len > PAGE_SIZE) from memcpy_to_page()
[CAUSE]
Inside btrfs_decompress_buf2page(), we need to grab the file offset from
the current bvec.bv_page, to check if we even need to copy data into the
bio.
And since we're using single page bvec, and no large folio, every page
inside the folio should have its index properly setup.
But when large folios are involved, only the first page (aka, the head
page) of a large folio has its index properly initialized.
The other pages inside the large folio will not have their indexes
properly initialized.
Thus the page_offset() call inside btrfs_decompress_buf2page() will
result garbage, and completely screw up the @copy_len calculation.
[FIX]
Instead of using page->index directly, go with page_pgoff(), which can
handle non-head pages correctly.
So introduce a helper, file_offset_from_bvec(), to get the file offset
from a single page bio_vec, so the copy_len calculation can be done
correctly.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify conditionally reading an rb_entry(), there's the
rb_entry_safe() helper that checks the node pointer for NULL so we don't
have to write it explicitly.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Allow get_range_bits() to take an extent state pointer to pointer argument
so that we can cache the first extent state record in the target range, so
that a caller can use it for subsequent operations without doing a full
tree search. Currently the only user is try_release_extent_state(), which
then does a call to __clear_extent_bit() which can use such a cached state
record.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the release_folio callback (from struct address_space_operations) is
invoked we don't allow the folio to be released if its range is currently
locked in the inode's io_tree, as it may indicate the folio may be needed
by the task that locked the range.
However if the range is locked because an ordered extent is finishing,
then we can safely allow the folio to be released because ordered extent
completion doesn't need to use the folio at all.
When we are under memory pressure, the kernel starts writeback of dirty
pages (folios) with the goal of releasing the pages from the page cache
after writeback completes, however this often is not possible on btrfs
because:
* Once the writeback completes we queue the ordered extent completion;
* Once the ordered extent completion starts, we lock the range in the
inode's io_tree (at btrfs_finish_one_ordered());
* If the release_folio callback is called while the folio's range is
locked in the inode's io_tree, we don't allow the folio to be
released, so the kernel has to try to release memory elsewhere,
which may result in triggering more writeback or releasing other
pages from the page cache which may be more useful to have around
for applications.
In contrast, when the release_folio callback is invoked after writeback
finishes and before ordered extent completion starts or locks the range,
we allow the folio to be released, as well as when the release_folio
callback is invoked after ordered extent completion unlocks the range.
Improve on this by detecting if the range is locked for ordered extent
completion and if it is, allow the folio to be released. This detection
is achieved by adding a new extent flag in the io_tree that is set when
the range is locked during ordered extent completion.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Drop reference to pages from the comment since the function is fully folio
aware and works regardless of how many pages are in the folio. Also while
at it, capitalize the first word and make it more explicit that
release_folio is a callback from struct address_space_operations.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_punch_hole_lock_range() needs to make sure there is
no other folio in the range, thus it goes with filemap_range_has_page(),
which works pretty fine.
But if we have large folios, under the following case
filemap_range_has_page() will always return true, forcing
btrfs_punch_hole_lock_range() to do a very time consuming busy loop:
start end
| |
|//|//|//|//| | | | | | | | |//|//|
\ / \ /
Folio A Folio B
In the above case, folio A and B contain our start/end indexes, and there
are no other folios in the range. Thus we do not need to retry inside
btrfs_punch_hole_lock_range().
To prepare for large data folios, introduce a helper,
check_range_has_page(), which will:
- Shrink the search range towards page boundaries
If the rounded down end (exclusive, otherwise it can underflow when @end
is inside the folio at file offset 0) is no larger than the rounded up
start, it means the range contains no other pages other than the ones
covering @start and @end.
Can return false directly in that case.
- Grab all the folios inside the range
- Skip any large folios that cover the start and end indexes
- If any other folios are found return true
- Otherwise return false
This new helper is going to handle both large folios and regular ones.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This involves the following modifications:
- Set the order flags for __filemap_get_folio() inside
prepare_one_folio()
This will allow __filemap_get_folio() to create a large folio if the
address space supports it.
- Limit the initial @write_bytes inside copy_one_range()
If the largest folio boundary splits the initial write range, there is
no way we can write beyond the largest folio boundary.
This is done by a simple helper calc_write_bytes().
- Release exceeding reserved space if the folio is smaller than expected
Which is doing the same handling when short copy happens.
All the preparations should not change the behavior when the largest
folio order is 0.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several things not ideal in copy_one_range():
- Unnecessary temporary variables
* block_offset
* reserve_bytes
* dirty_blocks
* num_blocks
* release_bytes
These are utilized to handle short-copy cases.
- Inconsistent handling of btrfs_delalloc_release_extents()
There is a hidden behavior that, after reserving metadata for X bytes
of data write, we have to call btrfs_delalloc_release_extents() with X
once and only once.
Calling btrfs_delalloc_release_extents(X - 4K) and
btrfs_delalloc_release_extents(4K) will cause outstanding extents
accounting to go wrong.
This is because the outstanding extents mechanism is not designed to
handle shrinking of reserved space.
Improve above situations by:
- Use a single @reserved_start and @reserved_len pair
Now we reserve space for the initial range, and if a short copy
happened and we need to shrink the reserved space, we can easily
calculate the new length, and update @reserved_len.
- Introduce helpers to shrink reserved data and metadata space
This is done by two new helpers, shrink_reserved_space() and
btrfs_delalloc_shrink_extents().
The later will do a better calculation if we need to modify the
outstanding extents, and the first one will be utilized inside
copy_one_range().
- Manually unlock, release reserved space and return if no byte is
copied
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The EXTENT_UPTODATE io tree flag is now used only to mark ranges in the
fs_info->excluded_extents as used by super blocks and not available for
extent allocation (to prevent adding those ranges as free space in the
in memory space caches). As we can use any flag for that purpose, and
we are using EXTENT_DIRTY for the pinned extents io tree for example,
remove the EXTENT_UPTODATE flag and use instead EXTENT_DIRTY for the
excluded extents io tree.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_add_new_free_space() we keep searching for ranges in the excluded
extents io tree that have the EXTENT_DIRTY bit set, however we never ever
set that bit for ranges in that tree. That is a leftover from when that
function used the global freed extents trees (fs_info->freed_extents[2]),
where we used both the EXTENT_DIRTY and EXTENT_UPTODATE bits, but those
trees are gone with commit fe119a6eeb ("btrfs: switch to per-transaction
pinned extents"), which introduced the fs_info->excluded_extents io tree,
where only EXTENT_UPTODATE is set.
So remove the EXTENT_DIRTY bit search at btrfs_add_new_free_space().
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After commit 52b029f427 ("btrfs: remove unnecessary EXTENT_UPTODATE
state in buffered I/O path") we never set EXTENT_UPTODATE in an inode's
io_tree anymore, but we still have some code attempting to clear that
bit from an inode's io_tree. Remove that code as it doesn't do anything
anymore. The sole use of the EXTENT_UPTODATE bit is for the excluded
extents io_tree (fs_info->excluded_extents), which is used to track the
locations of super blocks, so that their ranges are never marked as free,
making them unavailable for extent allocation.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fsync a file (or directory) that has no more hard links, because
while a process had a file descriptor open on it, the file's last hard
link was removed and then the process did an fsync against the file
descriptor, after a power failure or crash the file still exists after
replaying the log.
This behaviour is incorrect since once an inode has no more hard links
it's not accessible anymore and we insert an orphan item into its
subvolume's tree so that the deletion of all its items is not missed in
case of a power failure or crash.
So after log replay the file shouldn't exist anymore, which is also the
behaviour on ext4, xfs, f2fs and other filesystems.
Fix this by not ignoring inodes with zero hard links at
btrfs_log_inode_parent() and by committing an inode's delayed inode when
we are not doing a fast fsync (either BTRFS_INODE_COPY_EVERYTHING or
BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode's runtime flags). This
last step is necessary because when removing the last hard link we don't
delete the corresponding ref (or extref) item, instead we record the
change in the inode's delayed inode with the BTRFS_DELAYED_NODE_DEL_IREF
flag, so that when the delayed inode is committed we delete the ref/extref
item from the inode's subvolume tree - otherwise the logging code will log
the last hard link and therefore upon log replay the inode is not deleted.
The base code for a fstests test case that reproduces this bug is the
following:
. ./common/dmflakey
_require_scratch
_require_dm_target flakey
_require_mknod
_scratch_mkfs >>$seqres.full 2>&1 || _fail "mkfs failed"
_require_metadata_journaling $SCRATCH_DEV
_init_flakey
_mount_flakey
touch $SCRATCH_MNT/foo
# Commit the current transaction and persist the file.
_scratch_sync
# A fifo to communicate with a background xfs_io process that will
# fsync the file after we deleted its hard link while it's open by
# xfs_io.
mkfifo $SCRATCH_MNT/fifo
tail -f $SCRATCH_MNT/fifo | \
$XFS_IO_PROG $SCRATCH_MNT/foo >>$seqres.full &
XFS_IO_PID=$!
# Give some time for the xfs_io process to open a file descriptor for
# the file.
sleep 1
# Now while the file is open by the xfs_io process, delete its only
# hard link.
rm -f $SCRATCH_MNT/foo
# Now that it has no more hard links, make the xfs_io process fsync it.
echo "fsync" > $SCRATCH_MNT/fifo
# Terminate the xfs_io process so that we can unmount.
echo "quit" > $SCRATCH_MNT/fifo
wait $XFS_IO_PID
unset XFS_IO_PID
# Simulate a power failure and then mount again the filesystem to
# replay the journal/log.
_flakey_drop_and_remount
# We don't expect the file to exist anymore, since it was fsynced when
# it had no more hard links.
[ -f $SCRATCH_MNT/foo ] && echo "file foo still exists"
_unmount_flakey
# success, all done
echo "Silence is golden"
status=0
exit
A test case for fstests will be submitted soon.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's an explanation of how space info works at the top of
fs/btrfs/space-info.c, which makes reference to a variable called
bytes_may_reserve. There's nothing called that in the code, and wasn't
at time the comment was written; as far I can tell this is a typo, and
it should actually be bytes_may_use.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This flag is set after inserting the eb to the buffer tree and cleared
on it's removal. It was added in commit 34b41acec1 ("Btrfs: use a
bit to track if we're in the radix tree") and wanted to make use of it,
faa2dbf004 ("Btrfs: add sanity tests for new qgroup accounting
code"). Both are 10+ years old, we can remove the flag.
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This flag is no longer being used. It was added by commit a826d6dcb3
("Btrfs: check items for correctness as we search") but it's no longer
being used after commit f26c923860 ("btrfs: remove reada
infrastructure").
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This flag is no longer being used. It was added by commit ab0fff0305
("btrfs: add READAHEAD extent buffer flag") and used in commits:
79fb65a1f6 ("Btrfs: don't call readahead hook until we have read the entire eb")
78e62c02ab ("btrfs: Remove extent_io_ops::readpage_io_failed_hook")
371cdc0700 ("btrfs: introduce subpage metadata validation check")
Finally all the code using it was removed by commit f26c923860 ("btrfs: remove
reada infrastructure").
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This flag was added by commit 656f30dba7 ("Btrfs: be aware of btree
inode write errors to avoid fs corruption") but it stopped being used
after commit 046b562b20 ("btrfs: use a separate end_io handler for
read_extent_buffer").
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside the main loop of btrfs_buffered_write() we are doing a lot of
heavy lifting inside a while() loop.
This makes it pretty hard to read, factor out the content into a helper,
copy_one_range() to do the work.
This has no functional change, but with some minor variable renames,
e.g. rename all "sector" into "block".
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside the main loop of btrfs_buffered_write(), we have a complex data
and metadata space reservation code, which tries to reserve space for
a COW write, if failed then fallback to check if we can do a NOCOW
write.
Factor out that part of code into a dedicated helper, reserve_space(),
to make the main loop a little easier to read.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside the main loop of btrfs_buffered_write(), if something wrong
happened, there is a out-of-loop cleanup path to release the reserved
space.
This behavior saves some code lines, but makes it much harder to read,
as we need to check release_bytes to make sure when we need to do the
cleanup.
Factor out the cleanup part into a helper, release_reserved_space(), to
do the cleanup inside the main loop, so that we can move @release_bytes
inside the loop.
This will make later refactoring of the main loop much easier.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit c87c299776 ("btrfs: make buffered write to copy one page a
time") changed how the variable @force_page_uptodate was updated.
Before that commit the variable was only initialized to false at the
beginning of the function, and after hitting a short copy, the next
retry on the same folio would force the folio to be read from the disk.
But after the commit, the variable is always initialized to false at the
beginning of the loop's scope, causing prepare_one_folio() never to get a
true value passed in.
The change in behavior is not a huge deal, it only makes a difference
on how we handle short copies:
Old: Allow the buffer to be split
The first short copy will be rejected, that's the same for both
cases.
But for the next retry, we require the folio to be read from disk.
Then even if we hit a short copy again, since the folio is already
uptodate, we do not need to handle partial uptodate range, and can
continue, marking the short copied range as dirty and continue.
This will split the buffer write into the folio as two buffered
writes.
New: Do not allow the buffer to be split
The first short copy will be rejected, that's the same for both
cases.
For the next retry, we do nothing special, thus if the short copy
happened again, we reject it again, until either the short copy is
gone, or we failed to fault in the buffer.
This will mean the buffer write into the folio will either fail or
succeed, no splitting will happen.
To me, either solution is fine, but the new one makes it simpler and
requires no special handling, so I prefer that solution.
And since @force_page_uptodate is always false when passed into
prepare_one_folio(), we can just remove the variable.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 1d2fbb7f1f ("btrfs: allow compression even if the range is not
page aligned") introduced the block perfect compression for block size <
page size cases.
Before that commit, if the fs block size is smaller than page size (aka
subpage cases), compressed write is only enabled if the dirty range is
fully page aligned.
This block perfect compression support was introduced in v6.13, and has
been tested for two kernel releases.
I believe it's time to move it out of experimental features so that we
can get more tests in the real world.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmglNBkACgkQxWXV+ddt
WDtOeA/+Ifj7fYP6feVya+KF5qLXg4H0x6p+IpoBhgzOyrRFiBR9yPbOADt3MEX4
ATpG7cHhOd8Mxaegbpz6zArHcZqO1VlPWbl+HpVJ6Ji7+N+u+eiHcSFyUT5yFIl7
HLrJ7bxpc8xVLLsPeBOrk3c7LKkiaeAw4EmuMAY70d0oqaMJ5nqSiYFvLislTETR
DaOoInem16WvjfEwHgXXZcfxxjqc/R8WFW1Tud+jJSkrxSQ/V1viP0G06IGq8ucz
cHx7SM9D/myqoHa/dTwx3DeZglcsYQN5tBk0aW3HkylcXLPueFf70cGxzk1mRUw5
zavKJ31mW73zNJs4hIFQiy2rbfyi7g/LuOFlhNT+AbDRX4HDP88+42anVlQl3VdC
FcKL+VEtY5sgfn4kslsyo4fMbNpt0VXA7wy0qOEmHbHdnBgaYTIjqwu1LUnU/eLJ
WQQstUkfuo+pZffaaKsR7S5r5i5xUzYjqHXF9qf1Dju9rEKYbLVtu/T3EVziO1Mc
vdVE2xxdnuf8UTeJ+gJtcyeUJT54SihaR2qm8tErMdILMjSTPmaAQFhtRV14nQTp
upqsJ5gesbc3++VPPmsBgcLP7UL9uN7s6NeRRanj1Zg2bZY8B+zGwhr8/k1ZmR8T
uMr0qFrYx5SVCS2g47FRK6dWrnYgAdT5LaXA5cx02nTynU2hw1o=
=8C8t
-----END PGP SIGNATURE-----
Merge tag 'for-6.15-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix potential endless loop when discarding a block group when
disabling discard
- reinstate message when setting a large value of mount option 'commit'
- fix a folio leak when async extent submission fails
* tag 'for-6.15-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add back warning for mount option commit values exceeding 300
btrfs: fix folio leak in submit_one_async_extent()
btrfs: fix discard worker infinite loop after disabling discard
The Btrfs documentation states that if the commit value is greater than
300 a warning should be issued. The warning was accidentally lost in the
new mount API update.
Fixes: 6941823cc8 ("btrfs: remove old mount API code")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If btrfs_reserve_extent() fails while submitting an async_extent for a
compressed write, then we fail to call free_async_extent_pages() on the
async_extent and leak its folios. A likely cause for such a failure
would be btrfs_reserve_extent() failing to find a large enough
contiguous free extent for the compressed extent.
I was able to reproduce this by:
1. mount with compress-force=zstd:3
2. fallocating most of a filesystem to a big file
3. fragmenting the remaining free space
4. trying to copy in a file which zstd would generate large compressed
extents for (vmlinux worked well for this)
Step 4. hits the memory leak and can be repeated ad nauseam to
eventually exhaust the system memory.
Fix this by detecting the case where we fallback to uncompressed
submission for a compressed async_extent and ensuring that we call
free_async_extent_pages().
Fixes: 131a821a24 ("btrfs: fallback if compressed IO fails for ENOSPC")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Co-developed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the discard worker is running and there's currently only one block
group, that block group is a data block group, it's in the unused block
groups discard list and is being used (it got an extent allocated from it
after becoming unused), the worker can end up in an infinite loop if a
transaction abort happens or the async discard is disabled (during remount
or unmount for example).
This happens like this:
1) Task A, the discard worker, is at peek_discard_list() and
find_next_block_group() returns block group X;
2) Block group X is in the unused block groups discard list (its discard
index is BTRFS_DISCARD_INDEX_UNUSED) since at some point in the past
it become an unused block group and was added to that list, but then
later it got an extent allocated from it, so its ->used counter is not
zero anymore;
3) The current transaction is aborted by task B and we end up at
__btrfs_handle_fs_error() in the transaction abort path, where we call
btrfs_discard_stop(), which clears BTRFS_FS_DISCARD_RUNNING from
fs_info, and then at __btrfs_handle_fs_error() we set the fs to RO mode
(setting SB_RDONLY in the super block's s_flags field);
4) Task A calls __add_to_discard_list() with the goal of moving the block
group from the unused block groups discard list into another discard
list, but at __add_to_discard_list() we end up doing nothing because
btrfs_run_discard_work() returns false, since the super block has
SB_RDONLY set in its flags and BTRFS_FS_DISCARD_RUNNING is not set
anymore in fs_info->flags. So block group X remains in the unused block
groups discard list;
5) Task A then does a goto into the 'again' label, calls
find_next_block_group() again we gets block group X again. Then it
repeats the previous steps over and over since there are not other
block groups in the discard lists and block group X is never moved
out of the unused block groups discard list since
btrfs_run_discard_work() keeps returning false and therefore
__add_to_discard_list() doesn't move block group X out of that discard
list.
When this happens we can get a soft lockup report like this:
[71.957] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [kworker/u4:3:97]
[71.957] Modules linked in: xfs af_packet rfkill (...)
[71.957] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G W 6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
[71.957] Tainted: [W]=WARN
[71.957] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[71.957] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
[71.957] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
[71.957] Code: c1 01 48 83 (...)
[71.957] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
[71.957] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
[71.957] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
[71.957] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
[71.957] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
[71.957] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
[71.957] FS: 0000000000000000(0000) GS:ffff89707bc00000(0000) knlGS:0000000000000000
[71.957] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[71.957] CR2: 00005605bcc8d2f0 CR3: 000000010376a001 CR4: 0000000000770ef0
[71.957] PKRU: 55555554
[71.957] Call Trace:
[71.957] <TASK>
[71.957] process_one_work+0x17e/0x330
[71.957] worker_thread+0x2ce/0x3f0
[71.957] ? __pfx_worker_thread+0x10/0x10
[71.957] kthread+0xef/0x220
[71.957] ? __pfx_kthread+0x10/0x10
[71.957] ret_from_fork+0x34/0x50
[71.957] ? __pfx_kthread+0x10/0x10
[71.957] ret_from_fork_asm+0x1a/0x30
[71.957] </TASK>
[71.957] Kernel panic - not syncing: softlockup: hung tasks
[71.987] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G W L 6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
[71.989] Tainted: [W]=WARN, [L]=SOFTLOCKUP
[71.989] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[71.991] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
[71.992] Call Trace:
[71.993] <IRQ>
[71.994] dump_stack_lvl+0x5a/0x80
[71.994] panic+0x10b/0x2da
[71.995] watchdog_timer_fn.cold+0x9a/0xa1
[71.996] ? __pfx_watchdog_timer_fn+0x10/0x10
[71.997] __hrtimer_run_queues+0x132/0x2a0
[71.997] hrtimer_interrupt+0xff/0x230
[71.998] __sysvec_apic_timer_interrupt+0x55/0x100
[71.999] sysvec_apic_timer_interrupt+0x6c/0x90
[72.000] </IRQ>
[72.000] <TASK>
[72.001] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[72.002] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
[72.002] Code: c1 01 48 83 (...)
[72.005] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
[72.006] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
[72.006] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
[72.007] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
[72.008] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
[72.009] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
[72.010] ? btrfs_discard_workfn+0x51/0x400 [btrfs 23b01089228eb964071fb7ca156eee8cd3bf996f]
[72.011] process_one_work+0x17e/0x330
[72.012] worker_thread+0x2ce/0x3f0
[72.013] ? __pfx_worker_thread+0x10/0x10
[72.014] kthread+0xef/0x220
[72.014] ? __pfx_kthread+0x10/0x10
[72.015] ret_from_fork+0x34/0x50
[72.015] ? __pfx_kthread+0x10/0x10
[72.016] ret_from_fork_asm+0x1a/0x30
[72.017] </TASK>
[72.017] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[72.019] Rebooting in 90 seconds..
So fix this by making sure we move a block group out of the unused block
groups discard list when calling __add_to_discard_list().
Fixes: 2bee7eb8bb ("btrfs: discard one region at a time in async discard")
Link: https://bugzilla.suse.com/show_bug.cgi?id=1242012
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace the code building a bio from a kernel direct map address and
submitting it synchronously with the bdev_rw_virt helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmgaECkACgkQxWXV+ddt
WDsHeA//SCLb1tlI9LEiOuDP7Dk429caxrQwPU/AXPOoUwGT0rNSjmBDLXfIRFHT
gRmI48huDvuVu00wL+wOY9Xs1M5oMkExsAW8nq08MHM2I+sNx+ppojjM5RgpwwCs
QAASTEu4DOhtYrzJ9SPn0jmK8kDadi3fFSNNIJBd5IjpcLIhNiyryU6l7iXq9f7A
pA3EEg7KL4jvciaOsnqE+/nvAd7oT0OtIRkrzPRKnsjJEg5zZEVo/4hUMhbNHVLC
7CuQB6MR79PoTOW8kZL/636FOQqv0XO+luHZEUf26sTuKiTEHgjq2jBymViDibCy
XNNKCnqTmmYCcN4bqIkdDzM5cPZmOchih7eTUUTlpNH3qmtGn0HVx6pmOS+U6lHI
DFRELbo+ry3LikZ8a7sGNcZQcooq7A7FgxggbI37Nbn0M6FxvmbiwfTDvvn6o04H
+Q7+Sdbklb3MnNCa/ebIq+9XewYIoNXCAqnLJxMIj8OzrBtvPWoI5R3/CGe7MYsf
jvEGHQuSLaw39tBJmrypImkoRocK/4hhHzYpGGQ5FNtbcgTEqHNIi+uIjHJlxQfi
9Tg95o2eK/glg+T3WrG/uviSnz5VbIKdj5Ksjw3evC0ihzX61NljMnPIlWEkAHAZ
AIFnx5aQe1FhN9HQMiGenCYg+QuFsHXX3Qbh+2PW6QHbQ0os9Fg=
=oczg
-----END PGP SIGNATURE-----
Merge tag 'for-6.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- revert device path canonicalization, this does not work as intended
with namespaces and is not reliable in all setups
- fix crash in scrub when checksum tree is not valid, e.g. when mounted
with rescue=ignoredatacsums
- fix crash when tracepoint btrfs_prelim_ref_insert is enabled
- other minor fixups:
- open code folio_index(), meant to be used in MM code
- use matching type for sizeof in compression allocation
* tag 'for-6.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: open code folio_index() in btree_clear_folio_dirty_tag()
Revert "btrfs: canonicalize the device path before adding it"
btrfs: avoid NULL pointer dereference if no valid csum tree
btrfs: handle empty eb->folios in num_extent_folios()
btrfs: correct the order of prelim_ref arguments in btrfs__prelim_ref
btrfs: compression: adjust cb->compressed_folios allocation type
The folio_index() helper is only needed for mixed usage of page cache
and swap cache, for pure page cache usage, the caller can just use
folio->index instead.
It can't be a swap cache folio here. Swap mapping may only call into fs
through 'swap_rw' but btrfs does not use that method for swap.
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 7e06de7c83.
Commit 7e06de7c83 ("btrfs: canonicalize the device path before adding
it") tries to make btrfs to use "/dev/mapper/*" name first, then any
filename inside "/dev/" as the device path.
This is mostly fine when there is only the root namespace involved, but
when multiple namespace are involved, things can easily go wrong for the
d_path() usage.
As d_path() returns a file path that is namespace dependent, the
resulted string may not make any sense in another namespace.
Furthermore, the "/dev/" prefix checks itself is not reliable, one can
still make a valid initramfs without devtmpfs, and fill all needed
device nodes manually.
Overall the userspace has all its might to pass whatever device path for
mount, and we are not going to win the war trying to cover every corner
case.
So just revert that commit, and do no extra d_path() based file path
sanity check.
CC: stable@vger.kernel.org # 6.12+
Link: https://lore.kernel.org/linux-fsdevel/20250115185608.GA2223535@zen.localdomain/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When trying read-only scrub on a btrfs with rescue=idatacsums mount
option, it will crash with the following call trace:
BUG: kernel NULL pointer dereference, address: 0000000000000208
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
CPU: 1 UID: 0 PID: 835 Comm: btrfs Tainted: G O 6.15.0-rc3-custom+ #236 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:btrfs_lookup_csums_bitmap+0x49/0x480 [btrfs]
Call Trace:
<TASK>
scrub_find_fill_first_stripe+0x35b/0x3d0 [btrfs]
scrub_simple_mirror+0x175/0x290 [btrfs]
scrub_stripe+0x5f7/0x6f0 [btrfs]
scrub_chunk+0x9a/0x150 [btrfs]
scrub_enumerate_chunks+0x333/0x660 [btrfs]
btrfs_scrub_dev+0x23e/0x600 [btrfs]
btrfs_ioctl+0x1dcf/0x2f80 [btrfs]
__x64_sys_ioctl+0x97/0xc0
do_syscall_64+0x4f/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
[CAUSE]
Mount option "rescue=idatacsums" will completely skip loading the csum
tree, so that any data read will not find any data csum thus we will
ignore data checksum verification.
Normally call sites utilizing csum tree will check the fs state flag
NO_DATA_CSUMS bit, but unfortunately scrub does not check that bit at all.
This results in scrub to call btrfs_search_slot() on a NULL pointer
and triggered above crash.
[FIX]
Check both extent and csum tree root before doing any tree search.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
num_extent_folios() unconditionally calls folio_order() on
eb->folios[0]. If that is NULL this will be a segfault. It is reasonable
for it to return 0 as the number of folios in the eb when the first
entry is NULL, so do that instead.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation for making the kmalloc() family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)
The assigned type is "struct folio **" but the returned type will be
"struct page **". These are the same allocation size (pointer size), but
the types don't match. Adjust the allocation type to match the assignment.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmgSLb8ACgkQxWXV+ddt
WDsHZA//cqlq2zGs5dqRYhPFz5wwKqJcRKcJe2ag4x/Du18SJ5ZXMazlYcVfTZ18
7Wo2Bmk5cVUb83u/vbyA01FaqD8pYvEU/fLn6NY4YQfs9AIc/Ek/DexWmjoCe1aF
fxWoPPACl11jm6crUC5U/KtudZhDS4ALtCE+6GrbWamvnbG+BZjxzACzISU4jvHS
BVdXgf9Ogx6hk++b2rhMOsp2C807vnPwFJLwV8CAQQiSzRAlDUMM75P6fduN69if
nR/jxURojEX+x14k4kPO33vVA5ffblB6t15Ws/OtlFEtnU90kJShxTwHvDOgs0B/
d8Iu+9Rt0+vPbMb+GLQZBMCT24n0/67PCEJ0Y7R9y5/4Q65y2paWXihTDQBhJ/YO
GhbajDcRLrZ+WWO3kjrmePyGkz6AxmiAnnE75VcNpYRtO6CT89UhCvxGWCGqBdlr
2G7FY/snCOP1UdL0YyU46OZ7fCMjRpRxSJuDi1jxyrdW2PuOjlQX68LlNbFeERab
QU1QYNlwuck0GrsnVWKaS7lD7wKLPD53kXFUVZfLfTD7qzTzX3nHBxbM/P2dOBeO
0rx1JQdgBTPg60DHwnFRwYRgKGohwpW57/JAadqxy70RkHPquJayqWbkIeIm/4Sp
Kt4yHBGiN2EIHGMxyEAqia7Zrc8GkedC1S6DU7FOn/VWbQyiARM=
=HHoC
-----END PGP SIGNATURE-----
Merge tag 'for-6.15-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix potential inode leak in iget() after memory allocation failure
- in subpage mode, fix extent buffer bitmap iteration when writing out
dirty sectors
- fix range calculation when falling back to COW for a NOCOW file
* tag 'for-6.15-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: adjust subpage bit start based on sectorsize
btrfs: fix the inode leak in btrfs_iget()
btrfs: fix COW handling in run_delalloc_nocow()
When running machines with 64k page size and a 16k nodesize we started
seeing tree log corruption in production. This turned out to be because
we were not writing out dirty blocks sometimes, so this in fact affects
all metadata writes.
When writing out a subpage EB we scan the subpage bitmap for a dirty
range. If the range isn't dirty we do
bit_start++;
to move onto the next bit. The problem is the bitmap is based on the
number of sectors that an EB has. So in this case, we have a 64k
pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
bits for every node. With a 64k page size we end up with 4 nodes per
page.
To make this easier this is how everything looks
[0 16k 32k 48k ] logical address
[0 4 8 12 ] radix tree offset
[ 64k page ] folio
[ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
[ | | | | | | | | | | | | | | | | ] bitmap
Now we use all of our addressing based on fs_info->sectorsize_bits, so
as you can see the above our 16k eb->start turns into radix entry 4.
When we find a dirty range for our eb, we correctly do bit_start +=
sectors_per_node, because if we start at bit 0, the next bit for the
next eb is 4, to correspond to eb->start 16k.
However if our range is clean, we will do bit_start++, which will now
put us offset from our radix tree entries.
In our case, assume that the first time we check the bitmap the block is
not dirty, we increment bit_start so now it == 1, and then we loop
around and check again. This time it is dirty, and we go to find that
start using the following equation
start = folio_start + bit_start * fs_info->sectorsize;
so in the case above, eb->start 0 is now dirty, and we calculate start
as
0 + 1 * fs_info->sectorsize = 4096
4096 >> 12 = 1
Now we're looking up the radix tree for 1, and we won't find an eb.
What's worse is now we're using bit_start == 1, so we do bit_start +=
sectors_per_node, which is now 5. If that eb is dirty we will run into
the same thing, we will look at an offset that is not populated in the
radix tree, and now we're skipping the writeout of dirty extent buffers.
The best fix for this is to not use sectorsize_bits to address nodes,
but that's a larger change. Since this is a fs corruption problem fix
it simply by always using sectors_per_node to increment the start bit.
Fixes: c4aec299fa ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report that a syzbot reproducer can lead to the following
busy inode at unmount time:
BTRFS info (device loop1): last unmount of filesystem 1680000e-3c1e-4c46-84b6-56bd3909af50
VFS: Busy inodes after unmount of loop1 (btrfs)
------------[ cut here ]------------
kernel BUG at fs/super.c:650!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 48168 Comm: syz-executor Not tainted 6.15.0-rc2-00471-g119009db2674 #2 PREEMPT(full)
Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:generic_shutdown_super+0x2e9/0x390 fs/super.c:650
Call Trace:
<TASK>
kill_anon_super+0x3a/0x60 fs/super.c:1237
btrfs_kill_super+0x3b/0x50 fs/btrfs/super.c:2099
deactivate_locked_super+0xbe/0x1a0 fs/super.c:473
deactivate_super fs/super.c:506 [inline]
deactivate_super+0xe2/0x100 fs/super.c:502
cleanup_mnt+0x21f/0x440 fs/namespace.c:1435
task_work_run+0x14d/0x240 kernel/task_work.c:227
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
syscall_exit_to_user_mode+0x269/0x290 kernel/entry/common.c:218
do_syscall_64+0xd4/0x250 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
[CAUSE]
When btrfs_alloc_path() failed, btrfs_iget() directly returned without
releasing the inode already allocated by btrfs_iget_locked().
This results the above busy inode and trigger the kernel BUG.
[FIX]
Fix it by calling iget_failed() if btrfs_alloc_path() failed.
If we hit error inside btrfs_read_locked_inode(), it will properly call
iget_failed(), so nothing to worry about.
Although the iget_failed() cleanup inside btrfs_read_locked_inode() is a
break of the normal error handling scheme, let's fix the obvious bug
and backport first, then rework the error handling later.
Reported-by: Penglei Jiang <superman.xpt@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/20250421102425.44431-1-superman.xpt@gmail.com/
Fixes: 7c855e16ab ("btrfs: remove conditional path allocation in btrfs_read_locked_inode()")
CC: stable@vger.kernel.org # 6.13+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Penglei Jiang <superman.xpt@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In run_delalloc_nocow(), when the found btrfs_key's offset > cur_offset,
it indicates a gap between the current processing region and
the next file extent. The original code would directly jump to
the "must_cow" label, which increments the slot and forces a fallback
to COW. This behavior might skip an extent item and result in an
overestimated COW fallback range.
This patch modifies the logic so that when a gap is detected:
- If no COW range is already being recorded (cow_start is unset),
cow_start is set to cur_offset.
- cur_offset is then advanced to the beginning of the next extent.
- Instead of jumping to "must_cow", control flows directly to
"next_slot" so that the same extent item can be reexamined properly.
The change ensures that we accurately account for the extent gap and
avoid accidentally extending the range that needs to fallback to COW.
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Dave Chen <davechen@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmgHxA4ACgkQxWXV+ddt
WDtK6hAAqoqDYqM/Lt5/CmMJnrtXZBIoxlQUkw3b8946d6LDlmaQb4dOL8+/kuzy
mVhtPf0+WYm4YbchrAHpt2ZLp8s5e9TNbxX88HYJPc2pbjIbuzsnig0Ss7d0OipH
i4RSGxT5Pe0TZRFBQGM1iX+ehFbfOFOPwDBYiHoO9IRakbocZwuPAEAZ/r3v1jVW
YJrbgyF6HQt9/atTMbSO+DERMlCgLmMKQL1f0ciYrTcpAl3ermjV5sSFVFKQZQK7
jSd98NDxwfxAA/30pMFcvDS7SHgB4ZP6YT0CTeTYKQ2OTUgvQRIFCPeAORR4u5IN
n9SCLeFJwmG30zrRaOlSk4/4MHzBzycXr5xJI7TAD7Cko9AYNeWWCFwhbKTu/FxJ
26CGKNXtAOXwiPLwLrUcahok0UDbRmV2/DLrl09ltMvkY/s7hf3zD9WuBaq9DOtk
KlCjgWF/Rk9Qpb4kpLZxJtj9/zaNAyRUQDQH7IzcF4SLHEhf6N6ArhxX0PGhwWwy
B8VBZJz3Y7L8ZxP9R/Y29TO2JCvnIhJCy01Y/zfIXzD7Q4XlcC5fbzt7yoEa4Ogb
HrKG5Rtrq2pn7sUSbXg+Kvpvpqz1tD8Dcx3kQqDqo2LnAI4KSVwyLaBSK66gITv1
TwEqfJDVkt9She2mItc+bssCCm/f3ms7KE7dwdBhf7Y47v+Wjzk=
=+YLw
-----END PGP SIGNATURE-----
Merge tag 'for-6.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- subpage mode fixes:
- access correct object (folio) when looking up bit offset
- fix assertion condition for number of blocks per folio
- fix upper boundary of locking range in hole punch
- zoned fixes:
- fix potential deadlock caught by lockdep when zone reporting and
device freeze run in parallel
- fix zone write pointer mismatch and NULL pointer dereference when
metadata are converted from DUP to RAID1
- fix error handling when reloc inode creation fails
- in tree-checker, unify error code for header level check
- block layer: add helpers to read zone capacity
* tag 'for-6.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: skip reporting zone for new block group
block: introduce zone capacity helper
btrfs: tree-checker: adjust error code for header level check
btrfs: fix invalid inode pointer after failure to create reloc inode
btrfs: zoned: return EIO on RAID1 block group write pointer mismatch
btrfs: fix the ASSERT() inside GET_SUBPAGE_BITMAP()
btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range()
btrfs: subpage: access correct object when reading bitmap start in subpage_calc_start_bit()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmgAzO8ACgkQxWXV+ddt
WDve6g//UWZ24/wLOoFC4u2wwuctnWy5FFOrvk0IqdxWzuSjA1Ou1P4WfD2xlnQv
wFqYk2SIuP68WQhd09Oj1WRQ9SbJIgAwITeryw4lFYq8v1q8xFB5kM0UTLXXlaNH
O342UK7HRW7XfXD9VkcQz5wXQvk0i7pmtZTjiD1QBbWS+qlEc5YQiZnMRlUlQKBw
85JM45iOFwHJLVt+A8ydC1yMdP7xktiVEhlPsjvzqUKs8orquuikxSW5d/WlDc9g
OeOf9pvxSNf3zsAzmwUrEOxsn3fLFFjoaPxDpfn42BsN4FcyIv4l9K9HdkcdzrLY
Gu0QaDVGCb6bXYhioyEzv/mzESQzOTWQUzI2fJrPPquwH9g0dss9uQwOwaOWbfHO
MDF7fBVwnChaC0O8NoKk5H8jQAXxPfAuU1JpypKOORuffTVz7uG3xkK56VJ/kfTh
qgqRImNGTuAu0C0xGdUjngpOfRypDQLQTo58AubLFAWjqD4elOFjanc/6xobYAJi
PnPk132yKxAdR9h4+1YUk1lzaauDinNzErt+vpUQ/g2QL9PtUbp1IG7VF9llGDzO
hqlifRBHcNy7cKNirFX0PYCke8fXrsKC1NbNiAQMjuK7agzg3b/+PW05EFLQv3EU
6CNgukLG8XbfK2F7PMwmno4zUXbA5JA2mxnKr4vRIMrGZVBTcTo=
=HZ/U
-----END PGP SIGNATURE-----
Merge tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- handle encoded read ioctl returning EAGAIN so it does not mistakenly
free the work structure
- escape subvolume path in mount option list so it cannot be wrongly
parsed when the path contains ","
- remove folio size assertions when writing super block to device with
enabled large folios
* tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: remove folio order ASSERT()s in super block writeback path
btrfs: correctly escape subvol in btrfs_show_options()
btrfs: ioctl: don't free iov when btrfs_encoded_read() returns -EAGAIN
The whole tree checker returns EUCLEAN, except the one check in
btrfs_verify_level_key(). This was inherited from the function that was
moved from disk-io.c in 2cac5af165 ("btrfs: move
btrfs_verify_level_key into tree-checker.c") but this should be unified
with the rest.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There was a bug report about a NULL pointer dereference in
__btrfs_add_free_space_zoned() that ultimately happens because a
conversion from the default metadata profile DUP to a RAID1 profile on two
disks.
The stack trace has the following signature:
BTRFS error (device sdc): zoned: write pointer offset mismatch of zones in raid1 profile
BUG: kernel NULL pointer dereference, address: 0000000000000058
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:__btrfs_add_free_space_zoned.isra.0+0x61/0x1a0
RSP: 0018:ffffa236b6f3f6d0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff96c8132f3400 RCX: 0000000000000001
RDX: 0000000010000000 RSI: 0000000000000000 RDI: ffff96c8132f3410
RBP: 0000000010000000 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000000000 R11: 00000000ffffffff R12: 0000000000000000
R13: ffff96c758f65a40 R14: 0000000000000001 R15: 000011aac0000000
FS: 00007fdab1cb2900(0000) GS:ffff96e60ca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000058 CR3: 00000001a05ae000 CR4: 0000000000350ef0
Call Trace:
<TASK>
? __die_body.cold+0x19/0x27
? page_fault_oops+0x15c/0x2f0
? exc_page_fault+0x7e/0x180
? asm_exc_page_fault+0x26/0x30
? __btrfs_add_free_space_zoned.isra.0+0x61/0x1a0
btrfs_add_free_space_async_trimmed+0x34/0x40
btrfs_add_new_free_space+0x107/0x120
btrfs_make_block_group+0x104/0x2b0
btrfs_create_chunk+0x977/0xf20
btrfs_chunk_alloc+0x174/0x510
? srso_return_thunk+0x5/0x5f
btrfs_inc_block_group_ro+0x1b1/0x230
btrfs_relocate_block_group+0x9e/0x410
btrfs_relocate_chunk+0x3f/0x130
btrfs_balance+0x8ac/0x12b0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? __kmalloc_cache_noprof+0x14c/0x3e0
btrfs_ioctl+0x2686/0x2a80
? srso_return_thunk+0x5/0x5f
? ioctl_has_perm.constprop.0.isra.0+0xd2/0x120
__x64_sys_ioctl+0x97/0xc0
do_syscall_64+0x82/0x160
? srso_return_thunk+0x5/0x5f
? __memcg_slab_free_hook+0x11a/0x170
? srso_return_thunk+0x5/0x5f
? kmem_cache_free+0x3f0/0x450
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? syscall_exit_to_user_mode+0x10/0x210
? srso_return_thunk+0x5/0x5f
? do_syscall_64+0x8e/0x160
? sysfs_emit+0xaf/0xc0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? seq_read_iter+0x207/0x460
? srso_return_thunk+0x5/0x5f
? vfs_read+0x29c/0x370
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? syscall_exit_to_user_mode+0x10/0x210
? srso_return_thunk+0x5/0x5f
? do_syscall_64+0x8e/0x160
? srso_return_thunk+0x5/0x5f
? exc_page_fault+0x7e/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fdab1e0ca6d
RSP: 002b:00007ffeb2b60c80 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fdab1e0ca6d
RDX: 00007ffeb2b60d80 RSI: 00000000c4009420 RDI: 0000000000000003
RBP: 00007ffeb2b60cd0 R08: 0000000000000000 R09: 0000000000000013
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffeb2b6343b R14: 00007ffeb2b60d80 R15: 0000000000000001
</TASK>
CR2: 0000000000000058
---[ end trace 0000000000000000 ]---
The 1st line is the most interesting here:
BTRFS error (device sdc): zoned: write pointer offset mismatch of zones in raid1 profile
When a RAID1 block-group is created and a write pointer mismatch between
the disks in the RAID set is detected, btrfs sets the alloc_offset to the
length of the block group marking it as full. Afterwards the code expects
that a balance operation will evacuate the data in this block-group and
repair the problems.
But before this is possible, the new space of this block-group will be
accounted in the free space cache. But in __btrfs_add_free_space_zoned()
it is being checked if it is a initial creation of a block group and if
not a reclaim decision will be made. But the decision if a block-group's
free space accounting is done for an initial creation depends on if the
size of the added free space is the whole length of the block-group and
the allocation offset is 0.
But as btrfs_load_block_group_zone_info() sets the allocation offset to
the zone capacity (i.e. marking the block-group as full) this initial
decision is not met, and the space_info pointer in the 'struct
btrfs_block_group' has not yet been assigned.
Fail creation of the block group and rely on manual user intervention to
re-balance the filesystem.
Afterwards the filesystem can be unmounted, mounted in degraded mode and
the missing device can be removed after a full balance of the filesystem.
Reported-by: 西木野羰基 <yanqiyu01@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAB_b4sBhDe3tscz=duVyhc9hNE+gu=B8CrgLO152uMyanR8BEA@mail.gmail.com/
Fixes: b1934cd606 ("btrfs: zoned: handle broken write pointer on zones")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After enabling large data folios for tests, I hit the ASSERT() inside
GET_SUBPAGE_BITMAP() where blocks_per_folio matches BITS_PER_LONG.
The ASSERT() itself is only based on the original subpage fs block size,
where we have at most 16 blocks per page, thus
"ASSERT(blocks_per_folio < BITS_PER_LONG)".
However the experimental large data folio support will set the max folio
order according to the BITS_PER_LONG, so we can have a case where a large
folio contains exactly BITS_PER_LONG blocks.
So the ASSERT() is too strict, change it to
"ASSERT(blocks_per_folio <= BITS_PER_LONG)" to avoid the false alert.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/004 with 4K fs block size and 64K page size,
sometimes fsstress workload can take 100% CPU for a while, but not long
enough to trigger a 120s hang warning.
[CAUSE]
When such 100% CPU usage happens, btrfs_punch_hole_lock_range() is
always in the call trace.
One example when this problem happens, the function
btrfs_punch_hole_lock_range() got the following parameters:
lock_start = 4096, lockend = 20469
Then we calculate @page_lockstart by rounding up lock_start to page
boundary, which is 64K (page size is 64K).
For @page_lockend, we round down the value towards page boundary, which
result 0. Then since we need to pass an inclusive end to
filemap_range_has_page(), we subtract 1 from the rounded down value,
resulting in (u64)-1.
In the above case, the range is inside the same page, and we do not even
need to call filemap_range_has_page(), not to mention to call it with
(u64)-1 at the end.
This behavior will cause btrfs_punch_hole_lock_range() to busy loop
waiting for irrelevant range to have its pages dropped.
[FIX]
Calculate @page_lockend by just rounding down @lockend, without
decreasing the value by one. So @page_lockend will no longer overflow.
Then exit early if @page_lockend is no larger than @page_lockstart.
As it means either the range is inside the same page, or the two pages
are adjacent already.
Finally only decrease @page_lockend when calling filemap_range_has_page().
Fixes: 0528476b6a ("btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside the macro, subpage_calc_start_bit(), we need to calculate the
offset to the beginning of the folio.
But we're using offset_in_page(), on systems with 4K page size and 4K fs
block size, this means we will always return offset 0 for a large folio,
causing all kinds of errors.
Fix it by using offset_in_folio() instead.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Finish cleaning up the CRC kconfig options by removing the remaining
unnecessary prompts and an unnecessary 'default y', removing
CONFIG_LIBCRC32C, and documenting all the CRC library options.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ/P7QhQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKyoOAQCynFcS1dWuD27S+SdUREmBjMAoZo5M
zdsIvlPv9KLycgD/QX5lXjW3KIYY6jQ8vHUuLVwfDl/JEp4GJS9dLGU+agg=
=0R1T
-----END PGP SIGNATURE-----
Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC cleanups from Eric Biggers:
"Finish cleaning up the CRC kconfig options by removing the remaining
unnecessary prompts and an unnecessary 'default y', removing
CONFIG_LIBCRC32C, and documenting all the CRC library options"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crc: remove CONFIG_LIBCRC32C
lib/crc: document all the CRC library kconfig options
lib/crc: remove unnecessary prompt for CONFIG_CRC_ITU_T
lib/crc: remove unnecessary prompt for CONFIG_CRC_T10DIF
lib/crc: remove unnecessary prompt for CONFIG_CRC16
lib/crc: remove unnecessary prompt for CONFIG_CRC_CCITT
lib/crc: remove unnecessary prompt for CONFIG_CRC32 and drop 'default y'
The family of functions:
lookup_one()
lookup_one_unlocked()
lookup_one_positive_unlocked()
appear designed to be used by external clients of the filesystem rather
than by filesystems acting on themselves as the lookup_one_len family
are used.
They are used by:
btrfs/ioctl - which is a user-space interface rather than an internal
activity
exportfs - i.e. from nfsd or the open_by_handle_at interface
overlayfs - at access the underlying filesystems
smb/server - for file service
They should be used by nfsd (more than just the exportfs path) and
cachefs but aren't.
It would help if the documentation didn't claim they should "not be
called by generic code".
Also the path component name is passed as "name" and "len" which are
(confusingly?) separate by the "base". In some cases the len in simply
"strlen" and so passing a qstr using QSTR() would make the calling
clearer.
Other callers do pass separate name and len which are stored in a
struct. Sometimes these are already stored in a qstr, other times it
easily could be.
So this patch changes these three functions to receive a 'struct qstr *',
and improves the documentation.
QSTR_LEN() is added to make it easy to pass a QSTR containing a known
len.
[brauner@kernel.org: take a struct qstr pointer]
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/r/20250319031545.2999807-2-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>