The warning -Wflex-array-member-not-at-end was introduced in GCC-14, and
we are getting ready to enable it, globally.
Fix the following warning:
fs/btrfs/send.c:181:24: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
and move the declaration of send_ctx::cur_inode_path to the end.
Notice that struct fs_path contains a flexible array member inline_buf,
but also a padding array and a limit calculated for the usable space of
inline_buf (FS_PATH_INLINE_SIZE). It is not the pattern where flexible
array is in the middle of a structure and could potentially overwrite
other members.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parentheses for the unlikely() annotation were put in the wrong
place so it means that the condition is basically never true and the
bounds checking is skipped.
Fixes: aab9458b9f ("btrfs: tree-checker: add inode extref checks")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At the end of btrfs_load_block_group_zone_info() the first thing we do
is to ensure that if the mapping type is not a SINGLE one and there is
no RAID stripe tree, then we return early with an error.
Doing that, though, prevents the code from running the last calls from
this function which are about freeing memory allocated during its
run. Hence, in this case, instead of returning early, we set the ret
value and fall through the rest of the cleanup code.
Fixes: 5906333cc4 ("btrfs: zoned: don't skip block group profile checks on conventional zones")
CC: stable@vger.kernel.org # 6.8+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The intent of btrfs_readahead_expand() was to expand to the length of
the current compressed extent being read. However, "ram_bytes" is *not*
that, in the case where a single physical compressed extent is used for
multiple file extents.
Consider this case with a large compressed extent C and then later two
non-compressed extents N1 and N2 written over C, leaving C1 and C2
pointing to offset/len pairs of C:
[ C ]
[ N1 ][ C1 ][ N2 ][ C2 ]
In such a case, ram_bytes for both C1 and C2 is the full uncompressed
length of C. So starting readahead in C1 will expand the readahead past
the end of C1, past N2, and into C2. This will then expand readahead
again, to C2_start + ram_bytes, way past EOF. First of all, this is
totally undesirable, we don't want to read the whole file in arbitrary
chunks of the large underlying extent if it happens to exist. Secondly,
it results in zeroing the range past the end of C2 up to ram_bytes. This
is particularly unpleasant with fs-verity as it can zero and set
uptodate pages in the verity virtual space past EOF. This incorrect
readahead behavior can lead to verity verification errors, if we iterate
in a way that happens to do the wrong readahead.
Fix this by using em->len for readahead expansion, not em->ram_bytes,
resulting in the expected behavior of stopping readahead at the extent
boundary.
Reported-by: Max Chernoff <git@maxchernoff.ca>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2399898
Fixes: 9e9ff875e4 ("btrfs: use readahead_expand() on compressed extents")
CC: stable@vger.kernel.org # 6.17
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, when building a free space tree at populate_free_space_tree(),
if we are not using the block group tree feature, we always expect to find
block group items (either extent items or a block group item with key type
BTRFS_BLOCK_GROUP_ITEM_KEY) when we search the extent tree with
btrfs_search_slot_for_read(), so we assert that we found an item. However
this expectation is wrong since we can have a new block group created in
the current transaction which is still empty and for which we still have
not added the block group's item to the extent tree, in which case we do
not have any items in the extent tree associated to the block group.
The insertion of a new block group's block group item in the extent tree
happens at btrfs_create_pending_block_groups() when it calls the helper
insert_block_group_item(). This typically is done when a transaction
handle is released, committed or when running delayed refs (either as
part of a transaction commit or when serving tickets for space reservation
if we are low on free space).
So remove the assertion at populate_free_space_tree() even when the block
group tree feature is not enabled and update the comment to mention this
case.
Syzbot reported this with the following stack trace:
BTRFS info (device loop3 state M): rebuilding free space tree
assertion failed: ret == 0 :: 0, in fs/btrfs/free-space-tree.c:1115
------------[ cut here ]------------
kernel BUG at fs/btrfs/free-space-tree.c:1115!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 UID: 0 PID: 6352 Comm: syz.3.25 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
RIP: 0010:populate_free_space_tree+0x700/0x710 fs/btrfs/free-space-tree.c:1115
Code: ff ff e8 d3 (...)
RSP: 0018:ffffc9000430f780 EFLAGS: 00010246
RAX: 0000000000000043 RBX: ffff88805b709630 RCX: fea61d0e2e79d000
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: ffffc9000430f8b0 R08: ffffc9000430f4a7 R09: 1ffff92000861e94
R10: dffffc0000000000 R11: fffff52000861e95 R12: 0000000000000001
R13: 1ffff92000861f00 R14: dffffc0000000000 R15: 0000000000000000
FS: 00007f424d9fe6c0(0000) GS:ffff888125afc000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd78ad212c0 CR3: 0000000076d68000 CR4: 00000000003526f0
Call Trace:
<TASK>
btrfs_rebuild_free_space_tree+0x1ba/0x6d0 fs/btrfs/free-space-tree.c:1364
btrfs_start_pre_rw_mount+0x128f/0x1bf0 fs/btrfs/disk-io.c:3062
btrfs_remount_rw fs/btrfs/super.c:1334 [inline]
btrfs_reconfigure+0xaed/0x2160 fs/btrfs/super.c:1559
reconfigure_super+0x227/0x890 fs/super.c:1076
do_remount fs/namespace.c:3279 [inline]
path_mount+0xd1a/0xfe0 fs/namespace.c:4027
do_mount fs/namespace.c:4048 [inline]
__do_sys_mount fs/namespace.c:4236 [inline]
__se_sys_mount+0x313/0x410 fs/namespace.c:4213
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f424e39066a
Code: d8 64 89 02 (...)
RSP: 002b:00007f424d9fde68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f424d9fdef0 RCX: 00007f424e39066a
RDX: 0000200000000180 RSI: 0000200000000380 RDI: 0000000000000000
RBP: 0000200000000180 R08: 00007f424d9fdef0 R09: 0000000000000020
R10: 0000000000000020 R11: 0000000000000246 R12: 0000200000000380
R13: 00007f424d9fdeb0 R14: 0000000000000000 R15: 00002000000002c0
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
Reported-by: syzbot+884dc4621377ba579a6f@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/68dc3dab.a00a0220.102ee.004e.GAE@google.com/
Fixes: a5ed918285 ("Btrfs: implement the free space B-tree")
CC: <stable@vger.kernel.org> # 6.1.x: 1961d20f6f: btrfs: fix assertion when building free space tree
CC: <stable@vger.kernel.org> # 6.1.x
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reported an ASSERT() triggered inside scrub:
BTRFS info (device loop0): scrub: started on devid 1
assertion failed: !folio_test_partial_kmap(folio) :: 0, in fs/btrfs/scrub.c:697
------------[ cut here ]------------
kernel BUG at fs/btrfs/scrub.c:697!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 6077 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
RIP: 0010:scrub_stripe_get_kaddr+0x1bb/0x1c0 fs/btrfs/scrub.c:697
Call Trace:
<TASK>
scrub_bio_add_sector fs/btrfs/scrub.c:932 [inline]
scrub_submit_initial_read+0xf21/0x1120 fs/btrfs/scrub.c:1897
submit_initial_group_read+0x423/0x5b0 fs/btrfs/scrub.c:1952
flush_scrub_stripes+0x18f/0x1150 fs/btrfs/scrub.c:1973
scrub_stripe+0xbea/0x2a30 fs/btrfs/scrub.c:2516
scrub_chunk+0x2a3/0x430 fs/btrfs/scrub.c:2575
scrub_enumerate_chunks+0xa70/0x1350 fs/btrfs/scrub.c:2839
btrfs_scrub_dev+0x6e7/0x10e0 fs/btrfs/scrub.c:3153
btrfs_ioctl_scrub+0x249/0x4b0 fs/btrfs/ioctl.c:3163
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
---[ end trace 0000000000000000 ]---
Which doesn't make much sense, as all the folios we allocated for scrub
should not be highmem.
[CAUSE]
Thankfully syzbot has a detailed kernel config file, showing that
CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is set to y.
And that debug option will force all folio_test_partial_kmap() to return
true, to improve coverage on highmem tests.
But in our case we really just want to make sure the folios we allocated
are not highmem (and they are indeed not). Such incorrect result from
folio_test_partial_kmap() is just screwing up everything.
[FIX]
Replace folio_test_partial_kmap() to folio_test_highmem() so that we
won't bother those highmem specific debuging options.
Fixes: 5fbaae4b85 ("btrfs: prepare scrub to support bs > ps cases")
Reported-by: syzbot+bde59221318c592e6346@syzkaller.appspotmail.com
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With v6.17-rc kernels, btrfs will always set 'ssd' mount option even if
the block device is not a rotating one:
# cat /sys/block/sdd/queue/rotational
1
# cat /etc/fstab:
LABEL=DATA2 /data2 btrfs rw,relatime,space_cache=v2,subvolid=5,subvol=/,nofail,nosuid,nodev 0 0
# mount
[...]
/dev/sdd on /data2 type btrfs (rw,nosuid,nodev,relatime,ssd,space_cache=v2,subvolid=5,subvol=/)
[CAUSE]
The 'ssd' mount option is set by set_device_specific_options(), and it
expects that if there is any rotating device in the btrfs, it will set
fs_devices::rotating.
However after commit bddf57a707 ("btrfs: delay btrfs_open_devices()
until super block is created"), the device opening is delayed until the
super block is created.
But the timing of set_device_specific_options() is still left as is,
this makes the function be called without any device opened.
Since no device is opened, thus fs_devices::rotating will never be set,
making btrfs incorrectly set 'ssd' mount option.
[FIX]
Only call set_device_specific_options() after btrfs_open_devices().
Also only call set_device_specific_options() after a new mount, if we're
mounting a mounted btrfs, there is no need to set the device specific
mount options again.
Reported-by: HAN Yuwei <hrx@bupt.moe>
Link: https://lore.kernel.org/linux-btrfs/C8FF75669DFFC3C5+5f93bf8a-80a0-48a6-81bf-4ec890abc99a@bupt.moe/
Fixes: bddf57a707 ("btrfs: delay btrfs_open_devices() until super block is created")
CC: stable@vger.kernel.org # 6.17
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On 'btrfs_ioctl_qgroup_assign' we first duplicate the argument as
provided by the user, which is kfree'd in the end. But this was not the
case when allocating memory for 'prealloc'. In this case, if it somehow
failed, then the previous code would go directly into calling
'mnt_drop_write_file', without freeing the string duplicated from the
user space.
Fixes: 4addc1ffd6 ("btrfs: qgroup: preallocate memory before adding a relation")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When starting relocation, at reloc_chunk_start(), if we happen to find
the flag BTRFS_FS_RELOC_RUNNING is already set we return an error
(-EINPROGRESS) to the callers, however the callers call reloc_chunk_end()
which will clear the flag BTRFS_FS_RELOC_RUNNING, which is wrong since
relocation was started by another task and still running.
Finding the BTRFS_FS_RELOC_RUNNING flag already set is an unexpected
scenario, but still our current behaviour is not correct.
Fix this by never calling reloc_chunk_end() if reloc_chunk_start() has
returned an error, which is what logically makes sense, since the general
widespread pattern is to have end functions called only if the counterpart
start functions succeeded. This requires changing reloc_chunk_start() to
clear BTRFS_FS_RELOC_RUNNING if there's a pending cancel request.
Fixes: 907d2710d7 ("btrfs: add cancellable chunk relocation support")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is an instance of -Wformat when targeting 32-bit architectures due
to using a 'size_t' specifier (which is 'unsigned int' for 32-bit
platforms) to print PAGE_SIZE:
In file included from fs/btrfs/compression.h:17,
from fs/btrfs/extent_io.h:15,
from fs/btrfs/locking.h:13,
from fs/btrfs/ctree.h:19,
from fs/btrfs/disk-io.c:22:
fs/btrfs/disk-io.c: In function 'open_ctree':
include/linux/kern_levels.h:5:25: error: format '%zu' expects argument of type 'size_t', but argument 4 has type 'long unsigned int' [-Werror=format=]
...
fs/btrfs/disk-io.c:3398:17: note: in expansion of macro 'btrfs_warn'
3398 | btrfs_warn(fs_info,
| ^~~~~~~~~~
PAGE_SIZE is consistently defined as an 'unsigned long' in
include/vsdo/page.h so use '%lu' to clear up the warning.
Fixes: 98077f7f21 ("btrfs: enable experimental bs > ps support")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_encode_fh() does not properly account for the three
cases it handles.
Before writing to the file handle (fh), the function only returns to the
user BTRFS_FID_SIZE_NON_CONNECTABLE (5 dwords, 20 bytes) or
BTRFS_FID_SIZE_CONNECTABLE (8 dwords, 32 bytes).
However, when a parent exists and the root ID of the parent and the
inode are different, the function writes BTRFS_FID_SIZE_CONNECTABLE_ROOT
(10 dwords, 40 bytes).
If *max_len is not large enough, this write goes out of bounds because
BTRFS_FID_SIZE_CONNECTABLE_ROOT is greater than
BTRFS_FID_SIZE_CONNECTABLE originally returned.
This results in an 8-byte out-of-bounds write at
fid->parent_root_objectid = parent_root_id.
A previous attempt to fix this issue was made but was lost.
https://lore.kernel.org/all/4CADAEEC020000780001B32C@vpn.id2.novell.com/
Although this issue does not seem to be easily triggerable, it is a
potential memory corruption bug that should be fixed. This patch
resolves the issue by ensuring the function returns the appropriate size
for all three cases and validates that *max_len is large enough before
writing any data.
Fixes: be6e8dc0ba ("NFS support for btrfs - v3")
CC: stable@vger.kernel.org # 3.0+
Signed-off-by: Anderson Nascimento <anderson@allelesecurity.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After setting the BTRFS_ROOT_FORCE_COW flag on the root we are doing a
full write barrier, smp_wmb(), but we don't need to, all we need is a
smp_mb__after_atomic(). The use of the smp_wmb() is from the old days
when we didn't use a bit and used instead an int field in the root to
signal if cow is forced. After the int field was changed to a bit in
the root's state (flags field), we forgot to update the memory barrier
in create_pending_snapshot() to smp_mb__after_atomic(), but we did the
change in commit_fs_roots() after clearing BTRFS_ROOT_FORCE_COW. That
happened in commit 27cdeb7096 ("Btrfs: use bitfield instead of integer
data type for the some variants in btrfs_root"). On the reader side, in
should_cow_block(), we also use the counterpart smp_mb__before_atomic()
which generates further confusion.
So change the smp_wmb() to smp_mb__after_atomic(). In fact we don't
even need any barrier at all since create_pending_snapshot() is called
in the critical section of a transaction commit and therefore no one
can concurrently join/attach the transaction, or start a new one, until
the transaction is unblocked. By the time someone starts a new transaction
and enters should_cow_block(), a lot of implicit memory barriers already
took place by having acquired several locks such as fs_info->trans_lock
and extent buffer locks on the root node at least. Nevertlheless, for
consistency use smp_mb__after_atomic() after setting the force cow bit
in create_pending_snapshot().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen.
Transaction abort is one such error, the btrfs_abort_transaction()
inlines code to check the state and print a warning, this ought to be
out of the hot path.
The most common pattern is when transaction abort is called after
checking a return value and the control flow leads to a quick return.
In other cases it may not be necessary to add unlikely() e.g. when the
function returns anyway or the control flow is not changed noticeably.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EIO is one of them.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EUCLEAN (a corruption) is one of them.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Trivial pattern for the auto freeing with goto -> return conversions
if possible.
The following cases are considered trivial in this patch:
1. Cases where there are no operations between btrfs_free_path() and the
function returns.
2. Cases where only simple cleanup operations (such as kfree(), kvfree(),
clear_bit(), and fs_path_free()) are present between
btrfs_free_path() and the function return.
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously BTRFS did not look at a device's reported max_open_zones limit,
but starting with commit 04147d8394 ("btrfs: zoned: limit active zones
to max_open_zones"), zoned BTRFS limited the number of concurrently used
block-groups to the number of max_open_zones a device reported, if it
hadn't already reported a number of max_active_zones.
Starting with commit 04147d8394 the number of open zones is treated the
same way as active zones. But this leads to mount failures on filesystems
which have been used before 04147d8394 because too many zones are in an
open state.
Ignore the new limitations on these filesystems, so zones can be finished
or evacuated.
Reported-by: Yuwei Han <hrx@bupt.moe>
Link: https://lore.kernel.org/all/2F48A90AF7DDF380+1790bcfd-cb6f-456b-870d-7982f21b5eae@bupt.moe/
Fixes: 04147d8394 ("btrfs: zoned: limit active zones to max_open_zones")
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As pointed out in the documentation, calling 'kmalloc' with open-coded
arithmetic can lead to unfortunate overflows and this particular way of
using it has been deprecated. Instead, it's preferred to use
'kmalloc_array' in cases where it might apply so an overflow check is
performed.
Note this is an API cleanup and is not fixing any overflows because in
all cases the multipliers are bounded small numbers derived from number
of items in leaves/nodes.
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With all the preparation patches, we're able to finally enable btrfs
block size (sector size) larger than page size support and give it a
full fstests run.
And obviously this new feature is hidden behind experimental flags, and
should not be considered as a core feature yet as btrfs' default block
size is still 4K.
But this is still a feature that will shine in the future where 16K
block sized device are widely adopted.
For now there are some features explicitly disabled:
- Direct IO
This is the most complex part to support, the root reason is we can
not control the pages of iov iter passed in.
User space programs can only ensure the virtual addresses are
contiguous, but have no control on their physical addresses.
Our bs > ps support heavily relies on large folios, and direct IO
memory can easily break it.
So direct IO is disabled and will always fall back to buffered IO.
- RAID56
In theory we can convert RAID56 to use large folios, but it will need
to be converted back to page based if we want to support direct IO in
the future.
So just reject it for now.
- Encoded send
- Encoded read
Both are utilizing btrfs_encoded_read_regular_fill_pages(), and send
is utilizing vmallocated memory.
Unfortunately for vmallocated memory we can not guarantee the minimal
folio order.
For send, it will just always fallback to regular writes, which reads
from page cache and will follow the existing folio order requirement.
- Encoded write
Encoded write itself is allocating pages by themselves, and we can
easily change it to follow the minimal order.
But since encoded read is already disabled, there is no need to only
enable encoded write.
Finally just like what we did for bs < ps support in the past, add a
warning message for bs > ps mounts.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs uses btrfs_bio to handle read/write of logical address, for the
incoming bs > ps support, btrfs has extra requirements:
- One folio must contain at least one fs block
- No fs block can cross folio boundaries
This requirement is not hard to maintain, thanks to the address space's
minimal folio order.
But not all btrfs bios are generated through address space, e.g.
compression and scrub.
To catch possible unaligned bios, introduce a helper,
assert_bbio_alginment(), for each btrfs_bio in btrfs_submit_bbio().
This will check the following things:
- bv_offset is aligned to block size
- bv_len is aligned to block size
With a btrfs bio passing above checks, unless it's empty it will ensure
the requirements for bs > ps support.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG DURING BS > PS TEST]
When running the following script on a btrfs whose block size is larger
than page size, e.g. 8K block size and 4K page size, it will trigger a
kernel BUG:
# mkfs.btrfs -s 8k $dev
# mount $dev $mnt
# mkdir $mnt/dir
# ln -s dir $mnt/link
# ls $mnt/link
The call trace looks like this:
BTRFS warning (device dm-2): support for block size 8192 with page size 4096 is experimental, some features may be missing
BTRFS info (device dm-2): checking UUID tree
BTRFS info (device dm-2): enabling ssd optimizations
BTRFS info (device dm-2): enabling free space tree
------------[ cut here ]------------
kernel BUG at /home/adam/linux/include/linux/highmem.h:275!
Oops: invalid opcode: 0000 [#1] SMP
CPU: 8 UID: 0 PID: 667 Comm: ls Tainted: G OE 6.17.0-rc4-custom+ #283 PREEMPT(full)
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:zero_user_segments.constprop.0+0xdc/0xe0 [btrfs]
Call Trace:
<TASK>
btrfs_get_extent.cold+0x85/0x101 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
btrfs_do_readpage+0x244/0x750 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
btrfs_read_folio+0x9c/0x100 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
filemap_read_folio+0x37/0xe0
do_read_cache_folio+0x94/0x3e0
__page_get_link.isra.0+0x20/0x90
page_get_link+0x16/0x40
step_into+0x69b/0x830
path_lookupat+0xa7/0x170
filename_lookup+0xf7/0x200
? set_ptes.isra.0+0x36/0x70
vfs_statx+0x7a/0x160
do_statx+0x63/0xa0
__x64_sys_statx+0x90/0xe0
do_syscall_64+0x82/0xae0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
Please note bs > ps support is still under development and the
enablement patch is not even in btrfs development branch.
[CAUSE]
Btrfs reuses its data folio read path to handle symbolic links, as the
symbolic link target is stored as an inline data extent.
But for newly created inodes, btrfs only set the minimal order if the
target inode is a regular file.
Thus for above newly created symbolic link, it doesn't properly respect
the minimal folio order, and triggered the above crash.
[FIX]
Call btrfs_set_inode_mapping_order() unconditionally inside
btrfs_create_new_inode().
For symbolic links this will fix the crash as now the folio will meet
the minimal order.
For regular files this brings no change.
For directory/bdev/char and all the other types of inodes, they won't
go through the data read path, thus no effect either.
Fixes: cc38d178ff ("btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This involves:
- Migrate scrub_stripe::pages[] to folios[]
- Use btrfs_alloc_folio_array() and folio_put() to alloc above array.
- Migrate scrub_stripe_get_kaddr() and scrub_stripe_get_paddr() to use
folio interfaces
- Migrate raid56_parity_cache_data_pages() to
raid56_parity_cache_data_folios()
Since scrub is the only caller still using pages.
This helper will copy the folio array contents into rbio::stripe_pages,
with sector uptodate flags updated.
And a new ASSERT() to make sure bs > ps cases will not hit this path.
Since most scrub code is based on kaddr/paddr, the migration itself is
pretty straightforward.
And since we're here, also move the loop to set the
stripe_sectors[].uptodate out of the copy loop.
As we always mark all the sectors as uptodate for the data stripe, it's
easier to do in one go, other than doing it inside the copy loop.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This involves converting the following functions to use correct folio
sizes/shifts:
- zlib_compress_folios()
- zlib_decompress_bio()
There is a special handling for s390 hardware acceleration.
With bs > ps cases, we can go with 16K block size on s390 (which uses
fixed 4K page size).
In that case we do not need to do the buffer copy as our folio is large
enough for hardware acceleration.
So factor out the s390 specific and folio size check into a helper,
need_special_buffer().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This involves converting the following functions to use correct folio
sizes/shifts:
- copy_compress_data_to_page()
- lzo_compress_folios()
- lzo_decompress_bio()
Just like zstd, lzo has some extra incorrect usage of kmap_local_folio()
that the offset is always 0.
This will not handle HIGHMEM large folios correctly, but those cases are
already rejected explicitly so it should not cause problems when bs > ps
support is enabled.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This involves converting the following functions to use proper folio
sizes/shifts:
- zstd_compress_folios()
- zstd_decompress_bio()
The function zstd_decompress() is already using block size correctly
without using page size, thus it needs no modification.
And since zstd compression is calling kmap_local_folio(), the existing
code cannot handle large folios with HIGHMEM, as kmap_local_folio()
requires us to handle one page range each time.
I do not really think it's worth to spend time on some feature that will
be deprecated eventually. So here just add an extra explicit rejection
for bs > ps with HIGHMEM feature enabled kernels.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This includes the following preparation for bs > ps cases:
- Always alloc/free the folio directly if bs > ps
This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus
affecting all compression algorithms.
For btrfs_free_compr_folio() it needs no parameter for now, as we can
use the folio size to skip the caching part.
For now the change is just to passing a @fs_info into the function,
all the folio size assumption is still based on page size.
- Properly zero the last folio in compress_file_range()
Since the compressed folios can be larger than a page, we need to
properly zero the whole folio.
- Use correct folio size for btrfs_add_compressed_bio_folios()
Instead of page size, use the correct folio size.
- Use correct folio size/shift for btrfs_compress_filemap_get_folio()
As we are not only using simple page sized folios anymore.
- Use correct folio size for btrfs_decompress()
There is an ASSERT() making sure the decompressed range is no larger
than a page, which will be triggered for bs > ps cases.
- Skip readahead for compressed pages
Similar to subpage cases.
- Make btrfs_alloc_folio_array() to accept a new @order parameter
- Add a helper to calculate the minimal folio size
All those changes should not affect the existing bs <= ps handling.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With my local branch to enable bs > ps support for btrfs, sometimes I
hit the following ASSERT() inside submit_one_sector():
ASSERT(block_start != EXTENT_MAP_HOLE);
Please note that it's not yet possible to hit this ASSERT() in the wild
yet, as it requires btrfs bs > ps support, which is not even in the
development branch.
But on the other hand, there is also a very low chance to hit above
ASSERT() with bs < ps cases, so this is an existing bug affect not only
the incoming bs > ps support but also the existing bs < ps support.
[CAUSE]
Firstly that ASSERT() means we're trying to submit a dirty block but
without a real extent map nor ordered extent map backing it.
Furthermore with extra debugging, the folio triggering such ASSERT() is
always larger than the fs block size in my bs > ps case.
(8K block size, 4K page size)
After some more debugging, the ASSERT() is trigger by the following
sequence:
extent_writepage()
| We got a 32K folio (4 fs blocks) at file offset 0, and the fs block
| size is 8K, page size is 4K.
| And there is another 8K folio at file offset 32K, which is also
| dirty.
| So the filemap layout looks like the following:
|
| "||" is the filio boundary in the filemap.
| "//| is the dirty range.
|
| 0 8K 16K 24K 32K 40K
| |////////| |//////////////////////||////////|
|
|- writepage_delalloc()
| |- find_lock_delalloc_range() for [0, 8K)
| | Now range [0, 8K) is properly locked.
| |
| |- find_lock_delalloc_range() for [16K, 40K)
| | |- btrfs_find_delalloc_range() returned range [16K, 40K)
| | |- lock_delalloc_folios() locked folio 0 successfully
| | |
| | | The filemap range [32K, 40K) got dropped from filemap.
| | |
| | |- lock_delalloc_folios() failed with -EAGAIN on folio 32K
| | | As the folio at 32K is dropped.
| | |
| | |- loops = 1;
| | |- max_bytes = PAGE_SIZE;
| | |- goto again;
| | | This will re-do the lookup for dirty delalloc ranges.
| | |
| | |- btrfs_find_delalloc_range() called with @max_bytes == 4K
| | | This is smaller than block size, so
| | | btrfs_find_delalloc_range() is unable to return any range.
| | \- return false;
| |
| \- Now only range [0, 8K) has an OE for it, but for dirty range
| [16K, 32K) it's dirty without an OE.
| This breaks the assumption that writepage_delalloc() will find
| and lock all dirty ranges inside the folio.
|
|- extent_writepage_io()
|- submit_one_sector() for [0, 8K)
| Succeeded
|
|- submit_one_sector() for [16K, 24K)
Triggering the ASSERT(), as there is no OE, and the original
extent map is a hole.
Please note that, this also exposed the same problem for bs < ps
support. E.g. with 64K page size and 4K block size.
If we failed to lock a folio, and falls back into the "loops = 1;"
branch, we will re-do the search using 64K as max_bytes.
Which may fail again to lock the next folio, and exit early without
handling all dirty blocks inside the folio.
[FIX]
Instead of using the fixed size PAGE_SIZE as @max_bytes, use
@sectorsize, so that we are ensured to find and lock any remaining
blocks inside the folio.
And since we're here, add an extra ASSERT() to
before calling btrfs_find_delalloc_range() to make sure the @max_bytes is
at least no smaller than a block to avoid false negative.
Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in setting the key's offset to (u64)-1 since we never
use it before setting it to the current transaction's ID. So remove the
assignment of (u64)-1 to the key's offset and move the remainder of the
key initialization close to where it's used.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can annotate btrfs_is_testing() as unlikely since that's the most
expected scenario and it's desirable for the compiler to optimize for
the case we are not running the self tests. So add the annotation to
btrfs_is_testing() and while at it also make it return bool instead of
int.
Also make two of the existing callers use btrfs_is_testing() directly
instead of storing its result in a local variable.
On x86_64 with Debian's gcc 14.2.0-19 this resulted in a very tiny object
code reduction.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1913263 161567 15592 2090422 1fe5b6 fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1913257 161567 15592 2090416 1fe5b0 fs/btrfs/btrfs.ko
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's quite hard and unreadable the way the rule checks are organized in
should_cow_block(). We have a single if statement that returns 0 (false)
and it checks several conditions, with one them being a negated compound
condition which is particularly hard to reason immediately.
Improve on this by using multiple if statements, each checking a single
condition and returning immediately. Also change the return type from an
integer to a boolean, since all we need is to return true or false.
At least on x86_64 with Debian's gcc 14.2.0-19, this also reduces the
object code size by 64 bytes.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1913327 161567 15592 2090486 1fe5f6 fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1913263 161567 15592 2090422 1fe5b6 fs/btrfs/btrfs.ko
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to store the extent's ram_bytes in two variables,
further more one of them, named 'size', is used only for the extent's end
offset calculation. So remove the 'size' variable and use 'nbytes' only.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment is wrong about the part where it says a prealloc extent does
not contribute to an inode's nbytes - it does. Only holes don't contribute
and that's what we are checking for, as prealloc extents always have a
disk_bytenr different from 0. So fix the comment and re-organize the code
to not set nbytes twice and set it to the extent item's number of bytes
only if it doesn't represent a hole - in case it's a hole we have already
initialized nbytes to 0 when we declared it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the support of bs < ps support, extent_writepage_io() will submit
multiple blocks inside the folio.
But if we hit error submitting one sector, but the next sector can still
be submitted successfully, the function extent_writepage_io() will still
return 0.
This will make btrfs to silently ignore the error without setting error
flag for the filemap.
Fix it by recording the first error hit, and always return that value.
Fixes: 8bf334beb3 ("btrfs: fix double accounting race when extent_writepage_io() failed")
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several sanity checks when inserting or extending items in a btree
that verify we didn't overflow the leaf or access a slot beyond the last
one. These are cases that are never expected to be hit so mark them as
unlikely, allowing the compiler to potentially generate better code.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We expect that after attempting to read an extent buffer we had no errors
therefore the extent buffer is up to date, so mark the checks for a not up
to date extent buffer as unlikely and allow the compiler to pontentially
generate better code.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are not expecting to ever fail the extent buffer alignment checks, so
mark them as unlikely to allow the compiler to potentially generate more
optimized code.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of dereferencing fs_info every time we need to access the node
size, store in a local variable to make the code less verbose and avoid
a line split too.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Looking at a leaf dump from the kernel's print-tree implementation is not
so friendly to analyze since key types are printed as numbers. Improve on
this by printing key types as strings that are a diminutive of the macro
names for key types, just like we do in btrfs-progs.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The code for processing file extent items is quite large and it's better
to have it in a dedicated helper rather than in a huge switch statement,
just like we do in btrfs-progs.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are not printing anything about the compression type, so add that
useful information in the same format as btrfs-progs.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are advertising the ram_bytes of an inline extent as its data size, but
that is not true for compressed extents. The ram_bytes corresponds to the
uncompressed data size while the data size (compressed data) is given by
btrfs_file_extent_inline_item_len(). So fix this and print both values in
the same format as in btrfs-progs.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we don't print anything for extent csum items other than the
generic line with the key, item offset and item size. While one can still
determine the range the extent csum covers by doing a few simple
computations, it makes it more time consuming to analyse a leaf dump.
So add a line that prints information about the range covered by the
checksum using the same format as btrfs-progs. This is useful when
debugging log tree issues since we log extent csum items for new extents.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently don't print information about dir log items (other than the
key, item offset and item size), which is useful to look at when debugging
problems with a log tree. So print their specific information (currently
they only have an end index number) in a format similar to btrfs-progs.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we ignore inode extref items, we just print their key, item
offset in the leaf and their size, no information about their content
like the index number, parent inode, name length and name.
Improve on this by printing the index, parent and name length in the same
format as btrfs-progs. Note that we don't print the name, as that would
require some processing and escaping like we do in btrfs-progs, and that
could expose sensitive information for some users in case they share their
dmesg/syslog and it contains a leaf dump. So for now leave names out.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we ignore inode ref items, we just print their key, item offset
in the leaf and their size, no information about their content like the
index number, name length and name.
Improve on this by printing the index and name length in the same format
as btrfs-progs. Note that we don't print the name, as that would require
some processing and escaping like we do in btrfs-progs, and that could
expose sensitive information for some users in case they share their
dmesg/syslog and it contains a leaf dump. So for now leave names out.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we only print the dir items for BTRFS_DIR_ITEM_KEY keys, but
we also have dir items for BTRFS_DIR_INDEX_KEY and BTRFS_XATTR_ITEM_KEY
keys too. So print them for those keys too.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we only print the object id component of the location key from a
dir item and the flags. We are missing the whole key, transid and the name
and data lengths. We are also ignoring the fact that we can have multiple
dir item objects encoded in a single item for a BTRFS_DIR_ITEM_KEY key, so
what we print is only for the first item.
Improve on this by iterating on all dir items and print the missing
information. This is done with the same format as in btrfs-progs, what
we miss is printing the names and data since not only that would require
some processing and escaping like in btrfs-progs, but it would also reveal
information that may be sensitive and users may not want to share that in
case that get a leaf dumped in dmesg.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are not dumping a lot of fields for an inode item which are useful for
debugging whenever we dump a leaf (log replay failure for example), so add
them and make it as close as possible to the print tree implementation in
btrfs-progs (things like converting timespecs to human readable dates and
converting flags to strings are missing since they are not so practical to
do in the kernel).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Like inode refs, inode extrefs have a variable length name, which means
we have to do a proper check to make sure no header nor name can exceed
the item limits.
The check itself is very similar to check_inode_ref(), just a different
structure (btrfs_inode_extref vs btrfs_inode_ref).
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We now have a nodesize_bits member in fs_info so we can index an extent
buffer in the backref cache by node number instead of by sector number.
While this allows for a denser index space with the possibility of using
less maple tree nodes, in practice it's unlikely to hit such benefits
since we currently limit the maximum number of keys in the cache to 128,
so unless all extent buffers are contiguous we are unlikely to see a
memory usage reduction in the backing maple tree due to fewer nodes.
Nevertheless it doesn't cost anything to index by node number and it's
more logical.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>