linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-03-28 10:18:25 +08:00

Author	SHA1	Message	Date
Kyoji Ogasawara	74857fdc5d	btrfs: fix printing of mount info messages for NODATACOW/NODATASUM The NODATASUM message was printed twice by mistake and the NODATACOW was missing from the 'unset' part. Fix the duplication and make the output look the same. Fixes: `eddb1a433f` ("btrfs: add reconfigure callback for fs_context") CC: stable@vger.kernel.org # 6.8+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-13 14:08:58 +02:00
Kyoji Ogasawara	b435ab556b	btrfs: restore mount option info messages during mount After the fsconfig migration in 6.8, mount option info messages are no longer displayed during mount operations because btrfs_emit_options() is only called during remount, not during initial mount. Fix this by calling btrfs_emit_options() in btrfs_fill_super() after open_ctree() succeeds. Additionally, prevent log duplication by ensuring btrfs_check_options() handles validation with warn-level and err-level messages, while btrfs_emit_options() provides info-level messages. Fixes: `eddb1a433f` ("btrfs: add reconfigure callback for fs_context") CC: stable@vger.kernel.org # 6.8+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-13 14:08:45 +02:00
Kyoji Ogasawara	edf842abe4	btrfs: fix incorrect log message for nobarrier mount option Fix a wrong log message that appears when the "nobarrier" mount option is unset. When "nobarrier" is unset, barrier is actually enabled. However, the log incorrectly stated "turning off barriers". Fixes: `eddb1a433f` ("btrfs: add reconfigure callback for fs_context") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-13 14:08:45 +02:00
Naohiro Aota	dc61d97b0b	btrfs: fix buffer index in wait_eb_writebacks() The commit `f2cb97ee96` ("btrfs: index buffer_tree using node size") changed the index of buffer_tree from "start >> sectorsize_bits" to "start >> nodesize_bits". However, the change is not applied for wait_eb_writebacks() and caused IO failures by writing in a full zone. Use the index properly. Fixes: `f2cb97ee96` ("btrfs: index buffer_tree using node size") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-13 14:08:45 +02:00
Naohiro Aota	b1511360c8	btrfs: subpage: keep TOWRITE tag until folio is cleaned btrfs_subpage_set_writeback() calls folio_start_writeback() the first time a folio is written back, and it also clears the PAGECACHE_TAG_TOWRITE tag even if there are still dirty blocks in the folio. This can break ordering guarantees, such as those required by btrfs_wait_ordered_extents(). That ordering breakage leads to a real failure. For example, running generic/464 on a zoned setup will hit the following ASSERT. This happens because the broken ordering fails to flush existing dirty pages before the file size is truncated. assertion failed: !list_empty(&ordered->list) :: 0, in fs/btrfs/zoned.c:1899 ------------[ cut here ]------------ kernel BUG at fs/btrfs/zoned.c:1899! Oops: invalid opcode: 0000 [#1] SMP NOPTI CPU: 2 UID: 0 PID: 1906169 Comm: kworker/u130:2 Kdump: loaded Not tainted 6.16.0-rc6-BTRFS-ZNS+ #554 PREEMPT(voluntary) Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021 Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] RIP: 0010:btrfs_finish_ordered_zoned.cold+0x50/0x52 [btrfs] RSP: 0018:ffffc9002efdbd60 EFLAGS: 00010246 RAX: 000000000000004c RBX: ffff88811923c4e0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff827e38b1 RDI: 00000000ffffffff RBP: ffff88810005d000 R08: 00000000ffffdfff R09: ffffffff831051c8 R10: ffffffff83055220 R11: 0000000000000000 R12: ffff8881c2458c00 R13: ffff88811923c540 R14: ffff88811923c5e8 R15: ffff8881c1bd9680 FS: 0000000000000000(0000) GS:ffff88a04acd0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f907c7a918c CR3: 0000000004024000 CR4: 0000000000350ef0 Call Trace: <TASK> ? srso_return_thunk+0x5/0x5f btrfs_finish_ordered_io+0x4a/0x60 [btrfs] btrfs_work_helper+0xf9/0x490 [btrfs] process_one_work+0x204/0x590 ? srso_return_thunk+0x5/0x5f worker_thread+0x1d6/0x3d0 ? __pfx_worker_thread+0x10/0x10 kthread+0x118/0x230 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x205/0x260 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Consider process A calling writepages() with WB_SYNC_NONE. In zoned mode or for compressed writes, it locks several folios for delalloc and starts writing them out. Let's call the last locked folio folio X. Suppose the write range only partially covers folio X, leaving some pages dirty. Process A calls btrfs_subpage_set_writeback() when building a bio. This function call clears the TOWRITE tag of folio X, whose size = 8K and the block size = 4K. It is following state. 0 4K 8K \|/////\|/////\| (flag: DIRTY, tag: DIRTY) <-----> Process A will write this range. Now suppose process B concurrently calls writepages() with WB_SYNC_ALL. It calls tag_pages_for_writeback() to tag dirty folios with PAGECACHE_TAG_TOWRITE. Since folio X is still dirty, it gets tagged. Then, B collects tagged folios using filemap_get_folios_tag() and must wait for folio X to be written before returning from writepages(). 0 4K 8K \|/////\|/////\| (flag: DIRTY, tag: DIRTY\|TOWRITE) However, between tagging and collecting, process A may call btrfs_subpage_set_writeback() and clear folio X's TOWRITE tag. 0 4K 8K \| \|/////\| (flag: DIRTY\|WRITEBACK, tag: DIRTY) As a result, process B won't see folio X in its batch, and returns without waiting for it. This breaks the WB_SYNC_ALL ordering requirement. Fix this by using btrfs_subpage_set_writeback_keepwrite(), which retains the TOWRITE tag. We now manually clear the tag only after the folio becomes clean, via the xas operation. Fixes: `3470da3b7d` ("btrfs: subpage: introduce helpers for writeback status") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-13 14:08:45 +02:00
Qu Wenruo	1f3d56db69	btrfs: clear TAG_TOWRITE from buffer tree when submitting a tree block [POSSIBLE BUG] After commit `5e121ae687` ("btrfs: use buffer xarray for extent buffer writeback operations"), we have a dedicated xarray for extent buffers, and a lot of tags are migrated to that buffer tree, like PAGECACHE_TAG_TOWRITE/DIRTY/WRITEBACK. This frees us from the limits of page flags, but there is a new asymmetric behavior, we call buffer_tree_tag_for_writeback() to set PAGECACHE_TAG_TOWRITE for the involved ranges, but there is no one to clear that tag. Before that rework, we relied on the page cache tag which was cleared when folio_start_writeback() was called. Although this has its own problems (e.g. the first one calling folio_start_writeback() will clear the tag for the whole page), it at least cleared the tag. But now our real tags are stored in the buffer tree, no one is really clearing the PAGECACHE_TAG_TOWRITE tag now. [FIX] Thankfully this is not going to cause any real bug, but just some inefficiency iterating the extent buffers. As if we hit an extent buffer which is not dirty but still has the PAGECACHE_TAG_TOWRITE tag, lock_extent_buffer_for_io() will skip it so we won't writeback the extent buffer again. To properly fix the inefficiency, just clear the PAGECACHE_TAG_TOWRITE inside lock_extent_buffer_for_io(). There is no error path between lock_extent_buffer_for_io() and write_one_eb(), so we're safe to clear the tag there. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-13 14:08:45 +02:00
Filipe Manana	f022499f24	btrfs: do not set mtime/ctime to current time when unlinking for log replay If we are doing an unlink for log replay, we are updating the directory's mtime and ctime to the current time, and this is incorrect since it should stay with the mtime and ctime that were set when the directory was logged. This is the same as when adding a link to an inode during log replay (with btrfs_add_link()), where we want the mtime and ctime to be the values that were in place when the inode was logged. This was found with generic/547 using LOAD_FACTOR=20 and TIME_FACTOR=20, where due to large log trees we have longer log replay times and fssum could detect a mismatch of the mtime and ctime of a directory. Fix this by skipping the mtime and ctime update at __btrfs_unlink_inode() if we are in log replay context (just like btrfs_add_link()). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-13 14:08:44 +02:00
Qu Wenruo	05b3728626	btrfs: clear block dirty if btrfs_writepage_cow_fixup() failed [BUG] If btrfs_writepage_cow_fixup() failed (returning value -EUCLEAN), the block will be kept dirty, but with its corresponding range finished in the ordered extent. Currently that error pattern is only possible for experimental builds, which places extra check to ensure we shouldn't hit a dirty block without a corresponding ordered extent. This means if later a writeback happens again, we can hit the following problems: - ASSERT(block_start != EXTENT_MAP_HOLE) in submit_one_sector() If the original extent map is a hole, then we can hit this case, as the new ordered extent failed, we will drop the new extent map and re-read one from the disk. - DEBUG_WARN() in btrfs_writepage_cow_fixup() This is because we no longer have an ordered extent for those dirty blocks. The original for them is already finished with error. [CAUSE] The function btrfs_writepage_cow_fixup() is not following the regular error handling of writeback. The common practice is to clear the folio dirty, start and finish the writeback for the block. This is normally done by extent_clear_unlock_delalloc() with PAGE_START_WRITEBACK \| PAGE_END_WRITEBACK flags during run_delalloc_range(). So if we keep those failed blocks dirty, they will stay in the page cache and wait for the next writeback. And since the original ordered extent is already finished and removed, depending on the original extent map, we either hit the ASSERT() inside submit_one_sector(), or hit the DEBUG_WARN() in btrfs_writepage_cow_fixup() again (and very ironic). [FIX] Follow the regular error handling to clear the dirty flag for the block range, start and finish writeback for that block range instead. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-13 14:08:44 +02:00
Qu Wenruo	4bcd3061e8	btrfs: clear block dirty if submit_one_sector() failed [BUG] If submit_one_sector() failed, the block will be kept dirty, but with their corresponding range finished in the ordered extent. This means if a writeback happens later again, we can hit the following problems: - ASSERT(block_start != EXTENT_MAP_HOLE) in submit_one_sector() If the original extent map is a hole, then we can hit this case, as the new ordered extent failed, we will drop the new extent map and re-read one from the disk. - DEBUG_WARN() in btrfs_writepage_cow_fixup() This is because we no longer have an ordered extent for those dirty blocks. The original for them is already finished with error. [CAUSE] The function submit_one_sector() is not following the regular error handling of writeback. The common practice is to clear the folio dirty, start and finish the writeback for the block. This is normally done by extent_clear_unlock_delalloc() with PAGE_START_WRITEBACK \| PAGE_END_WRITEBACK flags during run_delalloc_range(). So if we keep those failed blocks dirty, they will stay in the page cache and wait for the next writeback. And since the original ordered extent is already finished and removed, depending on the original extent map, we either hit the ASSERT() inside submit_one_sector(), or hit the DEBUG_WARN() in btrfs_writepage_cow_fixup(). [FIX] Follow the regular error handling to clear the dirty flag for the block, start and finish writeback for that block instead. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-13 14:08:44 +02:00
Naohiro Aota	04147d8394	btrfs: zoned: limit active zones to max_open_zones When there is no active zone limit, we can technically write into any number of zones at the same time. However, exceeding the max open zones can degrade performance. To prevent this, set the max_active_zones to bdev_max_open_zones() if there is no active zone limit. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-13 12:28:58 +02:00
Naohiro Aota	5c4b93f4c8	btrfs: zoned: fix write time activation failure for metadata block group Since commit `13bb483d32` ("btrfs: zoned: activate metadata block group on write time"), we activate a metadata block group at the write time. If the zone capacity is small enough, we can allocate the entire region before the first write. Then, we hit the btrfs_zoned_bg_is_full() in btrfs_zone_activate() and the activation fails. For a data block group, we activate it at the allocation time and we should check the fullness condition in the caller side. Add, a WARN to check the fullness condition. For a metadata block group, we don't need the fullness check because we activate it at the write time. Instead, activating it once it is written should be invalid. Catch that with a WARN too. Fixes: `13bb483d32` ("btrfs: zoned: activate metadata block group on write time") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-13 12:28:52 +02:00
Naohiro Aota	daa0fde322	btrfs: zoned: fix data relocation block group reservation btrfs_zoned_reserve_data_reloc_bg() is called on mount and at that point, all data block groups belong to the primary data space_info. So, we don't find anything in the data relocation space_info. Also, the condition "bg->used > 0" can select a block group with full of zone_unusable bytes for the candidate. As we cannot allocate from the block group, it is useless to reserve it as the data relocation block group. Furthermore, because of the space_info separation, we need to migrate the selected block group to the data relocation space_info. If not, the extent allocator cannot use the block group to do the allocation. This commit fixes these three issues. Fixes: e606ff985ec7 ("btrfs: zoned: reserve data_reloc block group on mount") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-13 12:28:48 +02:00
Johannes Thumshirn	f0ba0e7172	btrfs: zoned: skip ZONE FINISH of conventional zones Don't call ZONE FINISH for conventional zones as this will result in I/O errors. Instead check if the zone that needs finishing is a conventional zone and if yes skip it. Also factor out the actual handling of finishing a single zone into a helper function, as do_zone_finish() is growing ever bigger and the indentations levels are getting higher. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-13 12:28:42 +02:00
Boris Burkov	7b63259618	btrfs: fix iteration bug in __qgroup_excl_accounting() __qgroup_excl_accounting() uses the qgroup iterator machinery to update the account of one qgroups usage for all its parent hierarchy, when we either add or remove a relation and have only exclusive usage. However, there is a small bug there: we loop with an extra iteration temporary qgroup called `cur` but never actually refer to that in the body of the loop. As a result, we redundantly account the same usage to the first qgroup in the list. This can be reproduced in the following way: mkfs.btrfs -f -O squota <dev> mount <dev> <mnt> btrfs subvol create <mnt>/sv dd if=/dev/zero of=<mnt>/sv/f bs=1M count=1 sync btrfs qgroup create 1/100 <mnt> btrfs qgroup create 2/200 <mnt> btrfs qgroup assign 1/100 2/200 <mnt> btrfs qgroup assign 0/256 1/100 <mnt> btrfs qgroup show <mnt> and the broken result is (note the 2MiB on 1/100 and 0Mib on 2/100): Qgroupid Referenced Exclusive Path -------- ---------- --------- ---- 0/5 16.00KiB 16.00KiB <toplevel> 0/256 1.02MiB 1.02MiB sv Qgroupid Referenced Exclusive Path -------- ---------- --------- ---- 0/5 16.00KiB 16.00KiB <toplevel> 0/256 1.02MiB 1.02MiB sv 1/100 2.03MiB 2.03MiB 2/100<1 member qgroup> 2/100 0.00B 0.00B <0 member qgroups> With this fix, which simply re-uses `qgroup` as the iteration variable, we see the expected result: Qgroupid Referenced Exclusive Path -------- ---------- --------- ---- 0/5 16.00KiB 16.00KiB <toplevel> 0/256 1.02MiB 1.02MiB sv Qgroupid Referenced Exclusive Path -------- ---------- --------- ---- 0/5 16.00KiB 16.00KiB <toplevel> 0/256 1.02MiB 1.02MiB sv 1/100 1.02MiB 1.02MiB 2/100<1 member qgroup> 2/100 1.02MiB 1.02MiB <0 member qgroups> The existing fstests did not exercise two layer inheritance so this bug was missed. I intend to add that testing there, as well. Fixes: `a0bdc04b07` ("btrfs: qgroup: use qgroup_iterator in __qgroup_excl_accounting()") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-07 17:07:16 +02:00
Naohiro Aota	3a931e9b39	btrfs: zoned: do not select metadata BG as finish target We call btrfs_zone_finish_one_bg() to zone finish one block group and make room to activate another block group. Currently, we can choose a metadata block group as a target. But, as we reserve an active metadata block group, we no longer want to select a metadata block group. So, skip it in the loop. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-07 17:07:16 +02:00
Qu Wenruo	4289b494ac	btrfs: do not allow relocation of partially dropped subvolumes [BUG] There is an internal report that balance triggered transaction abort, with the following call trace: item 85 key (594509824 169 0) itemoff 12599 itemsize 33 extent refs 1 gen 197740 flags 2 ref#0: tree block backref root 7 item 86 key (594558976 169 0) itemoff 12566 itemsize 33 extent refs 1 gen 197522 flags 2 ref#0: tree block backref root 7 ... BTRFS error (device loop0): extent item not found for insert, bytenr 594526208 num_bytes 16384 parent 449921024 root_objectid 934 owner 1 offset 0 BTRFS error (device loop0): failed to run delayed ref for logical 594526208 num_bytes 16384 type 182 action 1 ref_mod 1: -117 ------------[ cut here ]------------ BTRFS: Transaction aborted (error -117) WARNING: CPU: 1 PID: 6963 at ../fs/btrfs/extent-tree.c:2168 btrfs_run_delayed_refs+0xfa/0x110 [btrfs] And btrfs check doesn't report anything wrong related to the extent tree. [CAUSE] The cause is a little complex, firstly the extent tree indeed doesn't have the backref for 594526208. The extent tree only have the following two backrefs around that bytenr on-disk: item 65 key (594509824 METADATA_ITEM 0) itemoff 13880 itemsize 33 refs 1 gen 197740 flags TREE_BLOCK tree block skinny level 0 (176 0x7) tree block backref root CSUM_TREE item 66 key (594558976 METADATA_ITEM 0) itemoff 13847 itemsize 33 refs 1 gen 197522 flags TREE_BLOCK tree block skinny level 0 (176 0x7) tree block backref root CSUM_TREE But the such missing backref item is not an corruption on disk, as the offending delayed ref belongs to subvolume 934, and that subvolume is being dropped: item 0 key (934 ROOT_ITEM 198229) itemoff 15844 itemsize 439 generation 198229 root_dirid 256 bytenr 10741039104 byte_limit 0 bytes_used 345571328 last_snapshot 198229 flags 0x1000000000001(RDONLY) refs 0 drop_progress key (206324 EXTENT_DATA 2711650304) drop_level 2 level 2 generation_v2 198229 And that offending tree block 594526208 is inside the dropped range of that subvolume. That explains why there is no backref item for that bytenr and why btrfs check is not reporting anything wrong. But this also shows another problem, as btrfs will do all the orphan subvolume cleanup at a read-write mount. So half-dropped subvolume should not exist after an RW mount, and balance itself is also exclusive to subvolume cleanup, meaning we shouldn't hit a subvolume half-dropped during relocation. The root cause is, there is no orphan item for this subvolume. In fact there are 5 subvolumes from around 2021 that have the same problem. It looks like the original report has some older kernels running, and caused those zombie subvolumes. Thankfully upstream commit `8d488a8c7b` ("btrfs: fix subvolume/snapshot deletion not triggered on mount") has long fixed the bug. [ENHANCEMENT] For repairing such old fs, btrfs-progs will be enhanced. Considering how delayed the problem will show up (at run delayed ref time) and at that time we have to abort transaction already, it is too late. Instead here we reject any half-dropped subvolume for reloc tree at the earliest time, preventing confusion and extra time wasted on debugging similar bugs. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-07 17:07:15 +02:00
Filipe Manana	fc5799986f	btrfs: error on missing block group when unaccounting log tree extent buffers Currently we only log an error message if we can't find the block group for a log tree extent buffer when unaccounting it (while freeing a log tree). A missing block group means something is seriously wrong and we end up leaking space from the metadata space info. So return -ENOENT in case we don't find the block group. CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-07 17:07:15 +02:00
Qu Wenruo	deaf895212	btrfs: fix wrong length parameter for btrfs_cleanup_ordered_extents() Inside nocow_one_range(), if the checksum cloning for data reloc inode failed, we call btrfs_cleanup_ordered_extents() to cleanup the just allocated ordered extents. But unlike extent_clear_unlock_delalloc(), btrfs_cleanup_ordered_extents() requires a length, not an inclusive end bytenr. This can be problematic, as the @end is normally way larger than @len. This means btrfs_cleanup_ordered_extents() can be called on folios out of the correct range, and if the out-of-range folio is under writeback, we can incorrectly clear the ordered flag of the folio, and trigger the DEBUG_WARN() inside btrfs_writepage_cow_fixup(). Fix the wrong parameter with correct length instead. Fixes: `94f6c5c17e` ("btrfs: move ordered extent cleanup to where they are allocated") CC: stable@vger.kernel.org # 6.15+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-07 17:07:15 +02:00
Qu Wenruo	15fc0bec88	btrfs: make btrfs_cleanup_ordered_extents() support large folios When hitting a large folio, btrfs_cleanup_ordered_extents() will get the same large folio multiple times, and clearing the same range again and again. Thankfully this is not causing anything wrong, just inefficiency. This is caused by the fact that we're iterating folios using the old page index, thus can hit the same large folio again and again. Enhance it by increasing @index to the index of the folio end, and only increase @index by 1 if we failed to grab a folio. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-07 17:07:15 +02:00
Leo Martins	ad580dfa38	btrfs: fix subpage deadlock in try_release_subpage_extent_buffer() There is a potential deadlock that can happen in try_release_subpage_extent_buffer() because the irq-safe xarray spin lock fs_info->buffer_tree is being acquired before the irq-unsafe eb->refs_lock. This leads to the potential race: // T1 (random eb->refs user) // T2 (release folio) spin_lock(&eb->refs_lock); // interrupt end_bbio_meta_write() btrfs_meta_folio_clear_writeback() btree_release_folio() folio_test_writeback() //false try_release_extent_buffer() try_release_subpage_extent_buffer() xa_lock_irq(&fs_info->buffer_tree) spin_lock(&eb->refs_lock); // blocked; held by T1 buffer_tree_clear_mark() xas_lock_irqsave() // blocked; held by T2 I believe that the spin lock can safely be replaced by an rcu_read_lock. The xa_for_each loop does not need the spin lock as it's already internally protected by the rcu_read_lock. The extent buffer is also protected by the rcu_read_lock so it won't be freed before we take the eb->refs_lock and check the ref count. The rcu_read_lock is taken and released every iteration, just like the spin lock, which means we're not protected against concurrent insertions into the xarray. This is fine because we rely on folio->private to detect if there are any ebs remaining in the folio. There is already some precedent for this with find_extent_buffer_nolock, which loads an extent buffer from the xarray with only rcu_read_lock. lockdep warning: ===================================================== WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected 6.16.0-0_fbk701_debug_rc0_123_g4c06e63b9203 #1 Tainted: G E N ----------------------------------------------------- kswapd0/66 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: ffff000011ffd600 (&eb->refs_lock){+.+.}-{3:3}, at: try_release_extent_buffer+0x18c/0x560 and this task is already holding: ffff0000c1d91b88 (&buffer_xa_class){-.-.}-{3:3}, at: try_release_extent_buffer+0x13c/0x560 which would create a new lock dependency: (&buffer_xa_class){-.-.}-{3:3} -> (&eb->refs_lock){+.+.}-{3:3} but this new dependency connects a HARDIRQ-irq-safe lock: (&buffer_xa_class){-.-.}-{3:3} ... which became HARDIRQ-irq-safe at: lock_acquire+0x178/0x358 _raw_spin_lock_irqsave+0x60/0x88 buffer_tree_clear_mark+0xc4/0x160 end_bbio_meta_write+0x238/0x398 btrfs_bio_end_io+0x1f8/0x330 btrfs_orig_write_end_io+0x1c4/0x2c0 bio_endio+0x63c/0x678 blk_update_request+0x1c4/0xa00 blk_mq_end_request+0x54/0x88 virtblk_request_done+0x124/0x1d0 blk_mq_complete_request+0x84/0xa0 virtblk_done+0x130/0x238 vring_interrupt+0x130/0x288 __handle_irq_event_percpu+0x1e8/0x708 handle_irq_event+0x98/0x1b0 handle_fasteoi_irq+0x264/0x7c0 generic_handle_domain_irq+0xa4/0x108 gic_handle_irq+0x7c/0x1a0 do_interrupt_handler+0xe4/0x148 el1_interrupt+0x30/0x50 el1h_64_irq_handler+0x14/0x20 el1h_64_irq+0x6c/0x70 _raw_spin_unlock_irq+0x38/0x70 __run_timer_base+0xdc/0x5e0 run_timer_softirq+0xa0/0x138 handle_softirqs.llvm.13542289750107964195+0x32c/0xbd0 ____do_softirq.llvm.17674514681856217165+0x18/0x28 call_on_irq_stack+0x24/0x30 __irq_exit_rcu+0x164/0x430 irq_exit_rcu+0x18/0x88 el1_interrupt+0x34/0x50 el1h_64_irq_handler+0x14/0x20 el1h_64_irq+0x6c/0x70 arch_local_irq_enable+0x4/0x8 do_idle+0x1a0/0x3b8 cpu_startup_entry+0x60/0x80 rest_init+0x204/0x228 start_kernel+0x394/0x3f0 __primary_switched+0x8c/0x8958 to a HARDIRQ-irq-unsafe lock: (&eb->refs_lock){+.+.}-{3:3} ... which became HARDIRQ-irq-unsafe at: ... lock_acquire+0x178/0x358 _raw_spin_lock+0x4c/0x68 free_extent_buffer_stale+0x2c/0x170 btrfs_read_sys_array+0x1b0/0x338 open_ctree+0xeb0/0x1df8 btrfs_get_tree+0xb60/0x1110 vfs_get_tree+0x8c/0x250 fc_mount+0x20/0x98 btrfs_get_tree+0x4a4/0x1110 vfs_get_tree+0x8c/0x250 do_new_mount+0x1e0/0x6c0 path_mount+0x4ec/0xa58 __arm64_sys_mount+0x370/0x490 invoke_syscall+0x6c/0x208 el0_svc_common+0x14c/0x1b8 do_el0_svc+0x4c/0x60 el0_svc+0x4c/0x160 el0t_64_sync_handler+0x70/0x100 el0t_64_sync+0x168/0x170 other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&eb->refs_lock); local_irq_disable(); lock(&buffer_xa_class); lock(&eb->refs_lock); <Interrupt> lock(&buffer_xa_class); * DEADLOCK * 2 locks held by kswapd0/66: #0: ffff800085506e40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xe8/0xe50 #1: ffff0000c1d91b88 (&buffer_xa_class){-.-.}-{3:3}, at: try_release_extent_buffer+0x13c/0x560 Link: https://www.kernel.org/doc/Documentation/locking/lockdep-design.rst#:~:text=Multi%2Dlock%20dependency%20rules%3A Fixes: `19d7f65f03` ("btrfs: convert the buffer_radix to an xarray") CC: stable@vger.kernel.org # 6.16+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-07 17:07:15 +02:00
Filipe Manana	0a32e4f002	btrfs: fix log tree replay failure due to file with 0 links and extents If we log a new inode (not persisted in a past transaction) that has 0 links and extents, then log another inode with an higher inode number, we end up with failing to replay the log tree with -EINVAL. The steps for this are: 1) create new file A 2) write some data to file A 3) open an fd on file A 4) unlink file A 5) fsync file A using the previously open fd 6) create file B (has higher inode number than file A) 7) fsync file B 8) power fail before current transaction commits Now when attempting to mount the fs, the log replay will fail with -ENOENT at replay_one_extent() when attempting to replay the first extent of file A. The failure comes when trying to open the inode for file A in the subvolume tree, since it doesn't exist. Before commit `5f61b96159` ("btrfs: fix inode lookup error handling during log replay"), the returned error was -EIO instead of -ENOENT, since we converted any errors when attempting to read an inode during log replay to -EIO. The reason for this is that the log replay procedure fails to ignore the current inode when we are at the stage LOG_WALK_REPLAY_ALL, our current inode has 0 links and last inode we processed in the previous stage has a non 0 link count. In other words, the issue is that at replay_one_extent() we only update wc->ignore_cur_inode if the current replay stage is LOG_WALK_REPLAY_INODES. Fix this by updating wc->ignore_cur_inode whenever we find an inode item regardless of the current replay stage. This is a simple solution and easy to backport, but later we can do other alternatives like avoid logging extents or inode items other than the inode item for inodes with a link count of 0. The problem with the wc->ignore_cur_inode logic has been around since commit `f2d72f42d5` ("Btrfs: fix warning when replaying log after fsync of a tmpfile") but it only became frequent to hit since the more recent commit `5e85262e54` ("btrfs: fix fsync of files with no hard links not persisting deletion"), because we stopped skipping inodes with a link count of 0 when logging, while before the problem would only be triggered if trying to replay a log tree created with an older kernel which has a logged inode with 0 links. A test case for fstests will be submitted soon. Reported-by: Peter Jung <ptr1337@cachyos.org> Link: https://lore.kernel.org/linux-btrfs/fce139db-4458-4788-bb97-c29acf6cb1df@cachyos.org/ Reported-by: burneddi <burneddi@protonmail.com> Link: https://lore.kernel.org/linux-btrfs/lh4W-Lwc0Mbk-QvBhhQyZxf6VbM3E8VtIvU3fPIQgweP_Q1n7wtlUZQc33sYlCKYd-o6rryJQfhHaNAOWWRKxpAXhM8NZPojzsJPyHMf2qY=@protonmail.com/#t Reported-by: Russell Haley <yumpusamongus@gmail.com> Link: https://lore.kernel.org/linux-btrfs/598ecc75-eb80-41b3-83c2-f2317fbb9864@gmail.com/ Fixes: `f2d72f42d5` ("Btrfs: fix warning when replaying log after fsync of a tmpfile") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-06 13:01:38 +02:00
Filipe Manana	005b0a0c24	btrfs: send: use fallocate for hole punching with send stream v2 Currently holes are sent as writes full of zeroes, which results in unnecessarily using disk space at the receiving end and increasing the stream size. In some cases we avoid sending writes of zeroes, like during a full send operation where we just skip writes for holes. But for some cases we fill previous holes with writes of zeroes too, like in this scenario: 1) We have a file with a hole in the range [2M, 3M), we snapshot the subvolume and do a full send. The range [2M, 3M) stays as a hole at the receiver since we skip sending write commands full of zeroes; 2) We punch a hole for the range [3M, 4M) in our file, so that now it has a 2M hole in the range [2M, 4M), and snapshot the subvolume. Now if we do an incremental send, we will send write commands full of zeroes for the range [2M, 4M), removing the hole for [2M, 3M) at the receiver. We could improve cases such as this last one by doing additional comparisons of file extent items (or their absence) between the parent and send snapshots, but that's a lot of code to add plus additional CPU and IO costs. Since the send stream v2 already has a fallocate command and btrfs-progs implements a callback to execute fallocate since the send stream v2 support was added to it, update the kernel to use fallocate for punching holes for V2+ streams. Test coverage is provided by btrfs/284 which is a version of btrfs/007 that exercises send stream v2 instead of v1, using fsstress with random operations and fssum to verify file contents. Link: https://github.com/kdave/btrfs-progs/issues/1001 CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 01:23:14 +02:00
Filipe Manana	55fae08a06	btrfs: unfold transaction aborts when writing dirty block groups We have a single transaction abort call that can be due to an error from one of two calls to update_block_group_item(). Unfold the transaction abort calls so that if they happen we know which update_block_group_item() call failed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 01:14:08 +02:00
Filipe Manana	3a074cc659	btrfs: use saner variable type and name to indicate extrefs at add_inode_ref() We are using a variable named 'log_ref_ver' of type int to indicate if we are processing an extref item or not, using a value of 1 if so, otherwise 0. This is an odd name and type, so rename it to 'is_extref_item' and change its type to bool. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 01:14:07 +02:00
Filipe Manana	24e066ded4	btrfs: don't skip remaining extrefs if dir not found during log replay During log replay, at add_inode_ref(), if we have an extref item that contains multiple extrefs and one of them points to a directory that does not exist in the subvolume tree, we are supposed to ignore it and process the remaining extrefs encoded in the extref item, since each extref can point to a different parent inode. However when that happens we just return from the function and ignore the remaining extrefs. The problem has been around since extrefs were introduced, in commit `f186373fef` ("btrfs: extended inode refs"), but it's hard to hit in practice because getting extref items encoding multiple extref requires getting a hash collision when computing the offset of the extref's key. The offset if computed like this: key.offset = btrfs_extref_hash(dir_ino, name->name, name->len); and btrfs_extref_hash() is just a wrapper around crc32c(). Fix this by moving to next iteration of the loop when we don't find the parent directory that an extref points to. Fixes: `f186373fef` ("btrfs: extended inode refs") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 01:14:02 +02:00
Filipe Manana	7ebf381a69	btrfs: don't ignore inode missing when replaying log tree During log replay, at add_inode_ref(), we return -ENOENT if our current inode isn't found on the subvolume tree or if a parent directory isn't found. The error comes from btrfs_iget_logging() <- btrfs_iget() <- btrfs_read_locked_inode(). The single caller of add_inode_ref(), replay_one_buffer(), ignores an -ENOENT error because it expects that error to mean only that a parent directory wasn't found and that is ok. Before commit `5f61b96159` ("btrfs: fix inode lookup error handling during log replay") we were converting any error when getting a parent directory to -ENOENT and any error when getting the current inode to -EIO, so our caller would fail log replay in case we can't find the current inode. After that commit however in case the current inode is not found we return -ENOENT to the caller and therefore it ignores the critical fact that the current inode was not found in the subvolume tree. Fix this by converting -ENOENT to 0 when we don't find a parent directory, returning -ENOENT when we don't find the current inode and making the caller, replay_one_buffer(), not ignore -ENOENT anymore. Fixes: `5f61b96159` ("btrfs: fix inode lookup error handling during log replay") CC: stable@vger.kernel.org # 6.16 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 01:13:40 +02:00
Qu Wenruo	041c39da53	btrfs: enable large data folios for data reloc inode For data reloc inodes, they are a special type of inodes that are not exposed to user space, and are only utilized during data block groups relocation. They do not go under regular read-write operations, but have their file extents manually created to have the same layout of a block group, then its content is read from the original block group, and written back to the new location which is in a new block group. Previously all the handling was done in page units, and commit `c283289812` ("btrfs: make relocate_one_page() handle subpage case") changed the handling to subpage blocks. On the other hand, data reloc inodes are a perfect match for large data folios, as each relocation cluster represents one or more data extents that are contiguous in their logical addresses. This patch enables large folios for data reloc inodes by: - Remove the special handling of data reloc inodes when setting folio order - Change relocate_one_folio() to return the file offset of the next folio Originally it's designed to handle fixed page sized blocks, but with large folios, we can handle a large folio, thus we have to return the end of the current folio. - Remove the warning on folio_order() - Use folio_size() to replace fixed PAGE_SIZE usage - Use file_offset as iterator inside relocate_file_extent_cluster Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 01:13:03 +02:00
Qu Wenruo	cec780a139	btrfs: output more info when btrfs_subpage_assert() failed The function btrfs_subpage_assert() is a very commonly utilized assert to make sure the range passed in is correct inside the folio. And when some code is not properly subpage/large folio compatible btrfs_subpage_assert() will be the first to be triggered. E.g. when I incorrectly enabled large folios for data reloc inodes, it immediately triggered btrfs_subpage_assert(). In that case, outputting all the involved members will be very helpful, this includes: - start - len - folio position inside the mapping - folio size Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 01:13:03 +02:00
Qu Wenruo	4e346baee9	btrfs: reloc: unconditionally invalidate the page cache for each cluster Commit `9d9ea1e68a` ("btrfs: subpage: fix relocation potentially overwriting last page data") fixed a bug when relocating data block groups for subpage cases. However for the incoming large folios for data reloc inode, we can hit the same situation where block size is the same as page size, but the folio we got is still larger than a block. In that case, the old subpage specific check is no longer reliable. Here we have to enhance the handling by: - Unconditionally invalidate the page cache for the current cluster We set the @flush to true so that any dirty folios are properly written back first. And this time instead of dropping the whole page cache, just drop the range covered by the current cluster. This will bring some minor performance drop, as for a large folio, the heading half will be read twice (read by previous cluster, then invalidated, then read again by the current cluster). However that is required to support large folios, and this gets rid of the kinda tricky manual uptodate flag clearing for each block. - Remove the special handling of writing back the whole page cache filemap_invalidate_inode() handles the write back already, and since we're invalidating all pages in the range, we no longer need to manually clear the uptodate flags for involved blocks. Thus there is no need to manually write back the whole page cache. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 01:13:03 +02:00
David Sterba	009b2056cb	btrfs: defrag: add flag to force no-compression Currently the defrag ioctl cannot rewrite the extents without compression. Add a new flag for that, as setting compression to 0 (or "no compression") means to do no changes to compression so take what is the current default, like mount options or properties. The defrag setting overrides mount or properties. The compression BTRFS_DEFRAG_DONT_COMPRESS is only used for in-memory operations and does not need to have a fixed value. Mount with zstd:9, copy test file from /usr/bin/ (about 260KB): $ mount -o compress=zstd:9 /dev/vda /mnt $ filefrag -vsb testfile filefrag: -b needs a blocksize option, assuming 1024-byte blocks. Filesystem type is: 9123683e File size of testfile is 297704 (292 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 127: 13312.. 13439: 128: encoded 1: 128.. 255: 13364.. 13491: 128: 13440: encoded 2: 256.. 291: 13424.. 13459: 36: 13492: last,encoded,eof testfile: 3 extents found $ compsize testfile Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments. Type Perc Disk Usage Uncompressed Referenced TOTAL 42% 124K 292K 292K zstd 42% 124K 292K 292K Defrag to uncompressed: $ btrfs fi defrag --nocomp testfile $ filefrag -vsb testfile filefrag: -b needs a blocksize option, assuming 1024-byte blocks. Filesystem type is: 9123683e File size of testfile is 297704 (292 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 291: 291840.. 292131: 292: last,eof testfile: 1 extent found $ compsize testfile Processed 1 file, 1 regular extents (1 refs), 0 inline, 1 fragments. Type Perc Disk Usage Uncompressed Referenced TOTAL 100% 292K 292K 292K none 100% 292K 292K 292K Compress again with LZO: $ btrfs fi defrag -clzo testfile $ filefrag -vsb testfile filefrag: -b needs a blocksize option, assuming 1024-byte blocks. Filesystem type is: 9123683e File size of testfile is 297704 (292 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 127: 13312.. 13439: 128: encoded 1: 128.. 255: 13392.. 13519: 128: 13440: encoded 2: 256.. 291: 13480.. 13515: 36: 13520: last,encoded,eof testfile: 3 extents found $ compsize testfile Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments. Type Perc Disk Usage Uncompressed Referenced TOTAL 64% 188K 292K 292K lzo 64% 188K 292K 292K Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 01:13:03 +02:00
Boris Burkov	807d9023e7	btrfs: fix ssd_spread overallocation If the ssd_spread mount option is enabled, then we run the so called clustered allocator for data block groups. In practice, this results in creating a btrfs_free_cluster which caches a block_group and borrows its free extents for allocation. Since the introduction of allocation size classes in 6.1, there has been a bug in the interaction between that feature and ssd_spread. find_free_extent() has a number of nested loops. The loop going over the allocation stages, stored in ffe_ctl->loop and managed by find_free_extent_update_loop(), the loop over the raid levels, and the loop over all the block_groups in a space_info. The size class feature relies on the block_group loop to ensure it gets a chance to see a block_group of a given size class. However, the clustered allocator uses the cached cluster block_group and breaks that loop. Each call to do_allocation() will really just go back to the same cached block_group. Normally, this is OK, as the allocation either succeeds and we don't want to loop any more or it fails, and we clear the cluster and return its space to the block_group. But with size classes, the allocation can succeed, then later fail, outside of do_allocation() due to size class mismatch. That latter failure is not properly handled due to the highly complex multi loop logic. The result is a painful loop where we continue to allocate the same num_bytes from the cluster in a tight loop until it fails and releases the cluster and lets us try a new block_group. But by then, we have skipped great swaths of the available block_groups and are likely to fail to allocate, looping the outer loop. In pathological cases like the reproducer below, the cached block_group is often the very last one, in which case we don't perform this tight bg loop but instead rip through the ffe stages to LOOP_CHUNK_ALLOC and allocate a chunk, which is now the last one, and we enter the tight inner loop until an allocation failure. Then allocation succeeds on the final block_group and if the next allocation is a size mismatch, the exact same thing happens again. Triggering this is as easy as mounting with -o ssd_spread and then running: mount -o ssd_spread $dev $mnt dd if=/dev/zero of=$mnt/big bs=16M count=1 &>/dev/null dd if=/dev/zero of=$mnt/med bs=4M count=1 &>/dev/null sync if you do the two writes + sync in a loop, you can force btrfs to spin an excessive amount on semi-successful clustered allocations, before ultimately failing and advancing to the stage where we force a chunk allocation. This results in 2G of data allocated per iteration, despite only using ~20M of data. By using a small size classed extent, the inner loop takes longer and we can spin for longer. The simplest, shortest term fix to unbreak this is to make the clustered allocator size_class aware in the dumbest way, where it fails on size class mismatch. This may hinder the operation of the clustered allocator, but better hindered than completely broken and terribly overallocating. Further re-design improvements are also in the works. Fixes: `52bb7a2166` ("btrfs: introduce size class to block group allocator") CC: stable@vger.kernel.org # 6.1+ Reported-by: David Sterba <dsterba@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 01:12:52 +02:00
Naohiro Aota	62be7afcc1	btrfs: zoned: requeue to unused block group list if zone finish failed btrfs_zone_finish() can fail for several reason. If it is -EAGAIN, we need to try it again later. So, put the block group to the retry list properly. Failing to do so will keep the removable block group intact until remount and can causes unnecessary ENOSPC. Fixes: `74e91b12b1` ("btrfs: zoned: zone finish unused block group") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:23 +02:00
Naohiro Aota	3061801420	btrfs: zoned: do not remove unwritten non-data block group There are some reports of "unable to find chunk map for logical 2147483648 length 16384" error message appears in dmesg. This means some IOs are occurring after a block group is removed. When a metadata tree node is cleaned on a zoned setup, we keep that node still dirty and write it out not to create a write hole. However, this can make a block group's used bytes == 0 while there is a dirty region left. Such an unused block group is moved into the unused_bg list and processed for removal. When the removal succeeds, the block group is removed from the transaction->dirty_bgs list, so the unused dirty nodes in the block group are not sent at the transaction commit time. It will be written at some later time e.g, sync or umount, and causes "unable to find chunk map" errors. This can happen relatively easy on SMR whose zone size is 256MB. However, calling do_zone_finish() on such block group returns -EAGAIN and keep that block group intact, which is why the issue is hidden until now. Fixes: `afba2bc036` ("btrfs: zoned: implement active zone tracking") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:23 +02:00
Filipe Manana	d6be378de0	btrfs: remove btrfs_clear_extent_bits() It's just a simple wrapper around btrfs_clear_extent_bit() that passes a NULL for its last argument (a cached extent state record), plus there is not counter part - we have a btrfs_set_extent_bit() but we do not have a btrfs_set_extent_bits() (plural version). So just remove it and make all callers use btrfs_clear_extent_bit() directly. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Filipe Manana	279b4db10e	btrfs: use cached state when falling back from NOCoW write to CoW write We have a cached extent state record from the previous extent locking so we can use when setting the EXTENT_NORESERVE in the range, allowing the operation to be faster if the extent io tree is relatively large. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Filipe Manana	bfc9d71aa4	btrfs: set EXTENT_NORESERVE before range unlock in btrfs_truncate_block() Set the EXTENT_NORESERVE bit in the io tree before unlocking the range so that we can use the cached state and speedup the operation, since the unlock operation releases the cached state. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Johannes Thumshirn	5ae011bcbb	btrfs: don't print relocation messages from auto reclaim When BTRFS is doing automatic block-group reclaim, it is spamming the kernel log messages a lot. Add a 'verbose' parameter to btrfs_relocate_chunk() and btrfs_relocate_block_group() to control the verbosity of these log message. This way the old behaviour of printing log messages on a user-space initiated balance operation can be kept while excessive log spamming due to auto reclaim is mitigated. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Johannes Thumshirn	a507904090	btrfs: remove redundant auto reclaim log message Remove the log message before reclaiming a chunk in btrfs_reclaim_bgs_work(). Especially with automatic block-group reclaiming these messages spam the kernel log. Note there is also a tracepoint for the same condition to ease debugging. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Filipe Manana	240fafaa44	btrfs: make btrfs_check_nocow_lock() check more than one extent Currently btrfs_check_nocow_lock() stops at the first extent it finds and that extent may be smaller than the target range we want to NOCOW into. But we can have multiple consecutive extents which we can NOCOW into, so by stopping at the first one we find we just make the caller do more work by splitting the write into multiple ones, or in the case of mmap writes with large folios we fail with -ENOSPC in case the folio's range is covered by more than one extent (the fallback to NOCOW for mmap writes in case there's no available data space to reserve/allocate was recently added by the patch "btrfs: fix -ENOSPC mmap write failure on NOCOW files/extents"). Improve on this by checking for multiple consecutive extents. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Filipe Manana	68e0fcc361	btrfs: assert we can NOCOW the range in btrfs_truncate_block() We call btrfs_check_nocow_lock() to see if we can NOCOW a block sized range but we don't check later if we can NOCOW the whole range. It's unexpected to be able to NOCOW a range smaller than blocksize, so add an assertion to check the NOCOW range matches the blocksize. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Filipe Manana	c6482cff95	btrfs: update function comment for btrfs_check_nocow_lock() The documentation for the @nowait parameter is missing, so add it. The @nowait parameter was added in commit `80f9d24130` ("btrfs: make btrfs_check_nocow_lock nowait compatible"), which forgot to update the function comment. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Filipe Manana	601ea9c42a	btrfs: use btrfs_inode local variable at btrfs_page_mkwrite() Most of the time we want to use the btrfs_inode, so change the local inode variable to be a btrfs_inode instead of a VFS inode, reducing verbosity by eliminating a lot of BTRFS_I() calls. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Filipe Manana	11ad7983c2	btrfs: use variable for io_tree when clearing range in btrfs_page_mkwrite() We have the inode's io_tree already stored in a local variable, so use it instead of grabbing it again in the call to btrfs_clear_extent_bit(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Filipe Manana	6599716de2	btrfs: fix -ENOSPC mmap write failure on NOCOW files/extents If we attempt a mmap write into a NOCOW file or a prealloc extent when there is no more available data space (or unallocated space to allocate a new data block group) and we can do a NOCOW write (there are no reflinks for the target extent or snapshots), we always fail due to -ENOSPC, unlike for the regular buffered write and direct IO paths where we check that we can do a NOCOW write in case we can't reserve data space. Simple reproducer: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi umount $DEV &> /dev/null mkfs.btrfs -f -b $((512 * 1024 * 1024)) $DEV mount $DEV $MNT touch $MNT/foobar # Make it a NOCOW file. chattr +C $MNT/foobar # Add initial data to file. xfs_io -c "pwrite -S 0xab 0 1M" $MNT/foobar # Fill all the remaining data space and unallocated space with data. dd if=/dev/zero of=$MNT/filler bs=4K &> /dev/null # Overwrite the file with a mmap write. Should succeed. xfs_io -c "mmap -w 0 1M" \ -c "mwrite -S 0xcd 0 1M" \ -c "munmap" \ $MNT/foobar # Unmount, mount again and verify the new data was persisted. umount $MNT mount $DEV $MNT od -A d -t x1 $MNT/foobar umount $MNT Running this: $ ./test.sh (...) wrote 1048576/1048576 bytes at offset 0 1 MiB, 256 ops; 0.0008 sec (1.188 GiB/sec and 311435.5231 ops/sec) ./test.sh: line 24: 234865 Bus error xfs_io -c "mmap -w 0 1M" -c "mwrite -S 0xcd 0 1M" -c "munmap" $MNT/foobar 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab * 1048576 Fix this by not failing in case we can't allocate data space and we can NOCOW into the target extent - reserving only metadata space in this case. After this change the test passes: $ ./test.sh (...) wrote 1048576/1048576 bytes at offset 0 1 MiB, 256 ops; 0.0007 sec (1.262 GiB/sec and 330749.3540 ops/sec) 0000000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd * 1048576 A test case for fstests will be added soon. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
David Sterba	e8d2e254dc	btrfs: use clear_and_wake_up_bit() where open coded There are two cases open coding the clear and wake up pattern, we can use the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	72b2b199d5	btrfs: accessors: rename variable for folio offset There used to be 'oip' short for offset in page, which got changed during conversion to folios, the name is a bit confusing so rename it. Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	ae80748225	btrfs: accessors: factor out split memcpy with two sources The case of a reading the bytes from 2 folios needs two memcpy()s, the compiler does not emit calls but two inline loops. Factoring out the code makes some improvement (stack, code) and in the future will provide an optimized implementation as well. (The analogical version with two destinations is not done as it increases stack usage but can be done if needed.) The address of the second folio is reordered before the first memcpy, which leads to an optimization reusing the vmemmap_base and page_offset_base (implementing folio_address()). Stack usage reduction: btrfs_get_32 -8 (32 -> 24) btrfs_get_64 -8 (32 -> 24) Code size reduction: text data bss dec hex filename 1454279 115665 16088 1586032 183370 pre/btrfs.ko 1454229 115665 16088 1585982 18333e post/btrfs.ko DELTA: -50 As this is the last patch in this series, here's the overall diff starting and including commit "btrfs: accessors: simplify folio bounds checks": Stack: btrfs_set_16 -72 (88 -> 16) btrfs_get_32 -56 (80 -> 24) btrfs_set_8 -72 (88 -> 16) btrfs_set_64 -64 (88 -> 24) btrfs_get_8 -72 (80 -> 8) btrfs_get_16 -64 (80 -> 16) btrfs_set_32 -64 (88 -> 24) btrfs_get_64 -56 (80 -> 24) NEW (48): report_setget_bounds 48 LOST/NEW DELTA: +48 PRE/POST DELTA: -472 Code: text data bss dec hex filename 1456601 115665 16088 1588354 183c82 pre/btrfs.ko 1454229 115665 16088 1585982 18333e post/btrfs.ko DELTA: -2372 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	c8b33a57fb	btrfs: accessors: set target address at initialization The target address for the read/write can be simplified as it's the same expression for the first folio. This improves the generated code as the folio address does not have to be cached on stack. Stack usage reduction: btrfs_set_32 -8 (32 -> 24) btrfs_set_64 -8 (32 -> 24) btrfs_get_16 -8 (24 -> 16) Code size reduction: text data bss dec hex filename 1454459 115665 16088 1586212 183424 pre/btrfs.ko 1454279 115665 16088 1586032 183370 post/btrfs.ko DELTA: -180 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	1ed0f75d57	btrfs: accessors: compile-time fast path for u16 Reading/writing 2 bytes (u16) may need 2 folios to be written to, each time it's just one byte so using memcpy for that is an overkill. Add a branch for the split case so that memcpy is now used for u32 and u64. Another side effect is that the u16 types now don't need additional stack space, everything fits to registers. Stack usage is reduced: btrfs_get_16 -8 (32 -> 24) btrfs_set_16 -16 (32 -> 16) Code size reduction: text data bss dec hex filename 1454691 115665 16088 1586444 18350c pre/btrfs.ko 1454459 115665 16088 1586212 183424 post/btrfs.ko DELTA: -232 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	58383c6866	btrfs: accessors: compile-time fast path for u8 Reading/writing 1 byte (u8) is a special case compared to the others as it's always contained in the folio we find, so the split memcpy will never be needed. Turn it to a compile-time check that the memcpy part can be optimized out. The stack usage is reduced: btrfs_set_8 -16 (32 -> 16) btrfs_get_8 -16 (24 -> 8) Code size reduction: text data bss dec hex filename 1454951 115665 16088 1586704 183610 pre/btrfs.ko 1454691 115665 16088 1586444 18350c post/btrfs.ko DELTA: -260 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00

1 2 3 4 5 ...

1369491 Commits