[BUG]
For the following fsx -e 1 run, btrfs still fails on 64K page size
with 4K fs block size:
READ BAD DATA: offset = 0x26b3a, size = 0xfafa, fname = /mnt/btrfs/junk
OFFSET GOOD BAD RANGE
0x26b3a 0x0000 0x15b4 0x0
operation# (mod 256) for the bad data may be 21
[...]
LOG DUMP (28 total operations):
1( 1 mod 256): SKIPPED (no operation)
2( 2 mod 256): SKIPPED (no operation)
3( 3 mod 256): SKIPPED (no operation)
4( 4 mod 256): SKIPPED (no operation)
5( 5 mod 256): WRITE 0x1ea90 thru 0x285e0 (0x9b51 bytes) HOLE
6( 6 mod 256): ZERO 0x1b1a8 thru 0x20bd4 (0x5a2d bytes)
7( 7 mod 256): FALLOC 0x22b1a thru 0x272fa (0x47e0 bytes) INTERIOR
8( 8 mod 256): WRITE 0x741d thru 0x13522 (0xc106 bytes)
9( 9 mod 256): MAPWRITE 0x73ee thru 0xdeeb (0x6afe bytes)
10( 10 mod 256): FALLOC 0xb719 thru 0xb994 (0x27b bytes) INTERIOR
11( 11 mod 256): COPY 0x15ed8 thru 0x18be1 (0x2d0a bytes) to 0x25f6e thru 0x28c77
12( 12 mod 256): ZERO 0x1615e thru 0x1770e (0x15b1 bytes)
13( 13 mod 256): SKIPPED (no operation)
14( 14 mod 256): DEDUPE 0x20000 thru 0x27fff (0x8000 bytes) to 0x1000 thru 0x8fff
15( 15 mod 256): SKIPPED (no operation)
16( 16 mod 256): CLONE 0xa000 thru 0xffff (0x6000 bytes) to 0x36000 thru 0x3bfff
17( 17 mod 256): ZERO 0x14adc thru 0x1b78a (0x6caf bytes)
18( 18 mod 256): TRUNCATE DOWN from 0x3c000 to 0x1e2e3 ******WWWW
19( 19 mod 256): CLONE 0x4000 thru 0x11fff (0xe000 bytes) to 0x16000 thru 0x23fff
20( 20 mod 256): FALLOC 0x311e1 thru 0x3681b (0x563a bytes) PAST_EOF
21( 21 mod 256): FALLOC 0x351c5 thru 0x40000 (0xae3b bytes) EXTENDING
22( 22 mod 256): WRITE 0x920 thru 0x7e51 (0x7532 bytes)
23( 23 mod 256): COPY 0x2b58 thru 0xc508 (0x99b1 bytes) to 0x117b1 thru 0x1b161
24( 24 mod 256): TRUNCATE DOWN from 0x40000 to 0x3c9a5
25( 25 mod 256): SKIPPED (no operation)
26( 26 mod 256): MAPWRITE 0x25020 thru 0x26b06 (0x1ae7 bytes)
27( 27 mod 256): SKIPPED (no operation)
28( 28 mod 256): READ 0x26b3a thru 0x36633 (0xfafa bytes) ***RRRR***
[CAUSE]
The involved operations are:
fallocating to largest ever: 0x40000
21 pollute_eof 0x24000 thru 0x2ffff (0xc000 bytes)
21 falloc from 0x351c5 to 0x40000 (0xae3b bytes)
28 read 0x26b3a thru 0x36633 (0xfafa bytes)
At operation #21 a pollute_eof is done by a memory mapped write into
the range [0x24000, 0x2ffff).
At this stage, the inode size is 0x24000, which is block aligned.
Then fallocate happens, and since it's expanding the inode, it will call
btrfs_truncate_block() to truncate any unaligned range.
But since the inode size is already block aligned,
btrfs_truncate_block() does nothing and exits.
However, remember that the folio at 0x20000 already has some of its range
polluted. Although it will not be written back to disk, it still affects
the page cache, causing the later operation #28 to read back the polluted
value.
[FIX]
Instead of exiting early from btrfs_truncate_block() if the range is
already block aligned, do extra folio zeroing if the fs block size is
smaller than the page size and we're truncating beyond EOF.
This is to address exactly the above case where memory mapped write can
still leave some garbage beyond EOF.
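As a rough sketch of that idea (not the exact patch; @beyond_eof,
@zero_start and @zero_len are illustrative placeholders):

  /* Inside btrfs_truncate_block(), when the inode size is block aligned. */
  if (IS_ALIGNED(i_size_read(&inode->vfs_inode), fs_info->sectorsize)) {
          /*
           * Previously: nothing unaligned to zero, exit early.
           * Now: if the fs block size is smaller than the page size and
           * we're truncating beyond EOF, still zero the in-folio range
           * beyond EOF, as a memory mapped write may have polluted it.
           */
          if (fs_info->sectorsize < PAGE_SIZE && beyond_eof)
                  folio_zero_range(folio, zero_start, zero_len);
          return 0;
  }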
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following fsx sequence will fail on btrfs with 64K page size and 4K
fs block size:
#fsx -d -e 1 -N 4 $mnt/junk -S 36386
READ BAD DATA: offset = 0xe9ba, size = 0x6dd5, fname = /mnt/btrfs/junk
OFFSET GOOD BAD RANGE
0xe9ba 0x0000 0x03ac 0x0
operation# (mod 256) for the bad data may be 3
...
LOG DUMP (4 total operations):
1( 1 mod 256): WRITE 0x6c62 thru 0x1147d (0xa81c bytes) HOLE ***WWWW
2( 2 mod 256): TRUNCATE DOWN from 0x1147e to 0x5448 ******WWWW
3( 3 mod 256): ZERO 0x1c7aa thru 0x28fe2 (0xc839 bytes)
4( 4 mod 256): MAPREAD 0xe9ba thru 0x1578e (0x6dd5 bytes) ***RRRR***
[CAUSE]
Only 2 operations are really involved in this case:
3 pollute_eof 0x5448 thru 0xffff (0xabb8 bytes)
3 zero from 0x1c7aa to 0x28fe3, (0xc839 bytes)
4 mapread 0xe9ba thru 0x1578e (0x6dd5 bytes)
At operation 3, fsx pollutes beyond EOF; that is done by mmap()'ing the
file and writing into the mapped range beyond EOF.
Such a write fills the range beyond EOF, but it will never reach disk,
as ranges beyond EOF are not marked dirty nor uptodate.
Then we zero_range for [0x1c7aa, 0x28fe3], and since the range is beyond
our isize (which was 0x5448), we should zero out any range beyond
EOF (0x5448).
During btrfs_zero_range(), we call btrfs_truncate_block() to dirty the
unaligned head block.
But that function only really zeroes out the block at [0x5000, 0x5fff]; it
doesn't bother with any range other than that block, since those ranges
will not be marked dirty nor written back.
So the range [0x6000, 0xffff] is still polluted, and later mapread()
will return the poisoned value.
[FIX]
Enhance btrfs_truncate_block() by:
- Pass a @start/@end pair to indicate the full truncation range
This is to handle the following truncation case:
Page size is 64K, fs block size is 4K, truncate range is
[6K, 60K]
0 32K 64K
| |///////////////////////////////////| |
6K 60K
The range is not aligned for its head block, so we need to call
btrfs_truncate_block() with @from = 6K, @front = 0, @len = 0.
But with that information we only know to zero the range [6K, 8K); if
we zeroed out the range [6K, 64K), the last block would also be zeroed,
causing data loss.
So here we need the full range we're truncating, so that we can avoid
over-truncation.
- Rename @from to @offset
Since the parameter is now only used to locate a block, it no longer
carries the old @from meaning well.
- Remove @front parameter
With the full truncate range passed in, we can determine if the
@offset is at the head or tail block.
- Skip truncation if @offset is not in the head nor tail blocks
The call site in hole punch unconditionally calls
btrfs_truncate_block() without even checking whether the range is
aligned.
If @offset is neither in the head nor in the tail block, we can
safely ignore it.
- Skip truncate if the range inside the target block is already aligned
- Make btrfs_truncate_block() zero all blocks beyond EOF
Since we have the original range, we know exactly if we're doing
truncation beyond EOF (the @end will be (u64)-1).
If we're doing truncation beyond EOF, then enlarge the truncation
range to the folio end, to address the possibly polluted ranges.
Otherwise still keep the zero range inside the block; we can have
large data folios soon, and always truncating every block inside the
same folio can be costly for large folios.
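Put together, the reworked helper could look roughly like this (a sketch
based on the points above; the exact prototype may differ):

  /*
   * @start/@end: the full truncation range, so we never zero past it.
   * @offset:     only locates the block to zero; it must fall in the
   *              head or the tail block of [start, end], otherwise the
   *              call is a no-op.
   * If @end == (u64)-1 we are truncating beyond EOF and the zeroed
   * range is enlarged to the folio end to clear polluted ranges.
   */
  int btrfs_truncate_block(struct btrfs_inode *inode, u64 offset,
                           u64 start, u64 end);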
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The (correct) commit e41c81d0d3 ("mm/truncate: Replace page_mapped()
call in invalidate_inode_page()") replaced the page_mapped(page) check
with a refcount check. However, this refcount check does not work as
expected with drop_caches for btrfs's metadata pages.
Btrfs has a per-sb metadata inode with cached pages, and when not in
active use by btrfs, they have a refcount of 3. One from the initial
call to alloc_pages(), one (nr_pages == 1) from filemap_add_folio(), and
one from folio_attach_private(). We would expect such pages to get dropped
by drop_caches. However, drop_caches calls into mapping_evict_folio() via
mapping_try_invalidate() which gets a reference on the folio with
find_lock_entries(). As a result, these pages have a refcount of 4, and
fail this check.
For what it's worth, such pages do get reclaimed under memory pressure,
so I would say that while this behavior is surprising, it is not really
dangerously broken.
When I asked the mm folks about the expected refcount in this case, I
was told that the correct thing to do is to donate the refcount from the
original allocation to the page cache after inserting it.
Therefore, attempt to fix this by adding a folio_put() to the critical
spot in alloc_extent_buffer() where we are sure that we have really
allocated and attached new pages. We must also adjust
folio_detach_private() to properly handle being the last reference to the
folio and not do a use-after-free after folio_detach_private().
extent_buffers allocated by clone_extent_buffer() and
alloc_dummy_extent_buffer() are unmapped, so this transfer of ownership
from allocation to insertion in the mapping does not apply to them.
However, we can still folio_put() them safely once they are finished
being allocated and have called folio_attach_private().
Finally, removing the generic folio_put() for the allocation from
btrfs_detach_extent_buffer_folios() means we need to be careful to do
the appropriate folio_put() in allocation failure paths in
alloc_extent_buffer(), clone_extent_buffer() and
alloc_dummy_extent_buffer().
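The resulting refcount flow, sketched (simplified from
alloc_extent_buffer(), error handling elided):

  struct folio *folio = filemap_alloc_folio(GFP_NOFS, 0); /* refcount 1: allocation */

  ret = filemap_add_folio(mapping, folio, index, GFP_NOFS); /* +1: page cache */
  folio_attach_private(folio, eb);                          /* +1: private */

  /*
   * Donate the allocation reference to the page cache: the refcount
   * drops back to 2, so the check in mapping_evict_folio() can pass
   * and drop_caches can evict the folio when it's not in active use.
   */
  folio_put(folio);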
Link: https://lore.kernel.org/linux-mm/ZrwhTXKzgDnCK76Z@casper.infradead.org/
Tested-by: Klara Modin <klarasmodin@gmail.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
We now have a verbose variant of ASSERT() so that we can print the value
of the block group's discard_index. So use it for better problem analysis
in case the assertion is triggered.
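For example (illustrative, the exact message may differ):

  ASSERT(block_group->discard_index < BTRFS_NR_DISCARD_LISTS,
         "discard_index=%d", block_group->discard_index);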
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we have several small bitmaps inside scrub_stripe:
- extent_sector_bitmap
- error_bitmap
- io_error_bitmap
- csum_error_bitmap
- meta_error_bitmap
- meta_gen_error_bitmap
All those bitmaps are at most 16 bits long, but unsigned long is
either 32 or 64 (more common) bits.
This means we're wasting 1/2 or 3/4 of the space for each bitmap.
And since we can have 128 scrub_stripes for each device, such wasted
space adds up quickly.
Instead of using a single unsigned long for each bitmap, aggregate them
into a larger bitmap, just like what we're doing for subpage support.
This reduces 24 bytes from each scrub_stripe structure on x86_64
systems.
This will need a lot of macros converting direct bitmap/bit operations into
our scrub_stripe specific helpers, but all those helpers are very small
and can be inlined.
So overall the overhead shouldn't be that huge, and we save quite some
memory space.
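A minimal sketch of the aggregation idea (names illustrative; the real
helpers are generated by macros in scrub.c):

  #define SCRUB_MAX_SECTORS 16

  enum scrub_bitmaps {
          SCRUB_BITMAP_EXTENT,
          SCRUB_BITMAP_ERROR,
          SCRUB_BITMAP_IO_ERROR,
          SCRUB_BITMAP_CSUM_ERROR,
          SCRUB_BITMAP_META_ERROR,
          SCRUB_BITMAP_META_GEN_ERROR,
          SCRUB_NR_BITMAPS,
  };

  struct scrub_stripe {
          /* One shared bitmap instead of six mostly empty unsigned longs. */
          unsigned long bitmaps[BITS_TO_LONGS(SCRUB_NR_BITMAPS *
                                              SCRUB_MAX_SECTORS)];
  };

  #define scrub_bitmap_set(stripe, which, nr) \
          set_bit((which) * SCRUB_MAX_SECTORS + (nr), (stripe)->bitmaps)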
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the bytenr doesn't match for a metadata tree block, we will report
it as a csum error, which is incorrect; it should be reported as a
metadata error instead.
Fixes: a3ddbaebc7 ("btrfs: scrub: introduce a helper to verify one metadata block")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using list_entry() against the list's prev entry, use
list_last_entry(), which removes the need to know the last member is
accessed through the prev list pointer and the naming makes it easier
to reason about what we are doing.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of extracting each element by grabbing the list's first member in
a local list_head variable, then extracting the csum with list_entry() and
iterating with a while loop checking for list emptiness, use the iteration
helper list_for_each_entry_safe(). This also removes the need to delete
elements from the list with list_del() since the ordered extent is freed
immediately after.
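Sketch of the before/after pattern (struct and member names from the
ordered extent code, the rest simplified):

  /* Before: pop entries one by one until the list is empty. */
  while (!list_empty(&ordered->list)) {
          struct list_head *cur = ordered->list.next;
          struct btrfs_ordered_sum *sum;

          sum = list_entry(cur, struct btrfs_ordered_sum, list);
          list_del(&sum->list);
          kvfree(sum);
  }

  /* After: iterate and free directly; no list_del() needed since the
   * ordered extent is freed immediately after. */
  struct btrfs_ordered_sum *sum, *tmp;

  list_for_each_entry_safe(sum, tmp, &ordered->list, list)
          kvfree(sum);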
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of grabbing the next pointer from the list and then doing a
list_entry() call, we can simply use list_first_entry(), removing the need
for a list_head variable.
Also there's no need to check if the list is empty before attempting to
extract the first element, we can use list_first_entry_or_null(), removing
the need for a special if statement and the 'out' label.
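Sketch of the pattern (names illustrative):

  /* Before: */
  if (list_empty(&trans->dirty_bgs))
          goto out;
  cur = trans->dirty_bgs.next;
  bg = list_entry(cur, struct btrfs_block_group, dirty_list);

  /* After: one call, no special case and no 'out' label. */
  bg = list_first_entry_or_null(&trans->dirty_bgs,
                                struct btrfs_block_group, dirty_list);
  if (!bg)
          return;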
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of grabbing the next pointer from the list and then doing a
list_entry() call, we can simply use list_first_entry(), removing the need
for a list_head variable.
Also there's no need to check if the list is empty before attempting to
extract the first element, we can use list_first_entry_or_null(), removing
the need for a special if statement and the 'out' label.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using list_entry() against the list's prev entry, use
list_last_entry(), which removes the need to know the last member is
accessed through the prev list pointer and the naming makes it easier
to reason about what we are doing.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to keep a local variable to extract the first member of
the list and then do a list_entry() call, we can use list_first_entry()
instead, removing the need for the temporary variable and extracting the
first element in a single step.
Also, there's no need to do a list_del_init() followed by list_add_tail(),
instead we can use list_move_tail(). We are in transaction commit critical
section where we don't need to worry about concurrency and that's why we
don't take any locks and can use list_move_tail() (we do assert early at
commit_cowonly_roots() that we are in the critical section, that the
transaction's state is TRANS_STATE_COMMIT_DOING).
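Both changes, sketched (based on the description; details may differ
from the actual patch):

  /* Before: */
  struct list_head *next = fs_info->dirty_cowonly_roots.next;
  struct btrfs_root *root = list_entry(next, struct btrfs_root, dirty_list);

  list_del_init(&root->dirty_list);
  /* ... */
  list_add_tail(&root->dirty_list, &trans->transaction->switch_commits);

  /* After: */
  struct btrfs_root *root = list_first_entry(&fs_info->dirty_cowonly_roots,
                                             struct btrfs_root, dirty_list);

  /* ... */
  list_move_tail(&root->dirty_list, &trans->transaction->switch_commits);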
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of detecting if there is a previous transaction by comparing the
current transaction's list prev member to the head of the transaction
list (fs_info->trans_list), use the list_is_first() helper which contains
that logic and the naming makes sense since a new transaction is always
added to the end of the list fs_info->trans_list with list_add_tail().
We are also extracting the previous transaction with list_last_entry()
against the transaction, which is correct but confusing because that
function is usually meant to be used against a pointer to the start of a
list and not a member of a list. It is easier to reason by either calling
list_first_entry() against the list fs_info->trans_list, since we can
never have more than two transactions in the list, or by calling
list_prev_entry() against the transaction. So change that to use the
latter method.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of detecting if there is a previous transaction by comparing the
current transaction's list prev member to the head of the transaction
list (fs_info->trans_list), use the list_is_first() helper which contains
that logic and the naming makes sense since a new transaction is always
added to the end of the list fs_info->trans_list with list_add_tail().
And instead of extracting the previous transaction with the more generic
list_entry() helper against the current transaction's list prev member,
use the more specific list_prev_entry() helper, which makes it clear what
we are doing and is shorter.
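Sketch of the pattern:

  /* Before: */
  if (cur_trans->list.prev != &fs_info->trans_list) {
          prev_trans = list_entry(cur_trans->list.prev,
                                  struct btrfs_transaction, list);
          /* ... */
  }

  /* After: the helpers say what we mean. */
  if (!list_is_first(&cur_trans->list, &fs_info->trans_list)) {
          prev_trans = list_prev_entry(cur_trans, list);
          /* ... */
  }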
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Transaction aborts should be done next to the place the error happens,
which was not done in add_to_free_space_tree().
Signed-off-by: David Sterba <dsterba@suse.com>
Transaction aborts should be done next to the place the error happens,
which was not done in remove_from_free_space_tree().
Signed-off-by: David Sterba <dsterba@suse.com>
Transaction aborts should be done next to the place the error happens,
which was not done in convert_free_space_to_extents(). The DEBUG_WARN()
is removed because we get the abort message.
Signed-off-by: David Sterba <dsterba@suse.com>
Transaction aborts should be done next to the place the error happens,
which was not done in convert_free_space_to_bitmaps(). The DEBUG_WARN()
is removed because we get the abort message.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the following members of scrub_stripe are only utilized for
error reporting:
- init_error_bitmap
- init_nr_io_errors
- init_nr_csum_errors
- init_nr_meta_errors
- init_nr_meta_gen_errors
There is no need to put all those members into scrub_stripe; they take
24 bytes for each stripe, and we have 128 stripes for each device.
Instead introduce a structure, scrub_error_records, and move all above
members into that structure.
And allocate such a structure on the stack inside
scrub_stripe_read_repair_worker().
Since that function is called from a workqueue context, we have more
than enough stack space for just 24 bytes.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Since the migration to the new scrub_stripe interface, scrub no longer
updates the device stats when hitting an error, no matter if it's a read
or checksum mismatch error. E.g.:
BTRFS info (device dm-2): scrub: started on devid 1
BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
BTRFS info (device dm-2): scrub: finished on devid 1 with status: 0
Note there is no line showing the device stats error update.
[CAUSE]
In the migration to the new scrub_stripe interface, we no longer call
btrfs_dev_stat_inc_and_print().
[FIX]
- Introduce a new bitmap for metadata generation errors
* A new bitmap
@meta_gen_error_bitmap is introduced to record which blocks have
metadata generation mismatch errors.
* A new counter for that bitmap
@init_nr_meta_gen_errors, is also introduced to store the number of
generation mismatch errors that are found during the initial read.
This is for the error reporting at scrub_stripe_report_errors().
* New dedicated error message for unrepaired generation mismatches
* Update @meta_gen_error_bitmap if a transid mismatch is hit
- Add btrfs_dev_stat_inc_and_print() calls to the following call sites
* scrub_stripe_report_errors()
* scrub_write_endio()
This is only for the write errors.
This means there is a minor behavior change:
- The timing of device stats error messages
Since we concentrate the error messages at
scrub_stripe_report_errors(), the device stats error messages will all
show up in one go, after the detailed scrub error messages:
BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Modify btrfs_async_{data,metadata}_reclaim() to run the reclaim process
on the sub-spaces as well.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We need to add a dedicated block_rsv for tree-log, because the block_rsv
serves for a tree node allocation in btrfs_alloc_tree_block(). Currently,
tree-log tree uses fs_info->empty_block_rsv, which is shared across trees
and points to the normal metadata space_info. Instead, we add a dedicated
block_rsv and that block_rsv can use the dedicated sub-space_info.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have a data sub-space for the zoned mode, tweak some
space_info functions to use the proper space_info for a file.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make the extent allocator and the chunk allocator aware of the sub-space.
It now uses the BTRFS_SUB_GROUP_DATA_RELOC sub-space for the data
relocation block group, and BTRFS_SUB_GROUP_TREELOG for the metadata
tree-log block group.
It also needs to check that the space_info is the right one when a block
group candidate is given, and a new block group should now belong to the
specified one.
Now that block_group->space_info is always set before
btrfs_add_bg_to_space_info(), we no longer need to "find" the space_info.
So, rename the variable to reflect that as well.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce the tree-log sub-space_info, which is a sub-space of the
metadata space_info dedicated to tree-log node allocation.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current code assumes we have only one space_info for each block group type
(DATA, METADATA, and SYSTEM). We sometimes need multiple space infos to
manage special block groups.
One example is handling the data relocation block group for the zoned mode.
That block group is dedicated for writing relocated data and we cannot
allocate any regular extent from that block group, which is implemented in
the zoned extent allocator. This block group still belongs to the normal
data space_info. So, when all the normal data block groups are full and
there is some free space in the dedicated block group, the space_info
appears to have some free space, while it cannot allocate a normal extent
anymore. That results in a strange ENOSPC error. We need to have a
space_info for the relocation data block group to represent the situation
properly.
Add basic infrastructure for having a "sub-group" of a space_info:
creation and removal. A sub-group space_info belongs to one of the
primary space_infos and has the same flags as its parent.
This commit first introduces the relocation data sub-space_info, and the
next commit will introduce the tree-log sub-space_info. In the future, it
could be useful to implement tiered storage for btrfs, e.g. by implementing
a sub-group space_info for block groups residing on fast storage.
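A sketch of what the infrastructure amounts to (field names illustrative,
the actual struct differs):

  #define BTRFS_SPACE_INFO_SUB_GROUP_MAX 1        /* illustrative */

  struct btrfs_space_info {
          u64 flags;      /* a sub-group has the same flags as its parent */
          /* NULL for a primary space_info, otherwise its parent. */
          struct btrfs_space_info *parent;
          /* Optional sub-groups, e.g. the data relocation one. */
          struct btrfs_space_info *sub_group[BTRFS_SPACE_INFO_SUB_GROUP_MAX];
          int subgroup_id;            /* e.g. BTRFS_SUB_GROUP_DATA_RELOC */
          /* ... the existing accounting members stay unchanged ... */
  };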
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a struct btrfs_space_info parameter to btrfs_make_block_group(), its
related functions and related structs. The new block group will belong to
the passed space_info.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Take a btrfs_space_info argument in btrfs_chunk_alloc(). The new block
group will belong to that space_info.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out check_removing_space_info() from btrfs_free_block_groups(). It
sanity checks a to-be-removed space_info. There is no functional change.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out the main part of btrfs_async_reclaim_data_space() to
do_async_reclaim_data_space(), so it can take the data space_info
parameter it is working on. Do the same for metadata. There is no
functional change.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out initialization of the space_info struct, which is used in a
later patch. There is no functional change.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As with the last patch, pass struct btrfs_inode to the function and
let it distinguish which data space it is working on in a later patch.
There is no functional change with this commit.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass struct btrfs_space_info to btrfs_reserve_data_bytes() to allow
reserving the data from multiple data space_info candidates.
This is a preparation for the following commits and there is no functional
change.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_finish_extent_commit() we have this loop that keeps finding an
extent range to unpin in the transaction's pinned_extents io tree, caches
the extent state and then passes that cached extent state to
btrfs_clear_extent_dirty(), which will free that extent state since we
clear the only bit it can have set. So on each loop iteration we do a
full io tree search and the cached state is used only to avoid having
a tree search done by btrfs_clear_extent_dirty().
During the lifetime of a transaction we can pin many thousands of extents,
resulting in a large and deep rb tree that backs the io tree. For example,
for the following fs_mark run on a box with 12 cores:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount $DEV $MNT
OPTS="-S 0 -L 8 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
a histogram of the number of ranges (elements) in the pinned extents
io tree of a transaction was the following:
Count: 76
Range: 5440.000 - 51088.000; Mean: 27354.368; Median: 28312.000; Stddev: 9800.767
Percentiles: 90th: 40486.000; 95th: 43322.000; 99th: 51088.000
5440.000 - 6805.809: 1 ###
6805.809 - 10652.034: 1 ###
10652.034 - 13326.178: 3 ########
13326.178 - 16671.590: 8 ######################
16671.590 - 20856.773: 7 ####################
20856.773 - 26092.528: 13 ####################################
26092.528 - 32642.571: 19 #####################################################
32642.571 - 40836.818: 17 ###############################################
40836.818 - 51088.000: 7 ####################
We can improve on this by grabbing the next state before calling
btrfs_clear_extent_dirty(), avoiding a full tree search on the next
iteration, which always has O(log n) complexity, while grabbing the next
element (the rb_next() rbtree operation) is O(log n) in the worst case
too, but very often much cheaper, making it more efficient.
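The loop change, sketched (helper names from extent-io-tree.c, the rest
simplified):

  while (state) {
          /* Grab the neighbor before the clearing frees 'state'. */
          struct extent_state *next = next_state(state);

          unpin_extent_range(fs_info, state->start, state->end, true);
          /* Clearing the only bit that can be set frees 'state'. */
          btrfs_clear_extent_dirty(unpin, state->start, state->end, NULL);
          state = next;   /* no full tree search on the next iteration */
  }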
Here follow histograms for the execution times, in nanoseconds, of
btrfs_finish_extent_commit() before and after applying this patch and all
the other patches in the same patchset.
Before patchset:
Count: 32
Range: 3925691.000 - 269990635.000; Mean: 133814526.906; Median: 122758052.000; Stddev: 65776550.375
Percentiles: 90th: 228672087.000; 95th: 265187000.000; 99th: 269990635.000
3925691.000 - 5993208.660: 1 ####
5993208.660 - 75878537.656: 4 ##################
75878537.656 - 115840974.514: 12 #####################################################
115840974.514 - 176850157.761: 6 ###########################
176850157.761 - 269990635.000: 9 ########################################
After patchset:
Count: 32
Range: 1849393.000 - 231491064.000; Mean: 126978584.625; Median: 123732897.000; Stddev: 58007821.806
Percentiles: 90th: 203055491.000; 95th: 219952699.000; 99th: 231491064.000
1849393.000 - 2997642.092: 1 ####
2997642.092 - 88111637.071: 9 #####################################
88111637.071 - 142818264.414: 9 #####################################
142818264.414 - 231491064.000: 13 #####################################################
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to keep track of discarded (trimmed) bytes at
btrfs_finish_extent_commit() but we are declaring a local variable for
that and passing a reference to the btrfs_discard_extent() calls when we
are processing deleted block groups. So instead pass NULL to
btrfs_discard_extent() and remove that variable.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the final phase of a transaction commit, when unpinning extents at
btrfs_finish_extent_commit(), there's no need to BUG_ON() if we fail to
unpin an extent range. All that can happen is that we fail to return the
extent range to the in-memory free space cache, meaning no future space
allocations can reuse that extent range while the fs is mounted.
So instead return the error to the caller and make it abort the
transaction, so that the error is noticed and we prevent mysteriously
leaking space. We keep track of the first error we get while unpinning an
extent
range and keep trying to unpin all the following extent ranges, so that
we attempt to do all discards. The transaction abort will deal with all
resource cleanups.
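The error handling pattern, sketched (next_pinned_range() is an
illustrative stand-in for the loop's search):

  int ret = 0;

  while (next_pinned_range(&start, &end)) {
          int ret2 = unpin_extent_range(fs_info, start, end, true);

          if (ret2 && !ret)
                  ret = ret2;   /* remember the first error */
          /* keep going so that all discards are still attempted */
  }
  return ret;   /* the caller aborts the transaction on error */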
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When manipulating extent bits for an io tree range, we have this pattern:
if (prealloc)
btrfs_free_extent_state(prealloc);
but this is not needed nowadays since btrfs_free_extent_state() ignores
a NULL pointer argument, following the common pattern of kernel and btrfs
freeing functions, as well as libc and other user space libraries.
So remove the NULL checks, reducing source code and object size.
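After the change the call site is simply:

  btrfs_free_extent_state(prealloc);   /* NULL-safe, like kfree() */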
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When setting bits for an extent range (set_extent_bit()), if the current
extent state record starts after the target range, we always do a jump to
the 'search_again' label, which will cause us to do a full tree search for
the next state if the current state ends before the target range. Unless
we need to reschedule, we can just grab the next state and process it,
avoiding a full tree search, even if that next state is not contiguous, as
we'll allocate and insert a new prealloc state if needed.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When setting bits for an extent range, if we find an extent state with
its start offset greater than current start offset, we insert a new extent
state to cover the gap, with its end offset computed and stored in the
@this_end local variable, and after the insertion we update the current
start offset to @this_end + 1. However if the insert_state() call resulted
in an extent state merge then the end offset of the merged extent may be
greater than @this_end and if that's the case, since we jump to the
'search_again' label, we'll do a full tree search that will leave us in
the same extent state - this is harmless but wastes time by doing a
pointless tree search and extent state processing.
So improve on this by updating the current start offset to the end offset
of the inserted state plus 1. This also removes the use of the @this_end
variable and directly sets the value in the prealloc extent state to avoid
any confusion and misuse in the future.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to compare the current extent state's end offset to
(u64)-1 to check if we have the last possible record, nor to check, after
updating the start offset to the end offset of the current record plus
one, whether we are still inside the target range.
Instead we can simplify and exit if the current extent state ends at or
after the target range, and then remove the check for the (u64)-1 as well
as the check to see if the updated start offset (to last_end + 1) is still
inside the target range. Besides the simplification, this also avoids
searching for the next extent state record (through next_state()) when the
current extent state record ends at the same offset as our target range,
which is pointless and only wastes time iterating through the rb tree.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If split_state() returned an error we call extent_io_tree_panic() which
will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an
uncommon and exotic scenario, then we fallthrough and hit a use after free
when calling set_state_bits() since the extent state record which the
local variable 'prealloc' points to was freed by split_state().
So jump to the label 'out' after calling extent_io_tree_panic() and set
the 'prealloc' pointer to NULL since split_state() has already freed it
when it hit an error.
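The fixed error path, sketched (argument lists approximate):

  ret = split_state(tree, state, prealloc, end + 1);
  if (ret) {
          extent_io_tree_panic(tree, state, "split", ret);
          prealloc = NULL;   /* split_state() already freed it */
          goto out;          /* never reach set_state_bits() below */
  }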
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If insert_state() failed it returns an error pointer and we call
extent_io_tree_panic() which will trigger a BUG() call. However if
CONFIG_BUG is disabled, which is an uncommon and exotic scenario, then
we fallthrough and call cache_state() which will dereference the error
pointer, resulting in an invalid memory access.
So jump to the 'out' label after calling extent_io_tree_panic(); this
also makes the code clearer, besides dealing with the exotic scenario
where CONFIG_BUG is disabled.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to compare the current extent state's end offset to
(u64)-1 to check if we have the last possible record, nor to check, after
updating the start offset to the end offset of the current record plus
one, whether we are still inside the target range.
Instead we can simplify and exit if the current extent state ends at or
after the target range, and then remove the check for the (u64)-1 as well
as the check to see if the updated start offset (to last_end + 1) is still
inside the target range.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When converting bits for an extent range (btrfs_convert_extent_bit()), if
the current extent state record starts after the target range, we always
do a jump to the 'search_again' label, which will cause us to do a full
tree search for the next state if the current state ends before the target
range. Unless we need to reschedule, we can just grab the next state and
process it, avoiding a full tree search, even if that next state is not
contiguous, as we'll allocate and insert a new prealloc state if needed.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When converting bits for an extent range, if we find an extent state with
its start offset greater than current start offset, we insert a new extent
state to cover the gap, with its end offset computed and stored in the
@this_end local variable, and after the insertion we update the current
start offset to @this_end + 1. However if the insert_state() call resulted
in an extent state merge then the end offset of the merged extent may be
greater than @this_end and if that's the case, since we jump to the
'search_again' label, we'll do a full tree search that will leave us in
the same extent state - this is harmless but wastes time by doing a
pointless tree search and extent state processing.
So improve on this by updating the current start offset to the end offset
of the inserted state plus 1. This also removes the use of the @this_end
variable and directly sets the value in the prealloc extent state to avoid
any confusion and misuse in the future.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When clearing bits for a range in an io tree, at clear_state_bit(), we
always go search for the next node (through next_state() -> rb_next()) and
return it. However if the current extent state record ends at or after the
target range passed to btrfs_clear_extent_bit_changeset() or
btrfs_convert_extent_bit(), we are just wasting time finding that next
node since we won't use it in those functions.
Improve on this by skipping the next node search if the current node ends
at or after the target range.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If insert_state() failed it returns an error pointer and we call
extent_io_tree_panic() which will trigger a BUG() call. However if
CONFIG_BUG is disabled, which is an uncommon and exotic scenario, then
we fallthrough and call cache_state() which will dereference the error
pointer, resulting in an invalid memory access.
So jump to the 'out' label after calling extent_io_tree_panic(); this
also makes the code clearer, besides dealing with the exotic scenario
where CONFIG_BUG is disabled.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If split_state() returned an error we call extent_io_tree_panic() which
will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an
uncommon and exotic scenario, then we fallthrough and hit a use after free
when calling set_state_bits() since the extent state record which the
local variable 'prealloc' points to was freed by split_state().
So jump to the label 'out' after calling extent_io_tree_panic() and set
the 'prealloc' pointer to NULL since split_state() has already freed it
when it hit an error.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to check if split_state() returned an error twice, instead
unify into a single if statement after setting 'prealloc' to NULL, because
on error split_state() frees the 'prealloc' extent state record.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of checking for an end offset of (u64)-1 (U64_MAX) for the current
extent state's end, and then checking after updating the current start
offset if it's now beyond the range's end offset, we can simply stop if
the current extent state's end is greater than or equals to our range's
end offset. This helps remove one comparison under the 'next' label and
allows us to remove the if statement that checks if the start offset is
greater than the end offset under the 'search_again' label.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we find an extent state that starts before our range's start we
split it and jump into the 'search_again' label with our start offset
remaining the same, making us then go to the 'again' label and search
again for an extent state that contains the 'start' offset, and this
time it finds the same extent state but with its start offset set to
our range's start offset (due to the split). This is because we have
consumed the preallocated extent state record and we may need to split
again, and by jumping to 'again' we release the spinlock, allocate a new
prealloc state and restart the search.
However we may not need to restart and allocate a new extent state in
case we don't find extent states that cross our end offset, therefore
no need for further extent state splits, or we may be able to do an
atomic allocation (which is quick even if it fails). In these cases
it's a waste to restart the search.
So change the behaviour to do the restart only if we need to reschedule,
otherwise fall through - if we need to allocate an extent state for split
operations, we will try an atomic allocation and if that fails we will do
the restart as before.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several variables are defined as integers but used as booleans, and the
'delete' variable can be made const since it's not changed after being
declared. So change them to proper booleans and simplify setting their
value.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a couple error branches where we have an error stored in the 'err'
variable and then jump to the 'out' label, however we don't return that
error, we just return 0. Normally this is not a problem since those error
branches call extent_io_tree_panic() which triggers a BUG() call, however
it's possible to have rather exotic kernel config with CONFIG_BUG disabled
in which case the BUG() call does nothing and we fallthrough. So make sure
to return the error, not just to fix that exotic case but also to make the
code less confusing. While at it also rename the 'err' variable to 'ret'
since this is the style we prefer and use more widely.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If split_state() returned an error we call extent_io_tree_panic() which
will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an
uncommon and exotic scenario, then we fallthrough and hit a use after free
when calling clear_state_bit() since the extent state record which the
local variable 'prealloc' points to was freed by split_state().
So jump to the label 'out' after calling extent_io_tree_panic() and set
the 'prealloc' pointer to NULL since split_state() has already freed it
when it hit an error.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to check if split_state() returned an error twice, instead
unify into a single if statement after setting 'prealloc' to NULL, because
on error split_state() frees the 'prealloc' extent state record.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function was introduced by commit a512bbf855 ("Btrfs: superblock
duplication") at the beginning of btrfs.
It left a comment saying we'd need a special mount option to read all
super blocks, but that's never been implemented and there was no
need/request for it. The check/rescue tools are able to start from a
specific copy and use it as the primary eventually.
This means btrfs_read_dev_super() is always reading the first super
block, making all the code finding the latest super block unnecessary.
Just remove that function and replace all call sites with
btrfs_read_disk_super(bdev, 0, false).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have two functions to read a super block from a block device:
- btrfs_read_dev_one_super()
Exported from disk-io.c
- btrfs_read_disk_super()
Local to volumes.c
And they have some minor differences:
- btrfs_read_dev_one_super() uses @copy_num
Meanwhile btrfs_read_disk_super() relies on the physical and expected
bytenr passed from the caller.
The parameter list of btrfs_read_dev_one_super() is more user
friendly.
- btrfs_read_disk_super() makes sure the label is NUL terminated
We do not need two different functions doing the same job, so merge the
behavior into btrfs_read_disk_super() by:
- Remove btrfs_read_dev_one_super()
- Export btrfs_read_disk_super()
The name pairs with btrfs_release_disk_super() perfectly.
- Change the parameter list of btrfs_read_disk_super() to mimic
btrfs_read_dev_one_super()
All existing callers already calculate the physical address and
expected bytenr before calling btrfs_read_disk_super().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The `free_eb` label is used only once. Simplify by moving the code in place.
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we have this ugly back and forth with the btree writeback
where we find the folio, find the eb associated with that folio, and
then attempt to write it back. This results in two different paths for
subpage ebs and >= page size ebs.
Clean this up by adding our own infrastructure around looking up tagged
ebs and writing the ebs out directly. This allows us to unify the
subpage and >= pagesize IO paths, resulting in a much cleaner writeback
path for extent buffers.
I ran this through fsperf on a VM with 8 CPUs and 16GiB of RAM. I used
smallfiles100k, but reduced the files to 1k to make it run faster, the
results are as follows, with the statistically significant improvements
marked with *, there were no regressions. fsperf was run with -n 10 for
both runs, so the baseline is the average 10 runs and the test is the
average of 10 runs.
smallfiles100k results
metric baseline current stdev diff
================================================================================
avg_commit_ms 68.58 58.44 3.35 -14.79% *
commits 270.60 254.70 16.24 -5.88%
dev_read_iops 48 48 0 0.00%
dev_read_kbytes 1044 1044 0 0.00%
dev_write_iops 866117.90 850028.10 14292.20 -1.86%
dev_write_kbytes 10939976.40 10605701.20 351330.32 -3.06%
elapsed 49.30 33 1.64 -33.06% *
end_state_mount_ns 41251498.80 35773220.70 2531205.32 -13.28% *
end_state_umount_ns 1.90e+09 1.50e+09 14186226.85 -21.38% *
max_commit_ms 139 111.60 9.72 -19.71% *
sys_cpu 4.90 3.86 0.88 -21.29%
write_bw_bytes 42935768.20 64318451.10 1609415.05 49.80% *
write_clat_ns_mean 366431.69 243202.60 14161.98 -33.63% *
write_clat_ns_p50 49203.20 20992 264.40 -57.34% *
write_clat_ns_p99 827392 653721.60 65904.74 -20.99% *
write_io_kbytes 2035940 2035940 0 0.00%
write_iops 10482.37 15702.75 392.92 49.80% *
write_lat_ns_max 1.01e+08 90516129 3910102.06 -10.29% *
write_lat_ns_mean 366556.19 243308.48 14154.51 -33.62% *
As you can see we get about a 33% decrease in runtime, with a 50%
throughput increase, which is pretty significant.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation for changing how we do writeout of extent buffers, start
tagging the extent buffer xarray with DIRTY and WRITEBACK to make it
easier to find extent buffers that are in either state.
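Sketch of what the tagging looks like (field name and locking
simplified):

  /* When an eb is dirtied: */
  xa_lock_irqsave(&fs_info->buffer_tree, flags);
  __xa_set_mark(&fs_info->buffer_tree, index, PAGECACHE_TAG_DIRTY);
  xa_unlock_irqrestore(&fs_info->buffer_tree, flags);

  /* Writeback can then visit only the marked entries, e.g. with
   * xa_for_each_marked(&fs_info->buffer_tree, index, eb,
   *                    PAGECACHE_TAG_DIRTY). */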
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to fully utilize xarray tagging to improve writeback we need to
convert the buffer_radix to a proper xarray. This conversion is
relatively straightforward as the radix code uses the xarray underneath.
Using xarray directly allows for quite a lot less code.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use the "btrfs-" prefix for our workqueues; the discard one has an
underscore instead of a dash, so unify it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have only two chunk allocation policies right now and the
switch/cases don't handle an unknown one properly. The error is in the
impossible category (the policy is stored only in memory); we don't have
to BUG(), falling back to the regular policy should be safe.
Signed-off-by: David Sterba <dsterba@suse.com>
Both the variable and the parameter are used as logical indicators so
convert them to bool.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Old code has a lot of int for bool return values, bool is recommended
and done in new code. Convert the trivial cases that do simple 0/false
and 1/true. Function comments are updated if needed.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When btrfs subpage support (fs block < page size) was introduced, a
subpage filesystem would only reject tree blocks which cross page
boundaries.
This used to be a compromise to simplify the tree block handling while
still allowing subpage cases to read some old converted filesystems
which did not have proper chunk alignment.
But in practice, suppose we have the following unaligned tree block on a
64K page sized system:
0 32K 44K 60K 64K
| |///////////////| |
Although btrfs has no problem reading the tree block at [44K, 60K), if
the extent allocator is allocating another tree block, it may choose the
range [60K, 74K), as the extent allocator has no awareness of whether
it's a subpage metadata request or not.
Then we'd get -EINVAL from the following sequence:
btrfs_alloc_tree_block()
|- btrfs_reserve_extent()
| Which returned range [60K, 74K)
|- btrfs_init_new_buffer()
|- btrfs_find_create_tree_block()
|- alloc_extent_buffer()
|- check_eb_alignment()
Which returned -EINVAL, because the range crosses page
boundary.
This situation will not fix itself and will mostly end up marking the
fs read-only.
Thankfully we didn't really get such reports in the real world because:
- The original unaligned tree blocks can only be caused by an older
  btrfs-convert
  They predate the btrfs-convert rework done in v4.6, before which a
  converted btrfs filesystem could have metadata block groups which were
  not aligned to the nodesize nor the stripe size (64K).
  But after btrfs-progs v4.6, all chunks allocated are stripe (64K)
  aligned, thus no more such problems.
Considering how old the fix is (v4.6 was released almost 10 years ago)
and that subpage support for btrfs was introduced in v5.15, it should be
safe to reject those unaligned tree blocks.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is just a trivial change. The code looks a bit more readable this way, IMO.
Move initialization of existing_folio to the beginning of the retry loop
so it's set to NULL at one place.
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Trivial renames to unify the naming of blk_status_t variables/parameters.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type blk_status_t is from the block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_submit_chunk().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're using 'status' for the blk_status_t variables, rename 'ret' so we can
use it for generic errors.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to have a separate variable to read the bio status, 'ret'
works for that just fine so remove 'error'.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're using 'status' for the blk_status_t variables, rename 'ret' so we
can use it for proper return type.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type blk_status_t is from the block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type blk_status_t is from the block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_bio_csum().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type blk_status_t is from the block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_bio_csum().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type blk_status_t is from the block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_submit_chunk().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using an unsigned type is a recommended practice (CWE-190, CWE-194) for
bit shifts to avoid problems with potential unwanted sign extensions.
Although there are no such cases in the btrfs codebase, follow the
recommendation.
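The hazard being guarded against, for illustration:

  int bit = 31;
  u64 bad  = 1 << bit;    /* '1' is a signed int: with common compilers the
                             shift yields INT_MIN, which sign-extends to
                             0xffffffff80000000 when widened to u64 */
  u64 good = 1U << bit;   /* unsigned: 0x80000000 as expected */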
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
First added (but not effectively used) in 02c372e1f0 ("btrfs: add
support for inserting raid stripe extents"). The structure is
initialized to zeros, so at its only use in btrfs_insert_one_raid_extent():

  u64 length = bioc->stripes[i].length;
  struct btrfs_raid_stride *raid_stride = &stripe_extent->strides[i];

  if (length == 0)
          length = bioc->size;

the 'if' always happens.
Last use in 4016358e85 ("btrfs: remove unused variable length in
btrfs_insert_one_raid_extent()") was an obvious cleanup. It seems to be
safe to remove; raid-stripe-tree works without using it since 6.6.
This was found by tool https://github.com/jirislaby/clang-struct .
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using the helper makes it a bit more clear that we're accessing the
first list entry.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The use of ASSERT(0) may be useful for some cases but it's more like a
notice for developers. Assertions can be compiled in independently, so
convert it to a debugging helper.
The difference is that it's just a warning and will not end up in BUG().
The converted cases are in connection with proper error handling.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the conditional warning instead of typing the whole condition. An
optional message is printed where it seems clear what the problem could
be.
Conversion is left out in btree_csum_one_bio() because of the additional
condition.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a conditional WARN() wrapper that's enabled only in debug builds. It
should be used for unexpected conditions that should be noisy. Use it
instead of ASSERT(0). As it will not lead to BUG(), make sure that
continuing is still possible, e.g. the error is handled anyway.
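A minimal sketch of such a wrapper (the name DEBUG_WARN and the exact
definition are one possible implementation, not necessarily the in-tree
one):

    #ifdef CONFIG_BTRFS_DEBUG
    #define DEBUG_WARN(args...)     WARN(1, ## args)
    #else
    #define DEBUG_WARN(args...)     do {} while (0)
    #endif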
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The file volumes.c has about 40 assertions and half of them are suitable
for ASSERT() with additional data.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently ASSERT() prints the stringified condition without macro
expansion, so simple constants like BTRFS_MAX_METADATA_BLOCKSIZE remain
readable in the output.
There are expressions where we'd like to see the exact values but all we
get is something like:
assertion failed: em->start <= start && start < extent_map_end(em), in fs/btrfs/extent_map.c:613
It would be nice to be able to print any additional information to help
understand the problem. With some preprocessor magic and compile-time
optimizations we can enhance ASSERT to work like that as well:
ASSERT(value > limit, "value=%llu limit=%llu", value, limit);
with free-form printk arguments that will be part of the assertion
message.
Pros:
- helps debugging and understanding reported problems
- the optional format is verified at compile-time
Cons:
- increases the .ko size
- writing the assertion code is repetitive (condition, format, values)
- format and variable type must match (extra lookup)
- needs gcc 8.x and newer, otherwise it's the short format
Recommended use is for non-trivial expressions; basic ASSERT(value) can
still be used for pointers or sometimes integers.
The format has been slightly updated to also print the result of the
evaluation of the condition, appended to the stringified condition as
"condition :: <value>".
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit b28b1f0ce4 ("btrfs: delayed-ref: Introduce better documented
delayed ref structures") introduced BTRFS_REF_LAST, which can be used
for sanity checking, e.g. in switch/case or for loops.
In btrfs_ref_type() there is an assertion
ASSERT(ref->type == BTRFS_REF_DATA || ref->type == BTRFS_REF_METADATA);
to validate the values, so we don't need the terminating enum value.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This removes the last direct poke into bvec internals in btrfs.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of the old @page and @page_offset pair inside scrub, here we can
directly use the virtual address for a sector.
This has the following benefits:
- Simplified parameters
A single @kaddr replaces the @page + @page_offset pair.
- No more unnecessary kmap/kunmap calls
Since all pages utilized by scrub are allocated by scrub and no
highmem is allowed, we do not need any kmap/kunmap calls.
An ASSERT() inside the new scrub_stripe_get_kaddr() catches any
unexpected highmem page (see the sketch below).
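A rough sketch of the helper (field names are illustrative):

    static void *scrub_stripe_get_kaddr(struct scrub_stripe *stripe,
                                        int sector_nr)
    {
            u32 offset = sector_nr << stripe->fs_info->sectorsize_bits;
            struct page *page = stripe->pages[offset >> PAGE_SHIFT];

            /* All scrub pages are allocated by scrub, never from highmem. */
            ASSERT(!PageHighMem(page));
            return page_address(page) + offset_in_page(offset);
    }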
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using a @page + @pg_offset pair inside the sector_ptr
structure, use a single physical address.
This allows us to grab both the page and the in-page offset from a
single u64 value. We still need an extra bool value, @has_paddr, to
distinguish whether the sector is properly mapped (as a physical address
of 0 is valid).
This change doesn't change the size of structure sector_ptr, but reduces
the parameters of several functions.
Note: the original idea and patch are from Christoph Hellwig
(https://lore.kernel.org/linux-btrfs/20250409111055.3640328-7-hch@lst.de/)
but the final implementation is different.
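A sketch of the idea (the accessor names are illustrative):

    struct sector_ptr {
            phys_addr_t paddr;
            bool has_paddr;         /* a physical address of 0 is valid */
            /* ... other members unchanged ... */
    };

    static struct page *sector_page(const struct sector_ptr *sector)
    {
            return pfn_to_page(sector->paddr >> PAGE_SHIFT);
    }

    static unsigned int sector_offset(const struct sector_ptr *sector)
    {
            return offset_in_page(sector->paddr);
    }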
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[ Use physical addresses instead to handle highmem. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Flatten the two loops by open coding bio_for_each_segment() and advancing
the iterator one sector at a time.
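A sketch of the flattened iteration (variable names are illustrative):

    struct bvec_iter iter = bio->bi_iter;

    while (iter.bi_size) {
            struct bio_vec bv = bio_iter_iovec(bio, iter);

            /* Verify or repair exactly one sector described by bv. */
            bio_advance_iter_single(bio, &iter, sectorsize);
    }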
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Fix a bug that @offset is not increased. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move kmapping the page out of btrfs_check_sector_csum().
This allows using bvec_kmap_local() where suitable and reduces the number
of kmap*() calls in the raid56 code.
This also means btrfs_check_sector_csum() will only accept a properly
kmapped address.
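A sketch of a caller after the change (the surrounding names are
illustrative):

    void *kaddr = bvec_kmap_local(&bv);
    int ret = btrfs_check_sector_csum(fs_info, kaddr, csum, csum_expected);

    kunmap_local(kaddr);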
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using physical address has the following advantages:
- All involved callers only need a single pointer
Instead of the old @folio + @offset pair.
- No complex poking into the bio_vec structure
As a bio_vec can be single- or multi-paged, grabbing the real page
can be quite complex for the multi-page case.
Instead, bvec_phys() will always give a single physical address, which
can be easily converted to a page (see the sketch below).
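A sketch of that conversion:

    phys_addr_t paddr = bvec_phys(&bvec);
    struct page *page = pfn_to_page(PHYS_PFN(paddr));
    unsigned int off = offset_in_page(paddr);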
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bio implementation is not something we should really mess around
with, and we shouldn't recalculate the pos from the folio over and over.
Instead just track the end of the current bio in logical file offsets
in the btrfs_bio_ctrl, which is much simpler and easier to read.
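A sketch of the idea (the field name is an assumption based on the
description above):

    struct btrfs_bio_ctrl {
            /* ... */
            u64 next_file_offset;   /* file offset right after the current bio */
    };

    /* A new sector is contiguous iff it starts where the bio ends. */
    if (file_offset != bio_ctrl->next_file_offset)
            submit_one_bio(bio_ctrl);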
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
end_bbio_data_read() checks that each iterated folio fragment is aligned
and justifies that with block drivers advancing the bio. But block
drivers only advance bi_iter, while end_bbio_data_read() uses
bio_for_each_folio_all() to iterate the immutable bi_io_vec array that
can't be changed by drivers at all.
Furthermore, btrfs has already done the alignment check of the file
offset inside submit_one_sector(), and the size is fixed to the fs block
size, so there is no need to redo the alignment check inside the endio
function.
So just remove the unnecessary alignment check along with the incorrect
comment.
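For illustration, the endio side iterates the immutable vector, so a
driver advancing bi_iter cannot affect it:

    struct folio_iter fi;

    bio_for_each_folio_all(fi, &bbio->bio) {
            /*
             * fi.folio, fi.offset and fi.length describe one fragment;
             * drivers only advance bi_iter, bi_io_vec stays intact.
             */
    }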
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment mistakenly says the function is returning PTR_ERR instead of
ERR_PTR. Fix it and update it so it's more descriptive.
Signed-off-by: Charles Han <hanchunchao@inspur.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ Enhance the function comment. ]
Signed-off-by: David Sterba <dsterba@suse.com>
Make this simpler by returning directly when there's no other cleanup
needed.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>