mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

1853 Commits

Author SHA1 Message Date
Linus Torvalds
a578dd095d CRC updates for 6.17
Updates for the kernel's CRC (cyclic redundancy check) code:
 
  - Reorganize the architecture-optimized CRC code. It now lives in
    lib/crc/$(SRCARCH)/ rather than arch/$(SRCARCH)/lib/, and it is no
    longer artificially split into separate generic and arch modules.
    This allows better inlining and dead code elimination. The generic
    CRC code is also no longer exported, simplifying the API. (This
    mirrors the similar changes to SHA-1 and SHA-2 in lib/crypto/,
    which can be found in the "Crypto library updates" pull request.)
 
  - Improve crc32c() performance on newer x86_64 CPUs on long messages
    by enabling the VPCLMULQDQ optimized code.
 
  - Simplify the crypto_shash wrappers for crc32_le() and crc32c().
    Register just one shash algorithm for each that uses the (fully
    optimized) library functions, instead of unnecessarily providing
    direct access to the generic CRC code.
 
  - Remove unused and obsolete drivers for hardware CRC engines.
 
  - Remove CRC-32 combination functions that are no longer used.
 
  - Add kerneldoc for crc32_le(), crc32_be(), and crc32c().
 
  - Convert the crc32() macro to an inline function.

Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull CRC updates from Eric Biggers:

 - Reorganize the architecture-optimized CRC code

   It now lives in lib/crc/$(SRCARCH)/ rather than arch/$(SRCARCH)/lib/,
   and it is no longer artificially split into separate generic and arch
   modules. This allows better inlining and dead code elimination

   The generic CRC code is also no longer exported, simplifying the API.
   (This mirrors the similar changes to SHA-1 and SHA-2 in lib/crypto/,
   which can be found in the "Crypto library updates" pull request)

 - Improve crc32c() performance on newer x86_64 CPUs on long messages by
   enabling the VPCLMULQDQ optimized code

 - Simplify the crypto_shash wrappers for crc32_le() and crc32c()

   Register just one shash algorithm for each that uses the (fully
   optimized) library functions, instead of unnecessarily providing
   direct access to the generic CRC code

 - Remove unused and obsolete drivers for hardware CRC engines

 - Remove CRC-32 combination functions that are no longer used

 - Add kerneldoc for crc32_le(), crc32_be(), and crc32c()

 - Convert the crc32() macro to an inline function
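For readers unfamiliar with the checksum itself, here is a minimal bitwise reference sketch of CRC-32C (Castagnoli polynomial). It is a slow, portable model and not the kernel's lib/crc code; the function names are made up for illustration, and the all-ones seed plus final inversion is the common standalone convention, which this pull request does not specify.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected


def crc32c_update(crc: int, data: bytes) -> int:
    """Feed bytes into a raw CRC-32C state, one bit at a time."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc & 0xFFFFFFFF


def crc32c_checksum(data: bytes) -> int:
    """Standalone CRC-32C: all-ones seed, final bit inversion."""
    return crc32c_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF
```

Optimized implementations (table-driven, or carryless-multiply based such as the VPCLMULQDQ code mentioned above) must produce the same values as a bitwise model like this one; only the speed differs.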

* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (26 commits)
  lib/crc: x86/crc32c: Enable VPCLMULQDQ optimization where beneficial
  lib/crc: x86: Reorganize crc-pclmul static_call initialization
  lib/crc: crc64: Add include/linux/crc64.h to kernel-api.rst
  lib/crc: crc32: Change crc32() from macro to inline function and remove cast
  nvmem: layouts: Switch from crc32() to crc32_le()
  lib/crc: crc32: Document crc32_le(), crc32_be(), and crc32c()
  lib/crc: Explicitly include <linux/export.h>
  lib/crc: Remove ARCH_HAS_* kconfig symbols
  lib/crc: x86: Migrate optimized CRC code into lib/crc/
  lib/crc: sparc: Migrate optimized CRC code into lib/crc/
  lib/crc: s390: Migrate optimized CRC code into lib/crc/
  lib/crc: riscv: Migrate optimized CRC code into lib/crc/
  lib/crc: powerpc: Migrate optimized CRC code into lib/crc/
  lib/crc: mips: Migrate optimized CRC code into lib/crc/
  lib/crc: loongarch: Migrate optimized CRC code into lib/crc/
  lib/crc: arm64: Migrate optimized CRC code into lib/crc/
  lib/crc: arm: Migrate optimized CRC code into lib/crc/
  lib/crc: Prepare for arch-optimized code in subdirs of lib/crc/
  lib/crc: Move files into lib/crc/
  lib/crc32: Remove unused combination support
  ...
2025-07-28 17:43:29 -07:00
Filipe Manana
d6be378de0 btrfs: remove btrfs_clear_extent_bits()
It's just a simple wrapper around btrfs_clear_extent_bit() that passes a
NULL for its last argument (a cached extent state record), plus there is
no counterpart - we have a btrfs_set_extent_bit() but we do not have a
btrfs_set_extent_bits() (plural version). So just remove it and make all
callers use btrfs_clear_extent_bit() directly.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
Daniel Vacek
f2cb97ee96 btrfs: index buffer_tree using node size
So far we've been deriving the buffer tree index using the sector size.
But each extent buffer covers multiple sectors. This makes the buffer
tree rather sparse.

For example the typical and quite common configuration uses sector size
of 4KiB and node size of 16KiB. In this case it means the buffer tree is
using up to the maximum of 25% of its slots. Or in other words at least
75% of the tree slots are wasted as never used.

We can score significant memory savings on the required tree nodes by
indexing the tree using the node size instead. As a result far fewer
slots are wasted and the tree can now use up to all 100% of its slots
this way.

Note: This works even with unaligned tree blocks as we can still get
      unique index by doing eb->start >> nodesize_shift.
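The slot arithmetic above can be checked with a small userspace sketch. It only models the index computation (start >> shift) for the 4KiB/16KiB example; the helper names are hypothetical and nothing here is taken from the btrfs sources.

```python
SECTOR_SIZE = 4 * 1024   # 4KiB, from the example above
NODE_SIZE = 16 * 1024    # 16KiB, from the example above


def shift_for(size: int) -> int:
    """Return log2 of a power-of-two size."""
    assert size > 0 and size & (size - 1) == 0
    return size.bit_length() - 1


def tree_index(eb_start: int, unit_size: int) -> int:
    """Index of an extent buffer when the tree is keyed by unit_size."""
    return eb_start >> shift_for(unit_size)


def slot_usage(eb_starts, unit_size, node_size=NODE_SIZE):
    """Fraction of the index range covered by the buffers that is occupied."""
    indexes = {tree_index(start, unit_size) for start in eb_starts}
    span = (max(eb_starts) + node_size) >> shift_for(unit_size)
    return len(indexes) / span
```

For 64 back-to-back 16KiB buffers, slot_usage() gives 0.25 when indexing by sector size and 1.0 when indexing by node size. Buffers that start at an unaligned offset but do not overlap still map to distinct indexes, because their start addresses differ by at least one node size.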

Getting some stats from running fio write test, there is a bit of
variance.  The values presented in the table below are medians from 5
test runs.  The numbers are:

  - # of allocated ebs in the tree
  - # of leaf tree nodes
  - highest index in the tree (radix tree width):

ebs / leaves / index |   bare for-next    |      with fix
---------------------+--------------------+-------------------
post mount           |   16 /  11 / 10e5c |   16 /  10 / 4240
post test            | 5810 / 891 / 11cfc | 4420 / 252 / 473a
post rm              |  574 / 300 / 10ef0 |  540 / 163 / 46e9

In this case (10GiB filesystem) the height of the tree is still 3 levels
but the 4x width reduction is clearly visible as expected. But since the
tree is more dense we can see the 54-72% reduction of leaf nodes. That's
very close to ideal with this test. It means the tree is getting really
dense with this kind of workload.

Also, the fio results show no performance change.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:20 +02:00
Christoph Hellwig
9f43d0ff55 btrfs: call btrfs_close_devices() from ->kill_sb
Although btrfs is not yet implementing blk_holder_ops, there is a
requirement for proper blk_holder_ops:

- blkdev_put() must not be called under sb->s_umount
  The blkdev_put()/bdev_fput() must not be called under sb->s_umount to
  avoid lock order reversal with disk->open_mutex.
  This is for the proper blk_holder_ops callbacks.

  Currently we're fine because we call regular fput() which defers the
  blk holder reclaiming.

To prepare for the future of blk_holder_ops, move the
btrfs_close_devices() calls into btrfs_free_fs_info().

That will be called from kill_sb() callbacks, which are also called for
error handling during mount failures, or when there is already an existing
super block.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:06:19 +02:00
Filipe Manana
60127c29f1 btrfs: qgroup: remove no longer used fs_info->qgroup_ulist
It's not used anymore after commit 0913445082 ("btrfs: qgroup: use
qgroup_iterator in qgroup_convert_meta()"), so remove it.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:04:59 +02:00
Filipe Manana
fd00922abc btrfs: add btrfs prefix to is_fstree() and make it return bool
This is an exported function and therefore it should have a 'btrfs_'
prefix, to make it clear it's btrfs specific, avoid future name collisions
with code outside btrfs, and make its naming consistent with most other
btrfs exported functions.

So add a 'btrfs_' prefix to it and make it return bool instead of int,
since all we need is to return true or false.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:04 +02:00
David Sterba
44cac52341 btrfs: use our message helpers instead of pr_err/pr_warn/pr_info
Our message helpers accept NULL for the fs_info in contexts that do not
provide one, and still print the common header of the message. The pr_*
helpers should be used only for special reasons, like module loading,
device scanning or multi-line output (print-tree).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:04 +02:00
David Sterba
0fe04bf132 btrfs: switch RCU helper versions to btrfs_warn()
The RCU protection is now done in the plain helpers, so we can remove the
"_in_rcu" and "_rl_in_rcu" variants.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:38 +02:00
Johannes Thumshirn
694ce5e143 btrfs: zoned: reserve data_reloc block group on mount
Create a block group dedicated for data relocation on mount of a zoned
filesystem.

If there is already more than one empty DATA block group on mount, one
of them is picked for the data relocation block group, instead of a newly
created one.

This is done to ensure there is always space for performing garbage
collection and the filesystem is not hitting ENOSPC under heavy overwrite
workloads.

CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:31 +02:00
Eric Biggers
2c7528d36e btrfs: stop parsing crc32c driver name
To determine whether the crc32c implementation is "fast", use
crc32_optimizations() instead of parsing the crypto_shash driver name.
This keeps the code working as intended after the driver name is changed
by the next commit.

Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20250613183753.31864-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-06-30 09:31:56 -07:00
Qu Wenruo
547e836661 btrfs: handle csum tree error with rescue=ibadroots correctly
[BUG]
There is a syzbot-based reproducer that can crash the kernel, with the
following call trace (with some debug output added):

 DEBUG: rescue=ibadroots parsed
 BTRFS: device fsid 14d642db-7b15-43e4-81e6-4b8fac6a25f8 devid 1 transid 8 /dev/loop0 (7:0) scanned by repro (1010)
 BTRFS info (device loop0): first mount of filesystem 14d642db-7b15-43e4-81e6-4b8fac6a25f8
 BTRFS info (device loop0): using blake2b (blake2b-256-generic) checksum algorithm
 BTRFS info (device loop0): using free-space-tree
 BTRFS warning (device loop0): checksum verify failed on logical 5312512 mirror 1 wanted 0xb043382657aede36608fd3386d6b001692ff406164733d94e2d9a180412c6003 found 0x810ceb2bacb7f0f9eb2bf3b2b15c02af867cb35ad450898169f3b1f0bd818651 level 0
 DEBUG: read tree root path failed for tree csum, ret=-5
 BTRFS warning (device loop0): checksum verify failed on logical 5328896 mirror 1 wanted 0x51be4e8b303da58e6340226815b70e3a93592dac3f30dd510c7517454de8567a found 0x51be4e8b303da58e634022a315b70e3a93592dac3f30dd510c7517454de8567a level 0
 BTRFS warning (device loop0): checksum verify failed on logical 5292032 mirror 1 wanted 0x1924ccd683be9efc2fa98582ef58760e3848e9043db8649ee382681e220cdee4 found 0x0cb6184f6e8799d9f8cb335dccd1d1832da1071d12290dab3b85b587ecacca6e level 0
 process 'repro' launched './file2' with NULL argv: empty string added
 DEBUG: no csum root, idatacsums=0 ibadroots=134217728
 Oops: general protection fault, probably for non-canonical address 0xdffffc0000000041: 0000 [#1] SMP KASAN NOPTI
 KASAN: null-ptr-deref in range [0x0000000000000208-0x000000000000020f]
 CPU: 5 UID: 0 PID: 1010 Comm: repro Tainted: G           OE       6.15.0-custom+ #249 PREEMPT(full)
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
 RIP: 0010:btrfs_lookup_csum+0x93/0x3d0 [btrfs]
 Call Trace:
  <TASK>
  btrfs_lookup_bio_sums+0x47a/0xdf0 [btrfs]
  btrfs_submit_bbio+0x43e/0x1a80 [btrfs]
  submit_one_bio+0xde/0x160 [btrfs]
  btrfs_readahead+0x498/0x6a0 [btrfs]
  read_pages+0x1c3/0xb20
  page_cache_ra_order+0x4b5/0xc20
  filemap_get_pages+0x2d3/0x19e0
  filemap_read+0x314/0xde0
  __kernel_read+0x35b/0x900
  bprm_execve+0x62e/0x1140
  do_execveat_common.isra.0+0x3fc/0x520
  __x64_sys_execveat+0xdc/0x130
  do_syscall_64+0x54/0x1d0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 ---[ end trace 0000000000000000 ]---

[CAUSE]
Firstly the fs has a corrupted csum tree root, thus to mount the fs we
have to use the "ro,rescue=ibadroots" mount options.

Normally with that mount option, a bad csum tree root should set
BTRFS_FS_STATE_NO_DATA_CSUMS flag, so that any future data read will
ignore csum search.

But in this particular case, we have the following call trace that
caused a NULL csum root without setting BTRFS_FS_STATE_NO_DATA_CSUMS:

load_global_roots_objectid():

	while (1) {
		ret = btrfs_search_slot();
		/* Succeeded */
		btrfs_item_key_to_cpu()
		found = true;
		/* We found the root item for csum tree. */
		root = read_tree_root_path();
		if (IS_ERR(root)) {
			if (!btrfs_test_opt(fs_info, IGNOREBADROOTS))
				ret = PTR_ERR(root);
			/*
			 * Since we have rescue=ibadroots mount option,
			 * @ret is still 0.
			 */
			break;
		}
	}
	if (!found || ret) {
		/*
		 * @found is true, @ret is 0, error handling for csum
		 * tree is skipped.
		 */
	}

This means we completely skip setting BTRFS_FS_STATE_NO_DATA_CSUMS if
the csum tree is corrupted, which results in an unexpected csum lookup later.

[FIX]
If read_tree_root_path() fails, always set @ret to the error
number.

At the end of the function, we need @ret to determine if we need to
do the extra error handling for the csum tree.
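The control flow described in [CAUSE] and [FIX] can be modelled in a few lines. This is a toy userspace sketch, not the kernel function: load_csum_root(), its parameters and the returned tuple are hypothetical, and the final error-handling branch is a simplified guess at "the extra error handling for csum tree".

```python
EIO = 5  # errno value behind the "ret=-5" seen in the debug output


def load_csum_root(read_ok: bool, ignore_bad_roots: bool, fixed: bool):
    """Toy model; returns (ret, no_data_csums)."""
    found = True   # the search found the root item for the csum tree
    ret = 0
    root_is_err = not read_ok  # models IS_ERR(root)

    if root_is_err:
        # Before the fix, @ret was only set without rescue=ibadroots.
        # After the fix, @ret is always set to the error number.
        if fixed or not ignore_bad_roots:
            ret = -EIO

    no_data_csums = False
    if not found or ret:
        # Models the end-of-function handling for the csum tree,
        # i.e. setting BTRFS_FS_STATE_NO_DATA_CSUMS.
        no_data_csums = True
        if ignore_bad_roots:
            ret = 0  # the mount continues in rescue mode
    return ret, no_data_csums
```

With a failed read and rescue=ibadroots, the unfixed model returns success with no flag set (the state that led to the NULL csum root being used), while the fixed model returns success with the flag set.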

Fixes: abed4aaae4 ("btrfs: track the csum, extent, and free space trees in a rb tree")
Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com>
Reported-by: Longxing Li <coregee2000@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:21:06 +02:00
Filipe Manana
a26bf338cd btrfs: fix race between async reclaim worker and close_ctree()
Syzbot reported an assertion failure due to an attempt to add a delayed
iput after we have set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info
state:

  WARNING: CPU: 0 PID: 65 at fs/btrfs/inode.c:3420 btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420
  Modules linked in:
  CPU: 0 UID: 0 PID: 65 Comm: kworker/u8:4 Not tainted 6.15.0-next-20250530-syzkaller #0 PREEMPT(full)
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
  Workqueue: btrfs-endio-write btrfs_work_helper
  RIP: 0010:btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420
  Code: 4e ad 5d (...)
  RSP: 0018:ffffc9000213f780 EFLAGS: 00010293
  RAX: ffffffff83c635b7 RBX: ffff888058920000 RCX: ffff88801c769e00
  RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000
  RBP: 0000000000000001 R08: ffff888058921b67 R09: 1ffff1100b12436c
  R10: dffffc0000000000 R11: ffffed100b12436d R12: 0000000000000001
  R13: dffffc0000000000 R14: ffff88807d748000 R15: 0000000000000100
  FS:  0000000000000000(0000) GS:ffff888125c53000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00002000000bd038 CR3: 000000006a142000 CR4: 00000000003526f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   btrfs_put_ordered_extent+0x19f/0x470 fs/btrfs/ordered-data.c:635
   btrfs_finish_one_ordered+0x11d8/0x1b10 fs/btrfs/inode.c:3312
   btrfs_work_helper+0x399/0xc20 fs/btrfs/async-thread.c:312
   process_one_work kernel/workqueue.c:3238 [inline]
   process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321
   worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402
   kthread+0x70e/0x8a0 kernel/kthread.c:464
   ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
   </TASK>

This can happen due to a race with the async reclaim worker like this:

1) The async metadata reclaim worker enters shrink_delalloc(), which calls
   btrfs_start_delalloc_roots() with an nr_pages argument that has a value
   less than LONG_MAX, and that in turn enters start_delalloc_inodes(),
   which sets the local variable 'full_flush' to false because
   wbc->nr_to_write is less than LONG_MAX;

2) There it finds inode X in a root's delalloc list, grabs a reference for
   inode X (with igrab()), and triggers writeback for it with
   filemap_fdatawrite_wbc(), which creates an ordered extent for inode X;

3) The unmount sequence starts from another task, we enter close_ctree()
   and we flush the workqueue fs_info->endio_write_workers, which waits
   for the ordered extent for inode X to complete and when dropping the
   last reference of the ordered extent, with btrfs_put_ordered_extent(),
   when we call btrfs_add_delayed_iput() we don't add the inode to the
   list of delayed iputs because it has a refcount of 2, so we decrement
   it to 1 and return;

4) Shortly after at close_ctree() we call btrfs_run_delayed_iputs() which
   runs all delayed iputs, and then we set BTRFS_FS_STATE_NO_DELAYED_IPUT
   in the fs_info state;

5) The async reclaim worker, after calling filemap_fdatawrite_wbc(), now
   calls btrfs_add_delayed_iput() for inode X and there we trigger an
   assertion failure since the fs_info state has the flag
   BTRFS_FS_STATE_NO_DELAYED_IPUT set.

Fix this by setting BTRFS_FS_STATE_NO_DELAYED_IPUT only after we wait for
the async reclaim workers to finish, after we call cancel_work_sync() for
them at close_ctree(), and by running delayed iputs after waiting for the
reclaim workers to finish and before setting the bit.

This race was recently introduced by commit 19e60b2a95 ("btrfs: add
extra warning if delayed iput is added when it's not allowed"). Without
the new validation at btrfs_add_delayed_iput(), this described scenario
was safe because close_ctree() later calls btrfs_commit_super(). That
will run any final delayed iputs added by reclaim workers in the window
between the btrfs_run_delayed_iputs() and the reclaim workers being
shut down.
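The ordering problem can be illustrated with a sequential toy model of the one interleaving described above. It is a sketch only: the FsInfo class, its fields and the close_ctree() stand-in are hypothetical, and a real race involves concurrent tasks rather than a fixed call order.

```python
class FsInfo:
    """Minimal stand-in for the pieces of fs_info involved in the race."""

    def __init__(self):
        self.no_delayed_iput = False  # models BTRFS_FS_STATE_NO_DELAYED_IPUT
        self.delayed_iputs = []
        self.warnings = 0

    def add_delayed_iput(self, inode):
        if self.no_delayed_iput:
            self.warnings += 1  # models the check added by 19e60b2a95
        self.delayed_iputs.append(inode)

    def run_delayed_iputs(self):
        self.delayed_iputs.clear()


def reclaim_worker(fs):
    """Step 5: the worker drops its reference on inode X late."""
    fs.add_delayed_iput("inode X")


def close_ctree(fs, fixed):
    """Return the number of warnings triggered during unmount."""
    if fixed:
        reclaim_worker(fs)        # worker has finished (cancel_work_sync())
        fs.run_delayed_iputs()    # then run the delayed iputs
        fs.no_delayed_iput = True # and only then set the bit
    else:
        fs.run_delayed_iputs()
        fs.no_delayed_iput = True
        reclaim_worker(fs)        # worker still adds an iput afterwards
    return fs.warnings
```

In the unfixed order the late iput arrives after the bit is set and trips the check once; in the fixed order it is added and processed before the bit is set.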

Reported-by: syzbot+0ed30ad435bf6f5b7a42@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6840481c.a00a0220.d4325.000c.GAE@google.com/T/#u
Fixes: 19e60b2a95 ("btrfs: add extra warning if delayed iput is added when it's not allowed")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:20:57 +02:00
Leo Martins
186b9dc3c3 btrfs: warn if leaking delayed_nodes in btrfs_put_root()
Add a warning for leaked delayed_nodes when putting a root. We currently
do this for inodes, but not delayed_nodes.

Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Remove the changelog from the commit message. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:18:39 +02:00
Filipe Manana
4469e95fe5 btrfs: log error codes during failures when writing super blocks
When writing super blocks, at write_dev_supers(), we log an error message
when we get some error, but we don't show which error we got even though
we have that information. So enhance the error messages with the error codes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:57 +02:00
Naohiro Aota
635da7ea9a btrfs: add block reserve for treelog
We need to add a dedicated block_rsv for tree-log, because the block_rsv
serves for a tree node allocation in btrfs_alloc_tree_block(). Currently, the
tree-log tree uses fs_info->empty_block_rsv, which is shared across trees
and points to the normal metadata space_info. Instead, we add a dedicated
block_rsv and that block_rsv can use the dedicated sub-space_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:53 +02:00
Qu Wenruo
007fa63225 btrfs: get rid of btrfs_read_dev_super()
The function was introduced by commit a512bbf855 ("Btrfs: superblock
duplication") at the beginning of btrfs.

It left a comment saying we'd need a special mount option to read all
super blocks, but it's never been implemented and there was no
need/request for it. The check/rescue tools are able to start from a
specific copy and use it as primary eventually.

This means btrfs_read_dev_super() is always reading the first super
block, making all the code finding the latest super block unnecessary.

Just remove that function and replace all call sites with
btrfs_read_disk_super(bdev, 0, false).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Qu Wenruo
63f32b7b5d btrfs: merge btrfs_read_dev_one_super() into btrfs_read_disk_super()
We have two functions to read a super block from a block device:

- btrfs_read_dev_one_super()
  Exported from disk-io.c

- btrfs_read_disk_super()
  Local to volumes.c

And they have some minor differences:

- btrfs_read_dev_one_super() uses @copy_num
  Meanwhile btrfs_read_disk_super() relies on the physical and expected
  bytenr passed from the caller.

  The parameter list of btrfs_read_dev_one_super() is more user
  friendly.

- btrfs_read_disk_super() makes sure the label is NUL terminated

We do not need two different functions doing the same job, so merge the
behavior into btrfs_read_disk_super() by:

- Remove btrfs_read_dev_one_super()

- Export btrfs_read_disk_super()
  The name pairs with btrfs_release_disk_super() perfectly.

- Change the parameter list of btrfs_read_disk_super() to mimic
  btrfs_read_dev_one_super()
  All existing callers are calculating the physical address and expected
  bytenr before calling btrfs_read_disk_super() already.
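To make the @copy_num parameter concrete, the sketch below maps a copy number to the byte offset of that super block copy. The constants reflect the well-known btrfs super block locations (64KiB, 64MiB, 256GiB) and are not stated in this commit message; the Python function name is hypothetical.

```python
SZ_16K = 16 * 1024
BTRFS_SUPER_INFO_OFFSET = 64 * 1024  # primary super block at 64KiB
BTRFS_SUPER_MIRROR_SHIFT = 12
BTRFS_SUPER_MIRROR_MAX = 3


def sb_offset(copy_num: int) -> int:
    """Byte offset of super block copy @copy_num on a device."""
    if not 0 <= copy_num < BTRFS_SUPER_MIRROR_MAX:
        raise ValueError("copy_num out of range")
    if copy_num == 0:
        return BTRFS_SUPER_INFO_OFFSET
    return SZ_16K << (BTRFS_SUPER_MIRROR_SHIFT * copy_num)
```

A caller that only knows the copy number can derive the offset this way, which is why a @copy_num based parameter list is friendlier than passing precomputed byte offsets.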

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Josef Bacik
19d7f65f03 btrfs: convert the buffer_radix to an xarray
In order to fully utilize xarray tagging to improve writeback we need to
convert the buffer_radix to a proper xarray.  This conversion is
relatively straightforward as the radix code uses the xarray underneath.
Using xarray directly allows for quite a lot less code.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
David Sterba
656e9f51de btrfs: rename btrfs_discard workqueue to btrfs-discard
We use the "btrfs-" prefix for our workqueues, the discard has
underscore instead of dash, so unify it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
David Sterba
a24d185c36 btrfs: change return type of btree_csum_one_bio() to int
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_bio_csum().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
c0ee55f796 btrfs: merge __setup_root() to btrfs_alloc_root()
There's only one caller of __setup_root() so merge it there.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
2d44a15afd btrfs: use list_first_entry() everywhere
Using the helper makes it a bit more clear that we're accessing the
first list entry.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:47 +02:00
David Sterba
ed50ab0fec btrfs: convert WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)) to DEBUG_WARN
Use the conditional warning instead of typing the whole condition.
Optional message is printed where it seems clear what could be the
problem.

Conversion is left out in btree_csum_one_bio() because of the additional
condition.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:47 +02:00
Christoph Hellwig
3240b2c97b btrfs: pass a physical address to btrfs_repair_io_failure()
Using physical address has the following advantages:

- All involved callers only need a single pointer
  Instead of the old @folio + @offset pair.

- No complex poking into the bio_vec structure
  As a bio_vec can be single or multiple paged, grabbing the real page
  can be quite complex if the bio_vec is a multi-page one.

  Instead bvec_phys() will always give a single physical address, and it
  can be easily converted to a page.
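Why a single physical address is enough can be seen from the arithmetic alone. The sketch below assumes 4KiB pages and uses hypothetical helper names; it models page frame numbers as plain integers and is not the kernel API.

```python
PAGE_SHIFT = 12            # assumption: 4KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def folio_offset_to_phys(first_pfn: int, offset: int) -> int:
    """Physical address of byte @offset inside a folio starting at @first_pfn."""
    return (first_pfn << PAGE_SHIFT) + offset


def phys_to_page(paddr: int):
    """Split a physical address into (page frame number, offset in page)."""
    return paddr >> PAGE_SHIFT, paddr & (PAGE_SIZE - 1)
```

An offset that lands in the fourth page of a multi-page folio resolves directly to that page and the offset within it, without walking the bio_vec.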

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:46 +02:00
Filipe Manana
d846a6d3b0 btrfs: rename remaining exported extent map functions
Rename all the exported functions from extent_map.h that don't have a
'btrfs_' prefix in their names, so that they are consistent with all the
other functions, to make it clear they are btrfs specific functions and
to avoid potential name collisions in the future with functions defined
elsewhere in the kernel.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:45 +02:00
Filipe Manana
b351161f4f btrfs: rename free_extent_state() to include a btrfs prefix
This is an exported function so it should have a 'btrfs_' prefix by
convention, to make it clear it's btrfs specific and to avoid collisions
with functions from elsewhere in the kernel.

Rename the function to add 'btrfs_' prefix to it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:44 +02:00
Filipe Manana
e965835c98 btrfs: rename the functions to init and release an extent io tree
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.

So add a 'btrfs_' prefix to their name to make it clear they are from
btrfs.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:44 +02:00
Filipe Manana
66da9c1bed btrfs: rename the functions to search for bits in extent ranges
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.

So add a 'btrfs_' prefix to their name to make it clear they are from
btrfs.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:44 +02:00
Filipe Manana
9d222562b4 btrfs: rename the functions to clear bits for an extent range
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. One of them has a
double underscore prefix which is also discouraged.

So remove double underscore prefix where applicable and add a 'btrfs_'
prefix to their name to make it clear they are from btrfs.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:43 +02:00
Daniel Vacek
c61660ec34 btrfs: remove unused flag EXTENT_BUFFER_CORRUPT
This flag is no longer being used.  It was added by commit a826d6dcb3
("Btrfs: check items for correctness as we search") but it's no longer
being used after commit f26c923860 ("btrfs: remove reada
infrastructure").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:39 +02:00
Linus Torvalds
0cb9ce06a6 for-6.15-rc2-tag

Merge tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - handle encoded read ioctl returning EAGAIN so it does not mistakenly
   free the work structure

 - escape subvolume path in mount option list so it cannot be wrongly
   parsed when the path contains ","

 - remove folio size assertions when writing super block to device with
   enabled large folios

* tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: remove folio order ASSERT()s in super block writeback path
  btrfs: correctly escape subvol in btrfs_show_options()
  btrfs: ioctl: don't free iov when btrfs_encoded_read() returns -EAGAIN
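The subvolume path fix concerns option strings where fields are separated by ",". The sketch below shows why an unescaped "," breaks parsing and how octal escaping avoids it. The set of escaped characters and the function name are assumptions for illustration; this log does not say which characters btrfs escapes.

```python
def escape_option_value(value: str, special: str = ",= \t\n\\") -> str:
    """Replace each special character with a backslash-octal escape."""
    out = []
    for ch in value:
        if ch in special:
            out.append("\\%03o" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


def split_options(options: str):
    """Naive parser that splits a mount option list on commas."""
    return options.split(",")
```

With a subvolume named "/vol,ro", the unescaped list "rw,subvol=/vol,ro" splits into three fields and the tail looks like a separate "ro" option, while the escaped form stays at two fields.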
2025-04-17 09:17:57 -07:00
Linus Torvalds
d6b02199cd - The 7 patch series "powerpc/crash: use generic crashkernel
reservation" from Sourabh Jain changes powerpc's kexec code to use more
   of the generic layers.
 
 - The 2 patch series "get_maintainer: report subsystem status
   separately" from Vlastimil Babka makes some long-requested improvements
   to the get_maintainer output.
 
 - The 4 patch series "ucount: Simplify refcounting with rcuref_t" from
   Sebastian Siewior cleans up and optimizes the refcounting in the ucount
   code.
 
 - The 12 patch series "reboot: support runtime configuration of
   emergency hw_protection action" from Ahmad Fatoum improves the ability
   for a driver to perform an emergency system shutdown or reboot.
 
 - The 16 patch series "Converge on using secs_to_jiffies() part two"
   from Easwar Hariharan performs further migrations from
   msecs_to_jiffies() to secs_to_jiffies().
 
 - The 7 patch series "lib/interval_tree: add some test cases and
   cleanup" from Wei Yang permits more userspace testing of kernel library
   code, adds some more tests and performs some cleanups.
 
 - The 2 patch series "hung_task: Dump the blocking task stacktrace" from
   Masami Hiramatsu arranges for the hung_task detector to dump the stack
   of the blocking task and not just that of the blocked task.
 
 - The 4 patch series "resource: Split and use DEFINE_RES*() macros" from
   Andy Shevchenko provides some cleanups to the resource definition
   macros.
 
 - Plus the usual shower of singleton patches - please see the individual
   changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nuqwAKCRDdBJ7gKXxA
 jtNqAQDxqJpjWkzn4yN9CNSs1ivVx3fr6SqazlYCrt3u89WQvwEA1oRrGpETzUGq
 r6khQUIcQImPPcjFqEFpuiSOU0MBZA0=
 =Kii8
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - The series "powerpc/crash: use generic crashkernel reservation" from
   Sourabh Jain changes powerpc's kexec code to use more of the generic
   layers.

 - The series "get_maintainer: report subsystem status separately" from
   Vlastimil Babka makes some long-requested improvements to the
   get_maintainer output.

 - The series "ucount: Simplify refcounting with rcuref_t" from
   Sebastian Siewior cleans up and optimizes the refcounting in the
   ucount code.

 - The series "reboot: support runtime configuration of emergency
   hw_protection action" from Ahmad Fatoum improves the ability for a
   driver to perform an emergency system shutdown or reboot.

 - The series "Converge on using secs_to_jiffies() part two" from Easwar
   Hariharan performs further migrations from msecs_to_jiffies() to
   secs_to_jiffies().

 - The series "lib/interval_tree: add some test cases and cleanup" from
   Wei Yang permits more userspace testing of kernel library code, adds
   some more tests and performs some cleanups.

 - The series "hung_task: Dump the blocking task stacktrace" from Masami
   Hiramatsu arranges for the hung_task detector to dump the stack of
   the blocking task and not just that of the blocked task.

 - The series "resource: Split and use DEFINE_RES*() macros" from Andy
   Shevchenko provides some cleanups to the resource definition macros.

 - Plus the usual shower of singleton patches - please see the
   individual changelogs for details.

* tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
  mailmap: consolidate email addresses of Alexander Sverdlin
  fs/procfs: fix the comment above proc_pid_wchan()
  relay: use kasprintf() instead of fixed buffer formatting
  resource: replace open coded variant of DEFINE_RES()
  resource: replace open coded variants of DEFINE_RES_*_NAMED()
  resource: replace open coded variant of DEFINE_RES_NAMED_DESC()
  resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED()
  samples: add hung_task detector mutex blocking sample
  hung_task: show the blocker task if the task is hung on mutex
  kexec_core: accept unaccepted kexec segments' destination addresses
  watchdog/perf: optimize bytes copied and remove manual NUL-termination
  lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap()
  lib/interval_tree: skip the check before go to the right subtree
  lib/interval_tree: add test case for span iteration
  lib/interval_tree: add test case for interval_tree_iter_xxx() helpers
  lib/rbtree: add random seed
  lib/rbtree: split tests
  lib/rbtree: enable userland test suite for rbtree related data structure
  checkpatch: describe --min-conf-desc-length
  scripts/gdb/symbols: determine KASLR offset on s390
  ...
2025-04-01 10:06:52 -07:00
Qu Wenruo
65f2a3b232 btrfs: remove folio order ASSERT()s in super block writeback path
[BUG]
There is a syzbot report that the ASSERT() inside write_dev_supers() got
triggered:

  assertion failed: folio_order(folio) == 0, in fs/btrfs/disk-io.c:3858
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/disk-io.c:3858!
  Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
  CPU: 0 UID: 0 PID: 6730 Comm: syz-executor378 Not tainted 6.14.0-syzkaller-03565-gf6e0150b2003 #0 PREEMPT(full)
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  RIP: 0010:write_dev_supers fs/btrfs/disk-io.c:3858 [inline]
  RIP: 0010:write_all_supers+0x400f/0x4090 fs/btrfs/disk-io.c:4155
  Call Trace:
   <TASK>
   btrfs_commit_transaction+0x1eda/0x3750 fs/btrfs/transaction.c:2528
   btrfs_quota_enable+0xfcc/0x21a0 fs/btrfs/qgroup.c:1226
   btrfs_ioctl_quota_ctl+0x144/0x1c0 fs/btrfs/ioctl.c:3677
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:906 [inline]
   __se_sys_ioctl+0xf1/0x160 fs/ioctl.c:892
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f5ad1f20289
   </TASK>
  ---[ end trace 0000000000000000 ]---

[CAUSE]
Since commit f93ee0df51 ("btrfs: convert super block writes to folio
in write_dev_supers()") and commit c94b7349b8 ("btrfs: convert super
block writes to folio in wait_dev_supers()"), the super block writeback
path is converted to use folio.

Since the original code used page based interfaces, an
"ASSERT(folio_order(folio) == 0);" was added to make sure that behavior
did not change.

But the folio here is not from any btrfs inode, but from the block
device, and we have no control over the folio order in the bdev; the
device can choose whatever folio size it wants or needs.

E.g. the bdev may even have a block size of multiple pages.

So the ASSERT() is triggered.

[FIX]
The super block writeback path has taken larger folios into
consideration, so there is no need for the ASSERT().

And since commit bc00965dbf ("btrfs: count super block write errors in
device instead of tracking folio error state"), the wait path no longer
checks the folio status but only waits for the folio writeback to
finish, so nothing there requires the ASSERT() either.

So we can remove both ASSERT()s safely now.

Reported-by: syzbot+34122898a11ab689518a@syzkaller.appspotmail.com
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-01 01:02:42 +02:00
Mark Harmstone
9db9c7dd5b btrfs: don't clobber ret in btrfs_validate_super()
Commit 2a9bb78cfd ("btrfs: validate system chunk array at
btrfs_validate_super()") introduces a call to validate_sys_chunk_array()
in btrfs_validate_super(), which clobbers the value of ret set earlier.
This has the effect of negating the validity checks done earlier, making
it so btrfs could potentially try to mount invalid filesystems.
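
The clobbering pattern described above can be shown with a small
userspace sketch.  Everything in it is hypothetical (struct fake_super,
check_chunk_array(), validate_buggy(), validate_fixed()); it is not the
btrfs code, and the "fixed" variant is only one possible shape of a fix,
not necessarily what this commit does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a super block; not the btrfs on-disk format. */
struct fake_super {
	bool magic_ok;
	bool chunk_array_ok;
};

/* Hypothetical stand-in for a later check such as validate_sys_chunk_array(). */
static int check_chunk_array(const struct fake_super *sb)
{
	assert(sb);
	return sb->chunk_array_ok ? 0 : -EINVAL;
}

/* Buggy shape: the unconditional assignment overwrites an earlier failure. */
static int validate_buggy(const struct fake_super *sb)
{
	int ret = 0;

	if (!sb->magic_ok)
		ret = -EINVAL;		/* an earlier check records a failure... */

	ret = check_chunk_array(sb);	/* ...and this assignment discards it */
	return ret;
}

/* One possible fix: only touch ret when the new check itself fails. */
static int validate_fixed(const struct fake_super *sb)
{
	int ret = 0;

	if (!sb->magic_ok)
		ret = -EINVAL;

	if (check_chunk_array(sb))
		ret = -EINVAL;
	return ret;
}
```

With a super block whose earlier check fails but whose chunk array is
fine, validate_buggy() returns 0 while validate_fixed() still reports
the error.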

Fixes: 2a9bb78cfd ("btrfs: validate system chunk array at btrfs_validate_super()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:54 +01:00
Qu Wenruo
19e60b2a95 btrfs: add extra warning if delayed iput is added when it's not allowed
Since I have triggered the ASSERT() on the delayed iput too many times,
now is the time to add some extra debug warnings for delayed iput.

All delayed iputs should be queued after all ordered extents finish
their IO and all involved workqueues are flushed.

Thus after the btrfs_run_delayed_iputs() inside close_ctree(), there
should be no more delayed iputs added.

So introduce a new BTRFS_FS_STATE_NO_DELAYED_IPUT flag, set after that
point.  Every btrfs_add_delayed_iput() call checks the flag and triggers
a WARN_ON_ONCE() if it is set.
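
As a rough userspace illustration of the idea (a state flag that is set
once the final run has happened, and a warn-once check on every later
add), assuming hypothetical names struct fake_fs, add_delayed_iput() and
run_delayed_iputs_final() that only imitate the kernel behavior:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the relevant fs_info state. */
struct fake_fs {
	bool no_delayed_iput;	/* models BTRFS_FS_STATE_NO_DELAYED_IPUT */
	bool warned;		/* makes the warning fire only once */
	int nr_delayed_iputs;
	int nr_warnings;
};

/* Imitates btrfs_add_delayed_iput() with a WARN_ON_ONCE()-like check. */
static void add_delayed_iput(struct fake_fs *fs)
{
	assert(fs);
	if (fs->no_delayed_iput && !fs->warned) {
		fs->warned = true;
		fs->nr_warnings++;
		fprintf(stderr, "WARNING: delayed iput added after the final run\n");
	}
	fs->nr_delayed_iputs++;
}

/* Imitates the last run of delayed iputs during unmount. */
static void run_delayed_iputs_final(struct fake_fs *fs)
{
	assert(fs);
	fs->nr_delayed_iputs = 0;
	fs->no_delayed_iput = true;	/* from now on, additions are a bug */
}
```

Adding an iput before the final run is silent; any number of additions
afterwards produce exactly one warning, which keeps the log readable
while still pointing at the offending caller.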

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:52 +01:00
Qu Wenruo
df94a342ef btrfs: run btrfs_error_commit_super() early
[BUG]
Even after all the error fixes related to the
"ASSERT(list_empty(&fs_info->delayed_iputs));" in close_ctree(), I can
still hit it reliably with my experimental 2K block size.

[CAUSE]
In my case, all the errors are triggered after the fs is already in an
error state.

I find the following call trace to be the cause of the race:

           Main thread                       |     endio_write_workers
---------------------------------------------+---------------------------
close_ctree()                                |
|- btrfs_error_commit_super()                |
|  |- btrfs_cleanup_transaction()            |
|  |  |- btrfs_destroy_all_ordered_extents() |
|  |     |- btrfs_wait_ordered_roots()       |
|  |- btrfs_run_delayed_iputs()              |
|                                            | btrfs_finish_ordered_io()
|                                            | |- btrfs_put_ordered_extent()
|                                            |    |- btrfs_add_delayed_iput()
|- ASSERT(list_empty(delayed_iputs))         |
   !!! Triggered !!!

The root cause is that btrfs_wait_ordered_roots() only waits for
ordered extents to finish their IOs, not for them to finish and be
removed.

[FIX]
Since btrfs_error_commit_super() will flush and wait for all ordered
extents, it should be executed early, before we start flushing the
workqueues.

And since btrfs_error_commit_super() now runs early, there is no need to
run btrfs_run_delayed_iputs() inside it, so just remove the
btrfs_run_delayed_iputs() call from btrfs_error_commit_super().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
cda76788f8 btrfs: fix non-empty delayed iputs list on unmount due to async workers
At close_ctree() after we have run delayed iputs either explicitly through
calling btrfs_run_delayed_iputs() or later during the call to
btrfs_commit_super() or btrfs_error_commit_super(), we assert that the
delayed iputs list is empty.

We have (another) race where this assertion might fail because we have
queued an async write into the fs_info->workers workqueue. Here's how it
happens:

1) We are submitting a data bio for an inode that is not the data
   relocation inode, so we call btrfs_wq_submit_bio();

2) btrfs_wq_submit_bio() submits a work for the fs_info->workers queue
   that will run run_one_async_done();

3) We enter close_ctree(), flush several work queues except
   fs_info->workers, explicitly run delayed iputs with a call to
   btrfs_run_delayed_iputs() and then again shortly after by calling
   btrfs_commit_super() or btrfs_error_commit_super(), which also run
   delayed iputs;

4) run_one_async_done() is executed in the work queue, and because there
   was an IO error (bio->bi_status is not 0) it calls btrfs_bio_end_io(),
   which drops the final reference on the associated ordered extent by
   calling btrfs_put_ordered_extent() - and that adds a delayed iput for
   the inode;

5) At close_ctree() we find that after stopping the cleaner and
   transaction kthreads the delayed iputs list is not empty, failing the
   following assertion:

      ASSERT(list_empty(&fs_info->delayed_iputs));

Fix this by flushing the fs_info->workers workqueue before running delayed
iputs at close_ctree().

David reported this when running generic/648, which exercises IO error
paths by using the DM error table.

Reported-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
4c782247b8 btrfs: fix non-empty delayed iputs list on unmount due to compressed write workers
At close_ctree() after we have run delayed iputs either through explicitly
calling btrfs_run_delayed_iputs() or later during the call to
btrfs_commit_super() or btrfs_error_commit_super(), we assert that the
delayed iputs list is empty.

When we have compressed writes this assertion may fail because delayed
iputs may have been added to the list after we last ran delayed iputs.
This happens like this:

1) We have a compressed write bio executing;

2) We enter close_ctree() and flush the fs_info->endio_write_workers
   queue which is the queue used for running ordered extent completion;

3) The compressed write bio finishes and enters
   btrfs_finish_compressed_write_work(), where it calls
   btrfs_finish_ordered_extent() which in turn calls
   btrfs_queue_ordered_fn(), which queues a work item in the
   fs_info->endio_write_workers queue that we have flushed before;

4) At close_ctree() we proceed, run all existing delayed iputs and
   call btrfs_commit_super() (which also runs delayed iputs), but before
   we run the following assertion below:

      ASSERT(list_empty(&fs_info->delayed_iputs))

   A delayed iput is added by the step below...

5) The ordered extent completion job queued in step 3 runs and results in
   creating a delayed iput when dropping the last reference of the ordered
   extent (a call to btrfs_put_ordered_extent() made from
   btrfs_finish_one_ordered());

6) At this point the delayed iputs list is not empty, so the assertion at
   close_ctree() fails.

Fix this by flushing the fs_info->compressed_write_workers queue at
close_ctree() before flushing the fs_info->endio_write_workers queue,
respecting the queue dependency as the latter is responsible for the
execution of ordered extent completion.
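
The queue dependency can be modeled with a deterministic userspace
sketch.  It is not a real race reproducer: plain counters stand in for
the workqueues, and all names (struct fake_queues, flush_compressed(),
flush_endio(), teardown_*()) are hypothetical.

```c
#include <assert.h>

/* Counters standing in for pending work items and the delayed iputs list. */
struct fake_queues {
	int compressed_pending;	/* models fs_info->compressed_write_workers */
	int endio_pending;	/* models fs_info->endio_write_workers */
	int delayed_iputs;	/* models the length of the delayed iputs list */
};

/* Each compressed write completion queues an ordered extent completion. */
static void flush_compressed(struct fake_queues *q)
{
	assert(q);
	q->endio_pending += q->compressed_pending;
	q->compressed_pending = 0;
}

/* Each ordered extent completion may add a delayed iput. */
static void flush_endio(struct fake_queues *q)
{
	assert(q);
	q->delayed_iputs += q->endio_pending;
	q->endio_pending = 0;
}

static void run_delayed_iputs(struct fake_queues *q)
{
	q->delayed_iputs = 0;
}

/* Flushing the consumer first: the producer refills it afterwards. */
static int teardown_wrong_order(struct fake_queues *q)
{
	flush_endio(q);
	flush_compressed(q);
	run_delayed_iputs(q);
	return q->endio_pending + q->delayed_iputs;	/* work left behind */
}

/* Flushing the producer first, then the consumer it feeds. */
static int teardown_dependency_order(struct fake_queues *q)
{
	flush_compressed(q);
	flush_endio(q);
	run_delayed_iputs(q);
	return q->endio_pending + q->delayed_iputs;	/* work left behind */
}
```

Starting with three pending compressed writes, the wrong order leaves
three work items that can still add delayed iputs later, while the
dependency order leaves none.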

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
Qu Wenruo
306a75e647 btrfs: allow debug builds to accept 2K block size
Currently we only support two block sizes, 4K and PAGE_SIZE.

This means that on the most common architecture, x86_64, we have no way
to test subpage block size.  And that's exactly why I have an aarch64
machine dedicated to subpage tests.

But this is still a hurdle for a lot of btrfs developers, and to improve
the test coverage mostly on x86_64, here we enable debug builds to
accept 2K block size.

This involves:

- Introduce a dedicated minimal block size macro
  BTRFS_MIN_BLOCKSIZE, which depends on whether CONFIG_BTRFS_DEBUG is set.
  If so it's 2K, otherwise it's 4K as usual.

- Allow 4K, PAGE_SIZE and BTRFS_MIN_BLOCKSIZE as block size

- Update subpage block size checks to be based on BTRFS_MIN_BLOCKSIZE

- Export the new supported blocksize through sysfs interfaces

As most of the subpage support is already pretty mature, there is no
extra work needed to support the extra 2K block size.
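
A minimal sketch of the rule listed above, written as a userspace
program.  FAKE_BTRFS_DEBUG, FAKE_MIN_BLOCKSIZE and blocksize_supported()
are hypothetical names, and the page size is passed in as a parameter
instead of coming from the kernel's PAGE_SIZE.

```c
#include <assert.h>
#include <stdbool.h>

#define SZ_2K 2048u
#define SZ_4K 4096u

/* Assumption: FAKE_BTRFS_DEBUG plays the role of CONFIG_BTRFS_DEBUG. */
#ifdef FAKE_BTRFS_DEBUG
#define FAKE_MIN_BLOCKSIZE SZ_2K
#else
#define FAKE_MIN_BLOCKSIZE SZ_4K
#endif

/* Accept 4K, the page size, and the build-dependent minimum block size. */
static bool blocksize_supported(unsigned int blocksize, unsigned int page_size)
{
	assert(page_size >= SZ_4K);
	return blocksize == SZ_4K ||
	       blocksize == page_size ||
	       blocksize == FAKE_MIN_BLOCKSIZE;
}
```

Built without FAKE_BTRFS_DEBUG, a 2K block size is rejected on a 4K page
system; building with -DFAKE_BTRFS_DEBUG makes the same call succeed.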

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
Qu Wenruo
2ef9d73f2b btrfs: remove the subpage related warning message
Since the initial enablement of block size < page size support for
btrfs in v5.15, we have hit several milestones for block size < page
size (subpage) support:

- RAID56 subpage support
  In v5.19

- Refactored scrub support to support subpage better
  In v6.4

- Block perfect (previously required page aligned ranges) compressed write
  In v6.13

- Various error handling fixes involving subpage
  In v6.14

Finally the only missing feature is the pretty simple and harmless
inlined data extent creation, just added in previous patches.

Now that btrfs has all of its features ready for both regular and
subpage cases, there is no reason to output a warning about the
experimental subpage support, and we can finally remove it.

Acked-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba
72f2bae3c1 btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_init_root_free_objectid()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
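
For readers unfamiliar with the pattern, here is a userspace sketch of
scope-based auto freeing.  It assumes the macro is built on the GCC/Clang
cleanup variable attribute; struct fake_path, FAKE_PATH_AUTO_FREE() and
lookup_something() are hypothetical and only imitate the shape of the
conversion.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct btrfs_path. */
struct fake_path {
	int slots[8];
};

static int live_paths;	/* allocations that have not been freed yet */

static struct fake_path *fake_alloc_path(void)
{
	struct fake_path *p = calloc(1, sizeof(*p));

	if (p)
		live_paths++;
	return p;
}

static void fake_free_path(struct fake_path *p)
{
	if (!p)
		return;
	live_paths--;
	assert(live_paths >= 0);
	free(p);
}

/* Cleanup handlers receive a pointer to the variable leaving scope. */
static void fake_free_path_cleanup(struct fake_path **pp)
{
	fake_free_path(*pp);
}

/* Declare a path pointer that is freed automatically at scope exit. */
#define FAKE_PATH_AUTO_FREE(name) \
	struct fake_path *name __attribute__((cleanup(fake_free_path_cleanup))) = NULL

static int lookup_something(int fail_early)
{
	FAKE_PATH_AUTO_FREE(path);

	path = fake_alloc_path();
	if (!path)
		return -1;
	if (fail_early)
		return -2;	/* plain return, no "goto out" needed */
	path->slots[0] = 1;
	return 0;
}
```

Both the early return and the normal return leave live_paths at zero,
which is what makes the goto -> return conversions safe.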

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
aaa5ae8f6d btrfs: use BTRFS_PATH_AUTO_FREE in load_global_roots()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
efac576c22 btrfs: do trivial BTRFS_PATH_AUTO_FREE conversions
The most trivial pattern for auto freeing: the variable is declared
with the macro and the final btrfs_free_path() is removed.  There are
almost no goto -> return conversions and there's no other function
cleanup.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
bd06bce1b3 btrfs: use num_extent_folios() in for loop bounds
As the helper num_extent_folios() is now __pure, we can use it in for
loop without storing its value in a variable explicitly, the compiler
will do this for us.

The effect on btrfs.ko is -200 bytes and there are stack space savings
too:

btrfs_clone_extent_buffer                               -8 (32 -> 24)
btrfs_clear_buffer_dirty                                -8 (48 -> 40)
clear_extent_buffer_uptodate                            -8 (40 -> 32)
set_extent_buffer_dirty                                 -8 (32 -> 24)
write_one_eb                                            -8 (88 -> 80)
set_extent_buffer_uptodate                              -8 (40 -> 32)
read_extent_buffer_pages_nowait                        -16 (64 -> 48)
find_extent_buffer                                      -8 (32 -> 24)
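
A small sketch of the loop shape in question, using the GCC/Clang pure
function attribute.  struct fake_eb, fake_num_folios() and
mark_all_dirty() are hypothetical; whether the compiler really hoists
the call depends on what it can prove about the loop body.

```c
#include <assert.h>

#define FAKE_MAX_FOLIOS 16

/* Hypothetical stand-in for an extent buffer. */
struct fake_eb {
	unsigned int len;
	unsigned int folio_size;
	int dirty[FAKE_MAX_FOLIOS];
};

/*
 * Pure: the result depends only on the argument and on memory that is
 * read, and the function has no side effects, so the compiler may reuse
 * one result across loop iterations.
 */
__attribute__((pure))
static int fake_num_folios(const struct fake_eb *eb)
{
	return (int)((eb->len + eb->folio_size - 1) / eb->folio_size);
}

static int mark_all_dirty(struct fake_eb *eb)
{
	int marked = 0;

	assert(fake_num_folios(eb) <= FAKE_MAX_FOLIOS);
	/* The helper is called directly in the loop bound, no local copy. */
	for (int i = 0; i < fake_num_folios(eb); i++) {
		eb->dirty[i] = 1;
		marked++;
	}
	return marked;
}
```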

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
b7226ce6c4 btrfs: simplify parameters of metadata folio helpers
Unlike folio helpers for data, the ones for metadata always take the
extent buffer start and length, so they can be simplified to take the
eb only.  The fs_info can be obtained from eb too so it can be dropped
as parameter.

Added in patch "btrfs: use metadata specific helpers to simplify extent
buffer helpers".

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
f867ccabb8 btrfs: simplify returns and labels in btrfs_init_fs_root()
There's a label that does nothing else than return, so remove it and
also change other gotos to immediate returns as the function is short
enough for this pattern.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
Qu Wenruo
fcc384be06 btrfs: require strict data/metadata split for subpage checks
Since we have btrfs_meta_is_subpage(), we should make btrfs_is_subpage()
data inode specific.

This change involves:

- Simplify btrfs_is_subpage()
  Now we only need to do a very simple sectorsize check against
  PAGE_SIZE.
  And since the function is pretty simple now, just make it an inline
  function.

- Add an extra ASSERT() to make sure btrfs_is_subpage() is only called
  on data inode mapping

- Migrate btree_csum_one_bio() to use btrfs_meta_folio_*() helpers
- Migrate alloc_extent_buffer() to use btrfs_meta_folio_*() helpers
- Migrate end_bbio_meta_write() to use btrfs_meta_folio_*() helpers
  Or we will trigger the ASSERT() due to calling btrfs_folio_*() on
  metadata folios.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:42 +01:00
Qu Wenruo
619611e87f btrfs: remove btrfs_fs_info::sectors_per_page
For the future large folio support, our filemap can have folios with
different sizes, thus we can no longer rely on a fixed blocks_per_page
value.

To prepare for that future, here we do:

- Remove btrfs_fs_info::sectors_per_page

- Introduce a helper, btrfs_blocks_per_folio()
  Which uses the folio size to calculate the number of blocks for each
  folio.

- Migrate the existing users of btrfs_fs_info::sectors_per_page to that
  helper
  There are some exceptions:

  * Metadata nodesize < page size support
    In the future, even if we support large folios, we will only
    allocate a folio that matches our nodesize.
    Thus we won't have a folio covering multiple metadata unless
    nodesize < page size.

  * Existing subpage bitmap dump
    We use a single unsigned long to store the bitmap.
    That means until we change the bitmap dumping code, our upper limit
    for folio size will only be 256K (4K block size, 64 bit unsigned
    long).

  * btrfs_is_subpage() check
    This will be migrated into a future patch.
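
The arithmetic behind the helper and the 256K limit mentioned above can
be sketched as follows.  fake_blocks_per_folio() and
max_folio_size_for_bitmap() are hypothetical userspace functions that
take plain sizes instead of a folio or fs_info.

```c
#include <assert.h>
#include <stddef.h>

/* Number of blocks covered by one folio of the given size. */
static unsigned int fake_blocks_per_folio(size_t folio_size,
					  unsigned int blocksize)
{
	assert(blocksize != 0 && folio_size % blocksize == 0);
	return (unsigned int)(folio_size / blocksize);
}

/*
 * Largest folio whose per-block bitmap (one bit per block) still fits
 * into a bitmap of bitmap_bits bits.
 */
static size_t max_folio_size_for_bitmap(unsigned int bitmap_bits,
					unsigned int blocksize)
{
	return (size_t)bitmap_bits * blocksize;
}
```

With a 64 bit unsigned long and a 4K block size this gives
64 * 4096 = 262144 bytes, i.e. the 256K upper limit for the bitmap dump.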

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00
Easwar Hariharan
8ef9019ed2 btrfs: convert timeouts to secs_to_jiffies()
Commit b35108a51c ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies().  As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies() to avoid the
multiplication.

This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:

@depends on patch@
expression E;
@@

-msecs_to_jiffies
+secs_to_jiffies
(E
- * \( 1000 \| MSEC_PER_SEC \)
)
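
To show why the rewrite is value-preserving, here is a simplified
userspace model of the two helpers.  FAKE_HZ, fake_msecs_to_jiffies()
and fake_secs_to_jiffies() are hypothetical; the kernel's
msecs_to_jiffies() handles more cases (overflow, tick rates that do not
divide 1000) than this sketch does.

```c
#include <assert.h>

#define FAKE_HZ			250UL	/* assumption: tick rate dividing 1000 */
#define FAKE_MSEC_PER_SEC	1000UL

static_assert(FAKE_MSEC_PER_SEC % FAKE_HZ == 0,
	      "this sketch only models tick rates that divide 1000");

/* Round up so that a timeout never expires early. */
static unsigned long fake_msecs_to_jiffies(unsigned long ms)
{
	const unsigned long ms_per_tick = FAKE_MSEC_PER_SEC / FAKE_HZ;

	return (ms + ms_per_tick - 1) / ms_per_tick;
}

/* Whole seconds map directly to ticks, no millisecond detour. */
static unsigned long fake_secs_to_jiffies(unsigned long secs)
{
	return secs * FAKE_HZ;
}
```

For a 30 second timeout, fake_msecs_to_jiffies(30 * FAKE_MSEC_PER_SEC)
and fake_secs_to_jiffies(30) both yield 7500 ticks in this model, so the
conversion only drops the multiplication at the call site.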

Link: https://lkml.kernel.org/r/20250225-converge-secs-to-jiffies-part-two-v3-5-a43967e36c88@linux.microsoft.com
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Cc: Dongsheng Yang <dongsheng.yang@easystack.cn>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Frank Li <frank.li@nxp.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Ilpo Jarvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kalesh Anakkur Purayil <kalesh-anakkur.purayil@broadcom.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Mark Brown <broonie@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Nicolas Palix <nicolas.palix@imag.fr>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 23:24:16 -07:00
David Sterba
248c4ff393 btrfs: split waiting from read_extent_buffer_pages(), drop parameter wait
There are only 2 WAIT_* values left for the wait parameter, so we can
encode this in the function name if the waiting functionality is split.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:23 +01:00