2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

1840 Commits

Author SHA1 Message Date
Filipe Manana
4469e95fe5 btrfs: log error codes during failures when writing super blocks
When writing super blocks, at write_dev_supers(), we log an error message
when we get some error but we don't show which error we got and we have
that information. So enhance the error messages with the error codes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:57 +02:00
Naohiro Aota
635da7ea9a btrfs: add block reserve for treelog
We need to add a dedicated block_rsv for tree-log, because the block_rsv
serves for a tree node allocation in btrfs_alloc_tree_block(). Currently,
tree-log tree uses fs_info->empty_block_rsv, which is shared across trees
and points to the normal metadata space_info. Instead, we add a dedicated
block_rsv and that block_rsv can use the dedicated sub-space_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:53 +02:00
Qu Wenruo
007fa63225 btrfs: get rid of btrfs_read_dev_super()
The function is introduced by commit a512bbf855 ("Btrfs: superblock
duplication") at the beginning of btrfs.

It leaved a comment saying we'd need a special mount option to read all
super blocks, but it's never been implemented and there was not
need/request for it. The check/rescue tools are able to start from a
specific copy and use it as primary eventually.

This means btrfs_read_dev_super() is always reading the first super
block, making all the code finding the latest super block unnecessary.

Just remove that function and replace all call sites with
btrfs_read_disk_super(bdev, 0, false).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Qu Wenruo
63f32b7b5d btrfs: merge btrfs_read_dev_one_super() into btrfs_read_disk_super()
We have two functions to read a super block from a block device:

- btrfs_read_dev_one_super()
  Exported from disk-io.c

- btrfs_read_disk_super()
  Local to volumes.c

And they have some minor differences:

- btrfs_read_dev_one_super() uses @copy_num
  Meanwhile btrfs_read_disk_super() relies on the physical and expected
  bytenr passed from the caller.

  The parameter list of btrfs_read_dev_one_super() is more user
  friendly.

- btrfs_read_disk_super() makes sure the label is NUL terminated

We do not need two different functions doing the same job, so merge the
behavior into btrfs_read_disk_super() by:

- Remove btrfs_read_dev_one_super()

- Export btrfs_read_disk_super()
  The name pairs with btrfs_release_disk_super() perfectly.

- Change the parameter list of btrfs_read_disk_super() to mimic
  btrfs_read_dev_one_super()
  All existing callers are calculating the physical address and expect
  bytenr before calling btrfs_read_disk_super() already.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Josef Bacik
19d7f65f03 btrfs: convert the buffer_radix to an xarray
In order to fully utilize xarray tagging to improve writeback we need to
convert the buffer_radix to a proper xarray.  This conversion is
relatively straightforward as the radix code uses the xarray underneath.
Using xarray directly allows for quite a lot less code.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
David Sterba
656e9f51de btrfs: rename btrfs_discard workqueue to btrfs-discard
We use the "btrfs-" prefix for our workqueues, the discard has
underscore instead of dash, so unify it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
David Sterba
a24d185c36 btrfs: change return type of btree_csum_one_bio() to int
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_bio_csum().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
c0ee55f796 btrfs: merge __setup_root() to btrfs_alloc_root()
There's only one caller of __setup_root() so merge it there.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
2d44a15afd btrfs: use list_first_entry() everywhere
Using the helper makes it a bit more clear that we're accessing the
first list entry.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:47 +02:00
David Sterba
ed50ab0fec btrfs: convert WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)) to DEBUG_WARN
Use the conditional warning instead of typing the whole condition.
Optional message is printed where it seems clear what could be the
problem.

Conversion is left out in btree_csum_one_bio() because of the additional
condition.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:47 +02:00
Christoph Hellwig
3240b2c97b btrfs: pass a physical address to btrfs_repair_io_failure()
Using physical address has the following advantages:

- All involved callers only need a single pointer
  Instead of the old @folio + @offset pair.

- No complex poking into the bio_vec structure
  As a bio_vec can be single or multiple paged, grabbing the real page
  can be quite complex if the bio_vec is a multi-page one.

  Instead bvec_phys() will always give a single physical address, and it
  cab be easily converted to a page.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:46 +02:00
Filipe Manana
d846a6d3b0 btrfs: rename remaining exported extent map functions
Rename all the exported functions from extent_map.h that don't have a
'btrfs_' prefix in their names, so that they are consistent with all the
other functions, to make it clear they are btrfs specific functions and
to avoid potential name collisions in the future with functions defined
elsewhere in the kernel.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:45 +02:00
Filipe Manana
b351161f4f btrfs: rename free_extent_state() to include a btrfs prefix
This is an exported function so it should have a 'btrfs_' prefix by
convention, to make it clear it's btrfs specific and to avoid collisions
with functions from elsewhere in the kernel.

Rename the function to add 'btrfs_' prefix to it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:44 +02:00
Filipe Manana
e965835c98 btrfs: rename the functions to init and release an extent io tree
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.

So add a 'btrfs_' prefix to their name to make it clear they are from
btrfs.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:44 +02:00
Filipe Manana
66da9c1bed btrfs: rename the functions to search for bits in extent ranges
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.

So add a 'btrfs_' prefix to their name to make it clear they are from
btrfs.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:44 +02:00
Filipe Manana
9d222562b4 btrfs: rename the functions to clear bits for an extent range
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. One of them has a
double underscore prefix which is also discouraged.

So remove double underscore prefix where applicable and add a 'btrfs_'
prefix to their name to make it clear they are from btrfs.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:43 +02:00
Daniel Vacek
c61660ec34 btrfs: remove unused flag EXTENT_BUFFER_CORRUPT
This flag is no longer being used.  It was added by commit a826d6dcb3
("Btrfs: check items for correctness as we search") but it's no longer
being used after commit f26c923860 ("btrfs: remove reada
infrastructure").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:39 +02:00
Linus Torvalds
0cb9ce06a6 for-6.15-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmgAzO8ACgkQxWXV+ddt
 WDve6g//UWZ24/wLOoFC4u2wwuctnWy5FFOrvk0IqdxWzuSjA1Ou1P4WfD2xlnQv
 wFqYk2SIuP68WQhd09Oj1WRQ9SbJIgAwITeryw4lFYq8v1q8xFB5kM0UTLXXlaNH
 O342UK7HRW7XfXD9VkcQz5wXQvk0i7pmtZTjiD1QBbWS+qlEc5YQiZnMRlUlQKBw
 85JM45iOFwHJLVt+A8ydC1yMdP7xktiVEhlPsjvzqUKs8orquuikxSW5d/WlDc9g
 OeOf9pvxSNf3zsAzmwUrEOxsn3fLFFjoaPxDpfn42BsN4FcyIv4l9K9HdkcdzrLY
 Gu0QaDVGCb6bXYhioyEzv/mzESQzOTWQUzI2fJrPPquwH9g0dss9uQwOwaOWbfHO
 MDF7fBVwnChaC0O8NoKk5H8jQAXxPfAuU1JpypKOORuffTVz7uG3xkK56VJ/kfTh
 qgqRImNGTuAu0C0xGdUjngpOfRypDQLQTo58AubLFAWjqD4elOFjanc/6xobYAJi
 PnPk132yKxAdR9h4+1YUk1lzaauDinNzErt+vpUQ/g2QL9PtUbp1IG7VF9llGDzO
 hqlifRBHcNy7cKNirFX0PYCke8fXrsKC1NbNiAQMjuK7agzg3b/+PW05EFLQv3EU
 6CNgukLG8XbfK2F7PMwmno4zUXbA5JA2mxnKr4vRIMrGZVBTcTo=
 =HZ/U
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - handle encoded read ioctl returning EAGAIN so it does not mistakenly
   free the work structure

 - escape subvolume path in mount option list so it cannot be wrongly
   parsed when the path contains ","

 - remove folio size assertions when writing super block to device with
   enabled large folios

* tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: remove folio order ASSERT()s in super block writeback path
  btrfs: correctly escape subvol in btrfs_show_options()
  btrfs: ioctl: don't free iov when btrfs_encoded_read() returns -EAGAIN
2025-04-17 09:17:57 -07:00
Linus Torvalds
d6b02199cd - The 7 patch series "powerpc/crash: use generic crashkernel
reservation" from Sourabh Jain changes powerpc's kexec code to use more
   of the generic layers.
 
 - The 2 patch series "get_maintainer: report subsystem status
   separately" from Vlastimil Babka makes some long-requested improvements
   to the get_maintainer output.
 
 - The 4 patch series "ucount: Simplify refcounting with rcuref_t" from
   Sebastian Siewior cleans up and optimizing the refcounting in the ucount
   code.
 
 - The 12 patch series "reboot: support runtime configuration of
   emergency hw_protection action" from Ahmad Fatoum improves the ability
   for a driver to perform an emergency system shutdown or reboot.
 
 - The 16 patch series "Converge on using secs_to_jiffies() part two"
   from Easwar Hariharan performs further migrations from
   msecs_to_jiffies() to secs_to_jiffies().
 
 - The 7 patch series "lib/interval_tree: add some test cases and
   cleanup" from Wei Yang permits more userspace testing of kernel library
   code, adds some more tests and performs some cleanups.
 
 - The 2 patch series "hung_task: Dump the blocking task stacktrace" from
   Masami Hiramatsu arranges for the hung_task detector to dump the stack
   of the blocking task and not just that of the blocked task.
 
 - The 4 patch series "resource: Split and use DEFINE_RES*() macros" from
   Andy Shevchenko provides some cleanups to the resource definition
   macros.
 
 - Plus the usual shower of singleton patches - please see the individual
   changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nuqwAKCRDdBJ7gKXxA
 jtNqAQDxqJpjWkzn4yN9CNSs1ivVx3fr6SqazlYCrt3u89WQvwEA1oRrGpETzUGq
 r6khQUIcQImPPcjFqEFpuiSOU0MBZA0=
 =Kii8
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - The series "powerpc/crash: use generic crashkernel reservation" from
   Sourabh Jain changes powerpc's kexec code to use more of the generic
   layers.

 - The series "get_maintainer: report subsystem status separately" from
   Vlastimil Babka makes some long-requested improvements to the
   get_maintainer output.

 - The series "ucount: Simplify refcounting with rcuref_t" from
   Sebastian Siewior cleans up and optimizing the refcounting in the
   ucount code.

 - The series "reboot: support runtime configuration of emergency
   hw_protection action" from Ahmad Fatoum improves the ability for a
   driver to perform an emergency system shutdown or reboot.

 - The series "Converge on using secs_to_jiffies() part two" from Easwar
   Hariharan performs further migrations from msecs_to_jiffies() to
   secs_to_jiffies().

 - The series "lib/interval_tree: add some test cases and cleanup" from
   Wei Yang permits more userspace testing of kernel library code, adds
   some more tests and performs some cleanups.

 - The series "hung_task: Dump the blocking task stacktrace" from Masami
   Hiramatsu arranges for the hung_task detector to dump the stack of
   the blocking task and not just that of the blocked task.

 - The series "resource: Split and use DEFINE_RES*() macros" from Andy
   Shevchenko provides some cleanups to the resource definition macros.

 - Plus the usual shower of singleton patches - please see the
   individual changelogs for details.

* tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
  mailmap: consolidate email addresses of Alexander Sverdlin
  fs/procfs: fix the comment above proc_pid_wchan()
  relay: use kasprintf() instead of fixed buffer formatting
  resource: replace open coded variant of DEFINE_RES()
  resource: replace open coded variants of DEFINE_RES_*_NAMED()
  resource: replace open coded variant of DEFINE_RES_NAMED_DESC()
  resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED()
  samples: add hung_task detector mutex blocking sample
  hung_task: show the blocker task if the task is hung on mutex
  kexec_core: accept unaccepted kexec segments' destination addresses
  watchdog/perf: optimize bytes copied and remove manual NUL-termination
  lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap()
  lib/interval_tree: skip the check before go to the right subtree
  lib/interval_tree: add test case for span iteration
  lib/interval_tree: add test case for interval_tree_iter_xxx() helpers
  lib/rbtree: add random seed
  lib/rbtree: split tests
  lib/rbtree: enable userland test suite for rbtree related data structure
  checkpatch: describe --min-conf-desc-length
  scripts/gdb/symbols: determine KASLR offset on s390
  ...
2025-04-01 10:06:52 -07:00
Qu Wenruo
65f2a3b232 btrfs: remove folio order ASSERT()s in super block writeback path
[BUG]
There is a syzbot report that the ASSERT() inside write_dev_supers() got
triggered:

  assertion failed: folio_order(folio) == 0, in fs/btrfs/disk-io.c:3858
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/disk-io.c:3858!
  Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
  CPU: 0 UID: 0 PID: 6730 Comm: syz-executor378 Not tainted 6.14.0-syzkaller-03565-gf6e0150b2003 #0 PREEMPT(full)
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  RIP: 0010:write_dev_supers fs/btrfs/disk-io.c:3858 [inline]
  RIP: 0010:write_all_supers+0x400f/0x4090 fs/btrfs/disk-io.c:4155
  Call Trace:
   <TASK>
   btrfs_commit_transaction+0x1eda/0x3750 fs/btrfs/transaction.c:2528
   btrfs_quota_enable+0xfcc/0x21a0 fs/btrfs/qgroup.c:1226
   btrfs_ioctl_quota_ctl+0x144/0x1c0 fs/btrfs/ioctl.c:3677
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:906 [inline]
   __se_sys_ioctl+0xf1/0x160 fs/ioctl.c:892
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f5ad1f20289
   </TASK>
  ---[ end trace 0000000000000000 ]---

[CAUSE]
Since commit f93ee0df51 ("btrfs: convert super block writes to folio
in write_dev_supers()") and commit c94b7349b8 ("btrfs: convert super
block writes to folio in wait_dev_supers()"), the super block writeback
path is converted to use folio.

Since the original code is using page based interfaces, we have an
"ASSERT(folio_order(folio) == 0);" added to make sure everything is not
changed.

But the folio here is not from any btrfs inode, but from the block
device, and we have no control on the folio order in bdev, the device
can choose whatever folio size they want/need.

E.g. the bdev may even have a block size of multiple pages.

So the ASSERT() is triggered.

[FIX]
The super block writeback path has taken larger folios into
consideration, so there is no need for the ASSERT().

And since commit bc00965dbf ("btrfs: count super block write errors in
device instead of tracking folio error state"), the wait path no longer
checks the folio status but only wait for the folio writeback to finish,
there is nothing requiring the ASSERT() either.

So we can remove both ASSERT()s safely now.

Reported-by: syzbot+34122898a11ab689518a@syzkaller.appspotmail.com
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-01 01:02:42 +02:00
Mark Harmstone
9db9c7dd5b btrfs: don't clobber ret in btrfs_validate_super()
Commit 2a9bb78cfd ("btrfs: validate system chunk array at
btrfs_validate_super()") introduces a call to validate_sys_chunk_array()
in btrfs_validate_super(), which clobbers the value of ret set earlier.
This has the effect of negating the validity checks done earlier, making
it so btrfs could potentially try to mount invalid filesystems.

Fixes: 2a9bb78cfd ("btrfs: validate system chunk array at btrfs_validate_super()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:54 +01:00
Qu Wenruo
19e60b2a95 btrfs: add extra warning if delayed iput is added when it's not allowed
Since I have triggered the ASSERT() on the delayed iput too many times,
now is the time to add some extra debug warnings for delayed iput.

All delayed iputs should be queued after all ordered extents finish
their IO and all involved workqueues are flushed.

Thus after the btrfs_run_delayed_iputs() inside close_ctree(), there
should be no more delayed puts added.

So introduce a new BTRFS_FS_STATE_NO_DELAYED_IPUT, set after the above
mentioned timing.  And all btrfs_add_delayed_iput() will check that flag
and give a WARN_ON_ONCE().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:52 +01:00
Qu Wenruo
df94a342ef btrfs: run btrfs_error_commit_super() early
[BUG]
Even after all the error fixes related the
"ASSERT(list_empty(&fs_info->delayed_iputs));" in close_ctree(), I can
still hit it reliably with my experimental 2K block size.

[CAUSE]
In my case, all the error is triggered after the fs is already in error
status.

I find the following call trace to be the cause of race:

           Main thread                       |     endio_write_workers
---------------------------------------------+---------------------------
close_ctree()                                |
|- btrfs_error_commit_super()                |
|  |- btrfs_cleanup_transaction()            |
|  |  |- btrfs_destroy_all_ordered_extents() |
|  |     |- btrfs_wait_ordered_roots()       |
|  |- btrfs_run_delayed_iputs()              |
|                                            | btrfs_finish_ordered_io()
|                                            | |- btrfs_put_ordered_extent()
|                                            |    |- btrfs_add_delayed_iput()
|- ASSERT(list_empty(delayed_iputs))         |
   !!! Triggered !!!

The root cause is that, btrfs_wait_ordered_roots() only wait for
ordered extents to finish their IOs, not to wait for them to finish and
removed.

[FIX]
Since btrfs_error_commit_super() will flush and wait for all ordered
extents, it should be executed early, before we start flushing the
workqueues.

And since btrfs_error_commit_super() now runs early, there is no need to
run btrfs_run_delayed_iputs() inside it, so just remove the
btrfs_run_delayed_iputs() call from btrfs_error_commit_super().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
cda76788f8 btrfs: fix non-empty delayed iputs list on unmount due to async workers
At close_ctree() after we have ran delayed iputs either explicitly through
calling btrfs_run_delayed_iputs() or later during the call to
btrfs_commit_super() or btrfs_error_commit_super(), we assert that the
delayed iputs list is empty.

We have (another) race where this assertion might fail because we have
queued an async write into the fs_info->workers workqueue. Here's how it
happens:

1) We are submitting a data bio for an inode that is not the data
   relocation inode, so we call btrfs_wq_submit_bio();

2) btrfs_wq_submit_bio() submits a work for the fs_info->workers queue
   that will run run_one_async_done();

3) We enter close_ctree(), flush several work queues except
   fs_info->workers, explicitly run delayed iputs with a call to
   btrfs_run_delayed_iputs() and then again shortly after by calling
   btrfs_commit_super() or btrfs_error_commit_super(), which also run
   delayed iputs;

4) run_one_async_done() is executed in the work queue, and because there
   was an IO error (bio->bi_status is not 0) it calls btrfs_bio_end_io(),
   which drops the final reference on the associated ordered extent by
   calling btrfs_put_ordered_extent() - and that adds a delayed iput for
   the inode;

5) At close_ctree() we find that after stopping the cleaner and
   transaction kthreads the delayed iputs list is not empty, failing the
   following assertion:

      ASSERT(list_empty(&fs_info->delayed_iputs));

Fix this by flushing the fs_info->workers workqueue before running delayed
iputs at close_ctree().

David reported this when running generic/648, which exercises IO error
paths by using the DM error table.

Reported-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
Filipe Manana
4c782247b8 btrfs: fix non-empty delayed iputs list on unmount due to compressed write workers
At close_ctree() after we have ran delayed iputs either through explicitly
calling btrfs_run_delayed_iputs() or later during the call to
btrfs_commit_super() or btrfs_error_commit_super(), we assert that the
delayed iputs list is empty.

When we have compressed writes this assertion may fail because delayed
iputs may have been added to the list after we last ran delayed iputs.
This happens like this:

1) We have a compressed write bio executing;

2) We enter close_ctree() and flush the fs_info->endio_write_workers
   queue which is the queue used for running ordered extent completion;

3) The compressed write bio finishes and enters
   btrfs_finish_compressed_write_work(), where it calls
   btrfs_finish_ordered_extent() which in turn calls
   btrfs_queue_ordered_fn(), which queues a work item in the
   fs_info->endio_write_workers queue that we have flushed before;

4) At close_ctree() we proceed, run all existing delayed iputs and
   call btrfs_commit_super() (which also runs delayed iputs), but before
   we run the following assertion below:

      ASSERT(list_empty(&fs_info->delayed_iputs))

   A delayed iput is added by the step below...

5) The ordered extent completion job queued in step 3 runs and results in
   creating a delayed iput when dropping the last reference of the ordered
   extent (a call to btrfs_put_ordered_extent() made from
   btrfs_finish_one_ordered());

6) At this point the delayed iputs list is not empty, so the assertion at
   close_ctree() fails.

Fix this by flushing the fs_info->compressed_write_workers queue at
close_ctree() before flushing the fs_info->endio_write_workers queue,
respecting the queue dependency as the later is responsible for the
execution of ordered extent completion.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
Qu Wenruo
306a75e647 btrfs: allow debug builds to accept 2K block size
Currently we only support two block sizes, 4K and PAGE_SIZE.

This means on the most common architecture x86_64, we have no way to
test subpage block size.  And that's exactly I have an aarch64 machine
dedicated for subpage tests.

But this is still a hurdle for a lot of btrfs developers, and to improve
the test coverage mostly on x86_64, here we enable debug builds to
accept 2K block size.

This involves:

- Introduce a dedicated minimal block size macro
  BTRFS_MIN_BLOCKSIZE, which depends on if CONFIG_BTRFS_DEBUG is set.
  If so it's 2K, otherwise it's 4K as usual.

- Allow 4K, PAGE_SIZE and BTRFS_MIN_BLOCKSIZE as block size

- Update subpage block size checks to be based on BTRFS_MIN_BLOCKSIZE

- Export the new supported blocksize through sysfs interfaces

As most of the subpage support is already pretty mature, there is no
extra work needed to support the extra 2K block size.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
Qu Wenruo
2ef9d73f2b btrfs: remove the subpage related warning message
Since the initial enablement of block size < page size support for
btrfs in v5.15, we have hit several milestones for block size < page
size (subpage) support:

- RAID56 subpage support
  In v5.19

- Refactored scrub support to support subpage better
  In v6.4

- Block perfect (previously requires page aligned ranges) compressed write
  In v6.13

- Various error handling fixes involving subpage
  In v6.14

Finally the only missing feature is the pretty simple and harmless
inlined data extent creation, just added in previous patches.

Now btrfs has all of its features ready for both regular and subpage
cases, there is no reason to output a warning about the experimental
subpage support, and we can finally remove it now.

Acked-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba
72f2bae3c1 btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_init_root_free_objectid()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
aaa5ae8f6d btrfs: use BTRFS_PATH_AUTO_FREE in load_global_roots()
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
efac576c22 btrfs: do trivial BTRFS_PATH_AUTO_FREE conversions
The most trivial pattern for the auto freeing when the variable is
declared with the macro and the final btrfs_free_path() is removed.
There are almost none goto -> return conversions and there's no other
function cleanup.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
bd06bce1b3 btrfs: use num_extent_folios() in for loop bounds
As the helper num_extent_folios() is now __pure, we can use it in for
loop without storing its value in a variable explicitly, the compiler
will do this for us.

The effects on btrfs.ko is -200 bytes and there are stack space savings
too:

btrfs_clone_extent_buffer                               -8 (32 -> 24)
btrfs_clear_buffer_dirty                                -8 (48 -> 40)
clear_extent_buffer_uptodate                            -8 (40 -> 32)
set_extent_buffer_dirty                                 -8 (32 -> 24)
write_one_eb                                            -8 (88 -> 80)
set_extent_buffer_uptodate                              -8 (40 -> 32)
read_extent_buffer_pages_nowait                        -16 (64 -> 48)
find_extent_buffer                                      -8 (32 -> 24)

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
b7226ce6c4 btrfs: simplify parameters of metadata folio helpers
Unlike folio helpers for date the ones for metadata always take the
extent buffer start and length, so they can be simplified to take the
eb only.  The fs_info can be obtained from eb too so it can be dropped
as parameter.

Added in patch "btrfs: use metadata specific helpers to simplify extent
buffer helpers".

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:47 +01:00
David Sterba
f867ccabb8 btrfs: simplify returns and labels in btrfs_init_fs_root()
There's a label that does nothing else than return, so remove it and
also change other gotos to immediate returns as the function is short
enough for this pattern.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
Qu Wenruo
fcc384be06 btrfs: require strict data/metadata split for subpage checks
Since we have btrfs_meta_is_subpage(), we should make btrfs_is_subpage()
to be data inode specific.

This change involves:

- Simplify btrfs_is_subpage()
  Now we only need to do a very simple sectorsize check against
  PAGE_SIZE.
  And since the function is pretty simple now, just make it an inline
  function.

- Add an extra ASSERT() to make sure btrfs_is_subpage() is only called
  on data inode mapping

- Migrate btree_csum_one_bio() to use btrfs_meta_folio_*() helpers
- Migrate alloc_extent_buffer() to use btrfs_meta_folio_*() helpers
- Migrate end_bbio_meta_write() to use btrfs_meta_folio_*() helpers
  Or we will trigger the ASSERT() due to calling btrfs_folio_*() on
  metadata folios.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:42 +01:00
Qu Wenruo
619611e87f btrfs: remove btrfs_fs_info::sectors_per_page
For the future large folio support, our filemap can have folios with
different sizes, thus we can no longer rely on a fixed blocks_per_page
value.

To prepare for that future, here we do:

- Remove btrfs_fs_info::sectors_per_page

- Introduce a helper, btrfs_blocks_per_folio()
  Which uses the folio size to calculate the number of blocks for each
  folio.

- Migrate the existing btrfs_fs_info::sectors_per_page to use that
  helper
  There are some exceptions:

  * Metadata nodesize < page size support
    In the future, even if we support large folios, we will only
    allocate a folio that matches our nodesize.
    Thus we won't have a folio covering multiple metadata unless
    nodesize < page size.

  * Existing subpage bitmap dump
    We use a single unsigned long to store the bitmap.
    That means until we change the bitmap dumping code, our upper limit
    for folio size will only be 256K (4K block size, 64 bit unsigned
    long).

  * btrfs_is_subpage() check
    This will be migrated into a future patch.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00
Easwar Hariharan
8ef9019ed2 btrfs: convert timeouts to secs_to_jiffies()
Commit b35108a51c ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies().  As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies() to avoid the
multiplication

This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:

@depends on patch@
expression E;
@@

-msecs_to_jiffies
+secs_to_jiffies
(E
- * \( 1000 \| MSEC_PER_SEC \)
)

Link: https://lkml.kernel.org/r/20250225-converge-secs-to-jiffies-part-two-v3-5-a43967e36c88@linux.microsoft.com
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Cc: Dongsheng Yang <dongsheng.yang@easystack.cn>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Frank Li <frank.li@nxp.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Ilpo Jarvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kalesh Anakkur Purayil <kalesh-anakkur.purayil@broadcom.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Mark Brown <broonie@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Nicolas Palix <nicolas.palix@imag.fr>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 23:24:16 -07:00
David Sterba
248c4ff393 btrfs: split waiting from read_extent_buffer_pages(), drop parameter wait
There are only 2 WAIT_* values left for wait parameter, we can encode
this to the function name if the waiting functionality is split.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:23 +01:00
Anand Jain
22fb0d99c9 btrfs: add tracking of read blocks for read policy
Track number of read blocks in the whole filesystem. The counter is
initialized when devices are opened. The counter is increased at
btrfs_submit_dev_bio() if the stats tracking is enabled (depends on the
read policy).  Stats tracking is disabled by default and is enabled
through fs_devices::collect_fs_stats when required.

The code is not under the EXPERIMENTAL define, as stats can be expanded
to include write counts and other performance counters, with the user
interface independent of its internal use.

This is an in-memory-only feature, not related to the dev error stats.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:21 +01:00
Anand Jain
a5019b7070 btrfs: initialize fs_devices->fs_info earlier in btrfs_init_devices_late()
Currently, fs_devices->fs_info is initialized in btrfs_init_devices_late(),
but this occurs too late for find_live_mirror(), which is invoked by
load_super_root() much earlier than btrfs_init_devices_late().

Fix this by moving the initialization to open_ctree(), before load_super_root().

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:20 +01:00
Qu Wenruo
2a9bb78cfd btrfs: validate system chunk array at btrfs_validate_super()
Currently btrfs_validate_super() only does a very basic check on the
array chunk size (not too large than the available space, but not too
small to contain no chunk).

The more comprehensive checks (the regular chunk checks and size check
inside the system chunk array) are all done inside btrfs_read_sys_array().

It's not a big deal, but it also means we do not do any validation on
the system chunk array at super block writeback time either.

Do the following modification to centralize the system chunk array
checks into btrfs_validate_super():

- Make chunk_err() helper accept stack chunk pointer
  If @leaf parameter is NULL, then the @chunk pointer will be a pointer
  to the chunk item, other than the offset inside the leaf.

  And since @leaf can be NULL, add a new @fs_info parameter for that
  case.

- Make btrfs_check_chunk_valid() handle stack chunk pointer
  The same as chunk_err(), a new @fs_info parameter, and if @leaf is
  NULL, then @chunk will be a pointer to a stack chunk.

  If @chunk is NULL, then all needed btrfs_chunk members will be read
  using the stack helper instead of the leaf helper.
  This means we need to read out all the needed member at the beginning
  of the function.

  Furthermore, at super block read time, fs_info->sectorsize is not yet
  initialized, we need one extra @sectorsize parameter to grab the
  correct sectorsize.

- Introduce a helper validate_sys_chunk_array()
  * Validate the disk key.
  * Validate the size before we access the full chunk items.
  * Do the full chunk item validation.

- Call validate_sys_chunk_array() at btrfs_validate_super()

- Simplify the checks inside btrfs_read_sys_array()
  Now the checks will be converted to an ASSERT().

- Simplify the checks inside read_one_chunk()
  Now that all chunk items inside system chunk array and chunk tree are
  verified, there is no need to verify them again inside read_one_chunk().

This change has the following advantages:

- More comprehensive checks at write time
  And unlike the sys_chunk_array read routine, this time we do not need
  to allocate a dummy extent buffer to do the check.
  All the checks done here require no new memory allocation.

- Slightly improved readability when iterating the system chunk array

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:18 +01:00
Filipe Manana
f10bef73fb btrfs: flush delalloc workers queue before stopping cleaner kthread during unmount
During the unmount path, at close_ctree(), we first stop the cleaner
kthread, using kthread_stop() which frees the associated task_struct, and
then stop and destroy all the work queues. However after we stopped the
cleaner we may still have a worker from the delalloc_workers queue running
inode.c:submit_compressed_extents(), which calls btrfs_add_delayed_iput(),
which in turn tries to wake up the cleaner kthread - which was already
destroyed before, resulting in a use-after-free on the task_struct.

Syzbot reported this with the following stack traces:

  BUG: KASAN: slab-use-after-free in __lock_acquire+0x78/0x2100 kernel/locking/lockdep.c:5089
  Read of size 8 at addr ffff8880259d2818 by task kworker/u8:3/52

  CPU: 1 UID: 0 PID: 52 Comm: kworker/u8:3 Not tainted 6.13.0-rc1-syzkaller-00002-gcdd30ebb1b9f #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
  Workqueue: btrfs-delalloc btrfs_work_helper
  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:94 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
   print_address_description mm/kasan/report.c:378 [inline]
   print_report+0x169/0x550 mm/kasan/report.c:489
   kasan_report+0x143/0x180 mm/kasan/report.c:602
   __lock_acquire+0x78/0x2100 kernel/locking/lockdep.c:5089
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
   _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
   class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
   try_to_wake_up+0xc2/0x1470 kernel/sched/core.c:4205
   submit_compressed_extents+0xdf/0x16e0 fs/btrfs/inode.c:1615
   run_ordered_work fs/btrfs/async-thread.c:288 [inline]
   btrfs_work_helper+0x96f/0xc40 fs/btrfs/async-thread.c:324
   process_one_work kernel/workqueue.c:3229 [inline]
   process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3310
   worker_thread+0x870/0xd30 kernel/workqueue.c:3391
   kthread+0x2f0/0x390 kernel/kthread.c:389
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
   </TASK>

  Allocated by task 2:
   kasan_save_stack mm/kasan/common.c:47 [inline]
   kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
   unpoison_slab_object mm/kasan/common.c:319 [inline]
   __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:345
   kasan_slab_alloc include/linux/kasan.h:250 [inline]
   slab_post_alloc_hook mm/slub.c:4104 [inline]
   slab_alloc_node mm/slub.c:4153 [inline]
   kmem_cache_alloc_node_noprof+0x1d9/0x380 mm/slub.c:4205
   alloc_task_struct_node kernel/fork.c:180 [inline]
   dup_task_struct+0x57/0x8c0 kernel/fork.c:1113
   copy_process+0x5d1/0x3d50 kernel/fork.c:2225
   kernel_clone+0x223/0x870 kernel/fork.c:2807
   kernel_thread+0x1bc/0x240 kernel/fork.c:2869
   create_kthread kernel/kthread.c:412 [inline]
   kthreadd+0x60d/0x810 kernel/kthread.c:767
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

  Freed by task 24:
   kasan_save_stack mm/kasan/common.c:47 [inline]
   kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
   kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:582
   poison_slab_object mm/kasan/common.c:247 [inline]
   __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
   kasan_slab_free include/linux/kasan.h:233 [inline]
   slab_free_hook mm/slub.c:2338 [inline]
   slab_free mm/slub.c:4598 [inline]
   kmem_cache_free+0x195/0x410 mm/slub.c:4700
   put_task_struct include/linux/sched/task.h:144 [inline]
   delayed_put_task_struct+0x125/0x300 kernel/exit.c:227
   rcu_do_batch kernel/rcu/tree.c:2567 [inline]
   rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823
   handle_softirqs+0x2d4/0x9b0 kernel/softirq.c:554
   run_ksoftirqd+0xca/0x130 kernel/softirq.c:943
   smpboot_thread_fn+0x544/0xa30 kernel/smpboot.c:164
   kthread+0x2f0/0x390 kernel/kthread.c:389
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

  Last potentially related work creation:
   kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47
   __kasan_record_aux_stack+0xac/0xc0 mm/kasan/generic.c:544
   __call_rcu_common kernel/rcu/tree.c:3086 [inline]
   call_rcu+0x167/0xa70 kernel/rcu/tree.c:3190
   context_switch kernel/sched/core.c:5372 [inline]
   __schedule+0x1803/0x4be0 kernel/sched/core.c:6756
   __schedule_loop kernel/sched/core.c:6833 [inline]
   schedule+0x14b/0x320 kernel/sched/core.c:6848
   schedule_timeout+0xb0/0x290 kernel/time/sleep_timeout.c:75
   do_wait_for_common kernel/sched/completion.c:95 [inline]
   __wait_for_common kernel/sched/completion.c:116 [inline]
   wait_for_common kernel/sched/completion.c:127 [inline]
   wait_for_completion+0x355/0x620 kernel/sched/completion.c:148
   kthread_stop+0x19e/0x640 kernel/kthread.c:712
   close_ctree+0x524/0xd60 fs/btrfs/disk-io.c:4328
   generic_shutdown_super+0x139/0x2d0 fs/super.c:642
   kill_anon_super+0x3b/0x70 fs/super.c:1237
   btrfs_kill_super+0x41/0x50 fs/btrfs/super.c:2112
   deactivate_locked_super+0xc4/0x130 fs/super.c:473
   cleanup_mnt+0x41f/0x4b0 fs/namespace.c:1373
   task_work_run+0x24f/0x310 kernel/task_work.c:239
   ptrace_notify+0x2d2/0x380 kernel/signal.c:2503
   ptrace_report_syscall include/linux/ptrace.h:415 [inline]
   ptrace_report_syscall_exit include/linux/ptrace.h:477 [inline]
   syscall_exit_work+0xc7/0x1d0 kernel/entry/common.c:173
   syscall_exit_to_user_mode_prepare kernel/entry/common.c:200 [inline]
   __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
   syscall_exit_to_user_mode+0x24a/0x340 kernel/entry/common.c:218
   do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  The buggy address belongs to the object at ffff8880259d1e00
   which belongs to the cache task_struct of size 7424
  The buggy address is located 2584 bytes inside of
   freed 7424-byte region [ffff8880259d1e00, ffff8880259d3b00)

  The buggy address belongs to the physical page:
  page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x259d0
  head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
  memcg:ffff88802f4b56c1
  flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
  page_type: f5(slab)
  raw: 00fff00000000040 ffff88801bafe500 dead000000000100 dead000000000122
  raw: 0000000000000000 0000000000040004 00000001f5000000 ffff88802f4b56c1
  head: 00fff00000000040 ffff88801bafe500 dead000000000100 dead000000000122
  head: 0000000000000000 0000000000040004 00000001f5000000 ffff88802f4b56c1
  head: 00fff00000000003 ffffea0000967401 ffffffffffffffff 0000000000000000
  head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: kasan: bad access detected
  page_owner tracks the page as allocated
  page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 12, tgid 12 (kworker/u8:1), ts 7328037942, free_ts 0
   set_page_owner include/linux/page_owner.h:32 [inline]
   post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1556
   prep_new_page mm/page_alloc.c:1564 [inline]
   get_page_from_freelist+0x3651/0x37a0 mm/page_alloc.c:3474
   __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4751
   alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
   alloc_slab_page+0x6a/0x140 mm/slub.c:2408
   allocate_slab+0x5a/0x2f0 mm/slub.c:2574
   new_slab mm/slub.c:2627 [inline]
   ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3815
   __slab_alloc+0x58/0xa0 mm/slub.c:3905
   __slab_alloc_node mm/slub.c:3980 [inline]
   slab_alloc_node mm/slub.c:4141 [inline]
   kmem_cache_alloc_node_noprof+0x269/0x380 mm/slub.c:4205
   alloc_task_struct_node kernel/fork.c:180 [inline]
   dup_task_struct+0x57/0x8c0 kernel/fork.c:1113
   copy_process+0x5d1/0x3d50 kernel/fork.c:2225
   kernel_clone+0x223/0x870 kernel/fork.c:2807
   user_mode_thread+0x132/0x1a0 kernel/fork.c:2885
   call_usermodehelper_exec_work+0x5c/0x230 kernel/umh.c:171
   process_one_work kernel/workqueue.c:3229 [inline]
   process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3310
   worker_thread+0x870/0xd30 kernel/workqueue.c:3391
  page_owner free stack trace missing

  Memory state around the buggy address:
   ffff8880259d2700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880259d2780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  >ffff8880259d2800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                              ^
   ffff8880259d2880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880259d2900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================

Fix this by flushing the delalloc workers queue before stopping the
cleaner kthread.

Reported-by: syzbot+b7cf50a0c173770dcb14@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/674ed7e8.050a0220.48a03.0031.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-06 15:04:18 +01:00
Filipe Manana
c3a5888e0f btrfs: remove fs_info parameter from btrfs_cleanup_one_transaction()
The fs_info parameter is redundant because it can be extracted from the
transaction given as another parameter. So remove it and use the fs_info
accessible from the transaction.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:19 +01:00
Filipe Manana
2f6e05a5cc btrfs: remove fs_info parameter from btrfs_destroy_delayed_refs()
The fs_info parameter is redundant because it can be extracted from the
transaction given as another parameter. So remove it and use the fs_info
accessible from the transaction.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:19 +01:00
Filipe Manana
22a0ae1889 btrfs: move btrfs_destroy_delayed_refs() to delayed-ref.c
It's better suited at delayed-ref.c since it's about delayed refs and
contains logic to iterate over them (using the red black tree, doing all
the locking, freeing, etc), so move it from disk-io.c, which is pretty
big, into delayed-ref.c, hiding implementation details of how delayed
refs are tracked and managed. This also facilitates the next patches in
the series.

This change moves the code between files but also does the following
simple cleanups:

1) Rename the 'cache' variable to 'bg', since it's a block group
   (the 'cache' logic comes from old days where the block group
   structure was named 'btrfs_block_group_cache');

2) Move the 'ref' variable declaration to the scope of the inner
   while loop, since it's not used outside that loop.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:19 +01:00
Filipe Manana
00f529661b btrfs: remove BUG_ON() at btrfs_destroy_delayed_refs()
At btrfs_destroy_delayed_refs() it's unexpected to not find the block
group to which a delayed reference's extent belongs to, so we have this
BUG_ON(), not just because it's highly unexpected but also because we
don't know what to do there.

Since we are in the transaction abort path, there's nothing we can do
other than proceed and cleanup all used resources we can. So remove
the BUG_ON() and deal with a missing block group by logging an error
message and continuing to cleanup all we can related to the current
delayed ref head and moving to other delayed refs.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:19 +01:00
Filipe Manana
e7fa845010 btrfs: rename extent map shrinker members from struct btrfs_fs_info
The names for the members of struct btrfs_fs_info related to the extent
map shrinker are a bit too long, so rename them to be shorter by replacing
the "extent_map_" prefix with the "em_" prefix.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:17 +01:00
Filipe Manana
70a5f9e266 btrfs: simplify tracking progress for the extent map shrinker
Now that the extent map shrinker can only be run by a single task (as a
work queue item) there is no need to keep the progress of the shrinker
protected by a spinlock and passing the progress to trace events as
parameters. So remove the lock and simplify the arguments for the trace
events.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:17 +01:00
Filipe Manana
1020443840 btrfs: make the extent map shrinker run asynchronously as a work queue job
Currently the extent map shrinker is run synchronously for kswapd tasks
that end up calling the fs shrinker (fs/super.c:super_cache_scan()).
This has some disadvantages and for some heavy workloads with memory
pressure it can cause some delays and stalls that make a machine
unresponsive for some periods. This happens because:

1) We can have several kswapd tasks on machines with multiple NUMA zones,
   and running the extent map shrinker concurrently can cause high
   contention on some spin locks, namely the spin locks that protect
   the radix tree that tracks roots, the per root xarray that tracks
   open inodes and the list of delayed iputs. This not only delays the
   shrinker but also causes high CPU consumption and makes the task
   running the shrinker monopolize a core, resulting in the symptoms
   of an unresponsive system. This was noted in previous commits such as
   commit ae1e766f62 ("btrfs: only run the extent map shrinker from
   kswapd tasks");

2) The extent map shrinker's iteration over inodes can often be slow, even
   after changing the data structure that tracks open inodes for a root
   from a red black tree (up to kernel 6.10) to an xarray (kernel 6.10+).
   The transition to the xarray while it made things a bit faster, it's
   still somewhat slow - for example in a test scenario with 10000 inodes
   that have no extent maps loaded, the extent map shrinker took between
   5ms to 8ms, using a release, non-debug kernel. Iterating over the
   extent maps of an inode can also be slow if have an inode with many
   thousands of extent maps, since we use a red black tree to track and
   search extent maps. So having the extent map shrinker run synchronously
   adds extra delay for other things a kswapd task does.

So make the extent map shrinker run asynchronously as a job for the
system unbounded workqueue, just like what we do for data and metadata
space reclaim jobs.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:17 +01:00
David Sterba
d12a1a2a30 btrfs: drop unused parameter transaction from alloc_log_tree()
The function got split in commit 6ab6ebb760 ("btrfs: split
alloc_log_tree()") and since then transaction parameter has been unused.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:17 +01:00
David Sterba
87cbab8636 btrfs: drop unused parameter options from open_ctree()
Since the new mount option parser in commit ad21f15b0f ("btrfs:
switch to the new mount API") we don't pass the options like that
anymore.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:17 +01:00