The write-retry algorithm will insert extra subrequests into the list if it
can't get sufficient capacity to split the range that needs to be retried
into the sequence of subrequests it currently has (for instance, if the
cifs credit pool has fewer credits available than it did when the range was
originally divided).
However, the allocator furnishes each new subreq with 2 refs and then
another is added for resubmission, causing one to be leaked.
Fix this by replacing the ref-getting line with a neutral trace line.
Fixes: 288ace2f57 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-6-dhowells@redhat.com
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
netfs_wait_for_request() and netfs_wait_for_pause() can loop forever if
netfs_collect_in_app() returns 2, indicating that it wants to repeat
because the ALL_QUEUED flag isn't yet set and there are no subreqs left
that haven't been collected.
The problem is that, unless collection is offloaded (OFFLOAD_COLLECTION),
we have to return to the application thread to continue and eventually set
ALL_QUEUED after pausing to deal with a retry - but we never get there.
Fix this by inserting checks for the IN_PROGRESS and PAUSE flags as
appropriate before cycling round - and add cond_resched() for good measure.
Fixes: 2b1424cd13 ("netfs: Fix wait/wake to be consistent about the waitqueue used")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-5-dhowells@redhat.com
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
If a netfs request finishes during the pause loop, it will have the ref
that belongs to the IN_PROGRESS flag removed at that point - however, if it
then goes to the final wait loop, that will *also* put the ref because it
sees that the IN_PROGRESS flag is clear and incorrectly assumes that this
happened when it called the collector.
In fact, since IN_PROGRESS is clear, we shouldn't call the collector again
since it's done all the cleanup, such as calling ->ki_complete().
Fix this by making netfs_collect_in_app() just return, indicating that
we're done if IN_PROGRESS is removed.
Fixes: 2b1424cd13 ("netfs: Fix wait/wake to be consistent about the waitqueue used")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-3-dhowells@redhat.com
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: Steve French <sfrench@samba.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
When doing a DIO read, if the subrequests we issue fail and cause the
request PAUSE flag to be set to put a pause on subrequest generation, we
may complete collection of the subrequests (possibly discarding them) prior
to the ALL_QUEUED flags being set.
In such a case, netfs_read_collection() doesn't see ALL_QUEUED being set
after netfs_collect_read_results() returns and will just return to the app
(the collector can be seen unpausing the generator in the trace log).
The subrequest generator can then set ALL_QUEUED and the app thread reaches
netfs_wait_for_request(). This causes netfs_collect_in_app() to be called
to see if we're done yet, but there's missing case here.
netfs_collect_in_app() will see that a thread is active and set inactive to
false, but won't see any subrequests in the read stream, and so won't set
need_collect to true. The function will then just return 0, indicating
that the caller should just sleep until further activity (which won't be
forthcoming) occurs.
Fix this by making netfs_collect_in_app() check to see if an active thread
is complete - i.e. that ALL_QUEUED is set and the subrequests list is empty
- and to skip the sleep return path. The collector will then be called
which will clear the request IN_PROGRESS flag, allowing the app to
progress.
Fixes: 2b1424cd13 ("netfs: Fix wait/wake to be consistent about the waitqueue used")
Reported-by: Steve French <sfrench@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-2-dhowells@redhat.com
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
The ready event list of an epoll object is protected by read-write
semaphore:
- The consumer (waiter) acquires the write lock and takes items.
- the producer (waker) takes the read lock and adds items.
The point of this design is enabling epoll to scale well with large number
of producers, as multiple producers can hold the read lock at the same
time.
Unfortunately, this implementation may cause scheduling priority inversion
problem. Suppose the consumer has higher scheduling priority than the
producer. The consumer needs to acquire the write lock, but may be blocked
by the producer holding the read lock. Since read-write semaphore does not
support priority-boosting for the readers (even with CONFIG_PREEMPT_RT=y),
we have a case of priority inversion: a higher priority consumer is blocked
by a lower priority producer. This problem was reported in [1].
Furthermore, this could also cause stall problem, as described in [2].
To fix this problem, make the event list half-lockless:
- The consumer acquires a mutex (ep->mtx) and takes items.
- The producer locklessly adds items to the list.
Performance is not the main goal of this patch, but as the producer now can
add items without waiting for consumer to release the lock, performance
improvement is observed using the stress test from
https://github.com/rouming/test-tools/blob/master/stress-epoll.c. This is
the same test that justified using read-write semaphore in the past.
Testing using 12 x86_64 CPUs:
Before After Diff
threads events/ms events/ms
8 6932 19753 +185%
16 7820 27923 +257%
32 7648 35164 +360%
64 9677 37780 +290%
128 11166 38174 +242%
Testing using 1 riscv64 CPU (averaged over 10 runs, as the numbers are
noisy):
Before After Diff
threads events/ms events/ms
1 73 129 +77%
2 151 216 +43%
4 216 364 +69%
8 234 382 +63%
16 251 392 +56%
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1]
Reported-by: Valentin Schneider <vschneid@redhat.com>
Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/ [2]
Signed-off-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/20250527090836.1290532-1-namcao@linutronix.de
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Today, a few work structs inside tcon are initialized inside
cifs_get_tcon and not in tcon_info_alloc. As a result, if a tcon
is obtained from tcon_info_alloc, but not called as a part of
cifs_get_tcon, we may trip over.
Cc: <stable@vger.kernel.org>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add FALLOC_FL_ALLOCATE_RANGE to the set of supported fallocate flags in
XFS_FALLOC_FL_SUPPORTED. This change improves code clarity and maintains
by explicitly showing this flag in the supported flags mask.
Note that since FALLOC_FL_ALLOCATE_RANGE is defined as 0x00, this addition
has no functional modifications.
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When SMB 3.1.1 POSIX Extensions are negotiated, userspace applications
using readdir() or getdents() calls without stat() on each individual file
(such as a simple "ls" or "find") would misidentify file types and exhibit
strange behavior such as not descending into directories. The reason for
this behavior is an oversight in the cifs_posix_to_fattr conversion
function. Instead of extracting the entry type for cf_dtype from the
properly converted cf_mode field, it tries to extract the type from the
PDU. While the wire representation of the entry mode is similar in
structure to POSIX stat(), the assignments of the entry types are
different. Applying the S_DT macro to cf_mode instead yields the correct
result. This is also what the equivalent function
smb311_posix_info_to_fattr in inode.c already does for stat() etc.; which
is why "ls -l" would give the correct file type but "ls" would not (as
identified by the colors).
Cc: stable@vger.kernel.org
Signed-off-by: Philipp Kerling <pkerling@casix.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmhfImcACgkQiiy9cAdy
T1EMWwv+LSveqDxylD28bL12GuCudX2b894QzUhckC9DHerRDarLcWSEvRo2/dT1
y5s//OL2NmvbfxfTnbQLSY7o+r3NOvF6XRAd8ooOSolyEQKWeRO5byBFFszb/PNC
ajlXX+xgA4VxzZWWOXOFYAz5EIPF4avYxHmvF9BD/RhS6u2k9ZaDCvtOqzreOlfa
t1V/eqcZwvlde7dObTEgJKgHD5Mej/+vlDRu+vQnrqOYnrLbZgWImOm05lJyKpat
qibPrnOtejzcWChpjvv2OqF4Jsige/Bzko3W/v60ZOsFfc5eegxAHpGrdbFuI4fN
E9vP/PA8n9gS8LZZkHTuHv7b78XOx06qhtiNcPNIpmJmlg1I2XlAwa1ifDd5F5Im
C4sstB2ujJ28k5ReqkYnd5Q8SvVmoCcP+31bP6ibUrk+onYrXPHRJp+CvUrniYIp
WNHkPwL2khrsRjHb2GnKwQQvdvyZa0/Vams2WtTHsK6CMOQMm14MHcJq1KNsd4oz
UXNq2D5j
=vWu+
-----END PGP SIGNATURE-----
Merge tag 'v6.16-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Multichannel reconnect lock ordering deadlock fix
- Fix for regression in handling native Windows symlinks
- Three smbdirect fixes:
- oops in RDMA response processing
- smbdirect memcpy issue
- fix smbdirect regression with large writes (smbdirect test cases
now all passing)
- Fix for "FAILED_TO_PARSE" warning in trace-cmd report output
* tag 'v6.16-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix reading into an ITER_FOLIOQ from the smbdirect code
cifs: Fix the smbd_response slab to allow usercopy
smb: client: fix potential deadlock when reconnecting channels
smb: client: remove \t from TP_printk statements
smb: client: let smbd_post_send_iter() respect the peers max_send_size and transmit all data
smb: client: fix regression with native SMB symlinks
or aren't considered necessary for -stable kernels. 5 are for MM.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaF8vtQAKCRDdBJ7gKXxA
jlK9AP9Syx5isoE7MAMKjr9iI/2z+NRaCCro/VM4oQk8m2cNFgD/ZsL9YMhjZlcL
bMIVUZ9E+yf1w9dLeHLoDba+pnF7Wwc=
=vdkO
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"16 hotfixes.
6 are cc:stable and the remainder address post-6.15 issues or aren't
considered necessary for -stable kernels. 5 are for MM"
* tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add Lorenzo as THP co-maintainer
mailmap: update Duje Mihanović's email address
selftests/mm: fix validate_addr() helper
crashdump: add CONFIG_KEYS dependency
mailmap: correct name for a historical account of Zijun Hu
mailmap: add entries for Zijun Hu
fuse: fix runtime warning on truncate_folio_batch_exceptionals()
scripts/gdb: fix dentry_name() lookup
mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write
mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the percpu variable tag->counters
lib/group_cpus: fix NULL pointer dereference from group_cpus_evenly()
mm/hugetlb: remove unnecessary holding of hugetlb_lock
MAINTAINERS: add missing files to mm page alloc section
MAINTAINERS: add tree entry to mm init block
mm: add OOM killer maintainer structure
fs/proc/task_mmu: fix PAGE_IS_PFNZERO detection for the huge zero folio
We are setting the parent directory's last_unlink_trans directly which
may result in a concurrent task starting to log the directory not see the
update and therefore can log the directory after we removed a child
directory which had a snapshot within instead of falling back to a
transaction commit. Replaying such a log tree would result in a mount
failure since we can't currently delete snapshots (and subvolumes) during
log replay. This is the type of failure described in commit 1ec9a1ae1e
("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync").
Fix this by using btrfs_record_snapshot_destroy() which updates the
last_unlink_trans field while holding the inode's log_mutex lock.
Fixes: 44f714dae5 ("Btrfs: improve performance on fsync against new inode after rename/unlink")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In case the removed directory had a snapshot that was deleted, we are
propagating its inode's last_unlink_trans to the parent directory after
we removed the entry from the parent directory. This leaves a small race
window where someone can log the parent directory after we removed the
entry and before we updated last_unlink_trans, and as a result if we ever
try to replay such a log tree, we will fail since we will attempt to
remove a snapshot during log replay, which is currently not possible and
results in the log replay (and mount) to fail. This is the type of failure
described in commit 1ec9a1ae1e ("Btrfs: fix unreplayable log after
snapshot delete + parent dir fsync").
So fix this by propagating the last_unlink_trans to the parent directory
before we remove the entry from it.
Fixes: 44f714dae5 ("Btrfs: improve performance on fsync against new inode after rename/unlink")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of recording that a new subvolume was created in a directory after
we add the entry do the directory, record it before adding the entry. This
is to avoid races where after creating the entry and before recording the
new subvolume in the directory (the call to btrfs_record_new_subvolume()),
another task logs the directory, so we end up with a log tree where we
logged a directory that has an entry pointing to a root that was not yet
committed, resulting in an invalid entry if the log is persisted and
replayed later due to a power failure or crash.
Also state this requirement in the function comment for
btrfs_record_new_subvolume(), similar to what we do for the
btrfs_record_unlink_dir() and btrfs_record_snapshot_destroy().
Fixes: 45c4102f0d ("btrfs: avoid transaction commit on any fsync after subvolume creation")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When replaying log trees we use read_one_inode() to get an inode, which is
just a wrapper around btrfs_iget_logging(), which in turn is a wrapper for
btrfs_iget(). But read_one_inode() always returns NULL for any error
that btrfs_iget_logging() / btrfs_iget() may return and this is a problem
because:
1) In many callers of read_one_inode() we convert the NULL into -EIO,
which is not accurate since btrfs_iget() may return -ENOMEM and -ENOENT
for example, besides -EIO and other errors. So during log replay we
may end up reporting a false -EIO, which is confusing since we may
not have had any IO error at all;
2) When replaying directory deletes, at replay_dir_deletes(), we assume
the NULL returned from read_one_inode() means that the inode doesn't
exist and then proceed as if no error had happened. This is wrong
because unless btrfs_iget() returned ERR_PTR(-ENOENT), we had an
actual error and the target inode may exist in the target subvolume
root - this may later result in the log replay code failing at a
later stage (if we are "lucky") or succeed but leaving some
inconsistency in the filesystem.
So fix this by not ignoring errors from btrfs_iget_logging() and as
a consequence remove the read_one_inode() wrapper and just use
btrfs_iget_logging() directly. Also since btrfs_iget_logging() is
supposed to be called only against subvolume roots, just like
read_one_inode() which had a comment about it, add an assertion to
btrfs_iget_logging() to check that the target root corresponds to a
subvolume root.
Fixes: 5d4f98a28c ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At __inode_add_ref() when processing extrefs, if we jump into the next
label we have an undefined value of victim_name.len, since we haven't
initialized it before we did the goto. This results in an invalid memory
access in the next iteration of the loop since victim_name.len was not
initialized to the length of the name of the current extref.
Fix this by initializing victim_name.len with the current extref's name
length.
Fixes: e43eec81c5 ("btrfs: use struct qstr instead of name and namelen pairs")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During log replay, at __add_inode_ref(), when we are searching for inode
ref keys we totally ignore if btrfs_search_slot() returns an error. This
may make a log replay succeed when there was an actual error and leave
some metadata inconsistency in a subvolume tree. Fix this by checking if
an error was returned from btrfs_search_slot() and if so, return it to
the caller.
Fixes: e02119d5a7 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we are rebuilding a free space tree, while modifying the free space
tree we may need to allocate a new metadata block group.
If we end up using multiple transactions for the rebuild, when we call
btrfs_end_transaction() we enter btrfs_create_pending_block_groups()
which calls add_block_group_free_space() to add items to the free space
tree for the block group.
Then later during the free space tree rebuild, at
btrfs_rebuild_free_space_tree(), we may find such new block groups
and call populate_free_space_tree() for them, which fails with -EEXIST
because there are already items in the free space tree. Then we abort the
transaction with -EEXIST at btrfs_rebuild_free_space_tree().
Notice that we say "may find" the new block groups because a new block
group may be inserted in the block groups rbtree, which is being iterated
by the rebuild process, before or after the current node where the rebuild
process is currently at.
Syzbot recently reported such case which produces a trace like the
following:
------------[ cut here ]------------
BTRFS: Transaction aborted (error -17)
WARNING: CPU: 1 PID: 7626 at fs/btrfs/free-space-tree.c:1341 btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341
Modules linked in:
CPU: 1 UID: 0 PID: 7626 Comm: syz.2.25 Not tainted 6.15.0-rc7-syzkaller-00085-gd7fa1af5b33e-dirty #0 PREEMPT
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341
lr : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341
sp : ffff80009c4f7740
x29: ffff80009c4f77b0 x28: ffff0000d4c3f400 x27: 0000000000000000
x26: dfff800000000000 x25: ffff70001389eee8 x24: 0000000000000003
x23: 1fffe000182b6e7b x22: 0000000000000000 x21: ffff0000c15b73d8
x20: 00000000ffffffef x19: ffff0000c15b7378 x18: 1fffe0003386f276
x17: ffff80008f31e000 x16: ffff80008adbe98c x15: 0000000000000001
x14: 1fffe0001b281550 x13: 0000000000000000 x12: 0000000000000000
x11: ffff60001b281551 x10: 0000000000000003 x9 : 1c8922000a902c00
x8 : 1c8922000a902c00 x7 : ffff800080485878 x6 : 0000000000000000
x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff80008047843c
x2 : 0000000000000001 x1 : ffff80008b3ebc40 x0 : 0000000000000001
Call trace:
btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 (P)
btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074
btrfs_remount_rw fs/btrfs/super.c:1319 [inline]
btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543
reconfigure_super+0x1d4/0x6f0 fs/super.c:1083
do_remount fs/namespace.c:3365 [inline]
path_mount+0xb34/0xde0 fs/namespace.c:4200
do_mount fs/namespace.c:4221 [inline]
__do_sys_mount fs/namespace.c:4432 [inline]
__se_sys_mount fs/namespace.c:4409 [inline]
__arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767
el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600
irq event stamp: 330
hardirqs last enabled at (329): [<ffff80008048590c>] raw_spin_rq_unlock_irq kernel/sched/sched.h:1525 [inline]
hardirqs last enabled at (329): [<ffff80008048590c>] finish_lock_switch+0xb0/0x1c0 kernel/sched/core.c:5130
hardirqs last disabled at (330): [<ffff80008adb9e60>] el1_dbg+0x24/0x80 arch/arm64/kernel/entry-common.c:511
softirqs last enabled at (10): [<ffff8000801fbf10>] local_bh_enable+0x10/0x34 include/linux/bottom_half.h:32
softirqs last disabled at (8): [<ffff8000801fbedc>] local_bh_disable+0x10/0x34 include/linux/bottom_half.h:19
---[ end trace 0000000000000000 ]---
Fix this by flagging new block groups which had their free space tree
entries already added and then skip them in the rebuild process. Also,
since the rebuild may be triggered when doing a remount, make sure that
when we clear an existing free space tree that we clear such flag from
every existing block group, otherwise we would skip those block groups
during the rebuild.
Reported-by: syzbot+d0014fb0fc39c5487ae5@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/68460a54.050a0220.daf97.0af5.GAE@google.com/
Fixes: 882af9f13e ("btrfs: handle free space tree rebuild in multiple transactions")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Checking for invalid IDs was introduced in 9e7cfb35e2 ("bcachefs: Check for invalid btree IDs")
to prevent an invalid shift later, but since 1415265480 ("bcachefs: Bad btree roots are now autofix")
which made btree_root_bkey_invalid autofix, the fsck_err_on call didn't
do anything.
We can mark this err type (invalid_btree_id) autofix as well, so it gets
handled.
Reported-by: syzbot+029d1989099aa5ae3e89@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=029d1989099aa5ae3e89
Fixes: 1415265480 ("bcachefs: Bad btree roots are now autofix")
Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Unmount of a shutdown filesystem can hang with stale inode cluster
buffers in the AIL like so:
[95964.140623] Call Trace:
[95964.144641] __schedule+0x699/0xb70
[95964.154003] schedule+0x64/0xd0
[95964.156851] xfs_ail_push_all_sync+0x9b/0xf0
[95964.164816] xfs_unmount_flush_inodes+0x41/0x70
[95964.168698] xfs_unmountfs+0x7f/0x170
[95964.171846] xfs_fs_put_super+0x3b/0x90
[95964.175216] generic_shutdown_super+0x77/0x160
[95964.178060] kill_block_super+0x1b/0x40
[95964.180553] xfs_kill_sb+0x12/0x30
[95964.182796] deactivate_locked_super+0x38/0x100
[95964.185735] deactivate_super+0x41/0x50
[95964.188245] cleanup_mnt+0x9f/0x160
[95964.190519] __cleanup_mnt+0x12/0x20
[95964.192899] task_work_run+0x89/0xb0
[95964.195221] resume_user_mode_work+0x4f/0x60
[95964.197931] syscall_exit_to_user_mode+0x76/0xb0
[95964.201003] do_syscall_64+0x74/0x130
$ pstree -N mnt |grep umount
|-check-parallel---nsexec---run_test.sh---753---umount
It always seems to be generic/753 that triggers this, and repeating
a quick group test run triggers it every 10-15 iterations. Hence it
generally triggers once up every 30-40 minutes of test time. just
running generic/753 by itself or concurrently with a limited group
of tests doesn't reproduce this issue at all.
Tracing on a hung system shows the AIL repeating every 50ms a log
force followed by an attempt to push pinned, aborted inodes from the
AIL (trimmed for brevity):
xfs_log_force: lsn 0x1c caller xfsaild+0x18e
xfs_log_force: lsn 0x0 caller xlog_cil_flush+0xbd
xfs_log_force: lsn 0x1c caller xfs_log_force+0x77
xfs_ail_pinned: lip 0xffff88826014afa0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff88814000a708 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff88810b850c80 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff88810b850af0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff888165cf0a28 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff88810b850bb8 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
....
The inode log items are marked as aborted, which means that either:
a) a transaction commit has occurred, seen an error or shutdown, and
called xfs_trans_free_items() to abort the items. This should happen
before any pinning of log items occurs.
or
b) a dirty transaction has been cancelled. This should also happen
before any pinning of log items occurs.
or
c) AIL insertion at journal IO completion is marked as aborted. In
this case, the log item is pinned by the CIL until journal IO
completes and hence needs to be unpinned. This is then done after
the ->iop_committed() callback is run, so the pin count should be
balanced correctly.
Yet none of these seemed to be occurring. Further tracing indicated
this:
d) Shutdown during CIL pushing resulting in log item completion
being called from checkpoint abort processing. Items are unpinned
and released without serialisation against each other, journal IO
completion or transaction commit completion.
In this case, we may still have a transaction commit in flight that
holds a reference to a xfs_buf_log_item (BLI) after CIL insertion.
e.g. a synchronous transaction will flush the CIL before the
transaction is torn down. The concurrent CIL push then aborts
insertion it and drops the commit/AIL reference to the BLI. This can
leave the transaction commit context with the last reference to the
BLI which is dropped here:
xfs_trans_free_items()
->iop_release
xfs_buf_item_release
xfs_buf_item_put
if (XFS_LI_ABORTED)
xfs_trans_ail_delete
xfs_buf_item_relse()
Unlike the journal completion ->iop_unpin path, this path does not
run stale buffer completion process when it drops the last
reference, hence leaving the stale inodes attached to the buffer
sitting the AIL. There are no other references to those inodes, so
there is no other mechanism to remove them from the AIL. Hence
unmount hangs.
The buffer lock context for stale buffers is passed to the last BLI
reference. This is normally the last BLI unpin on journal IO
completion. The unpin then processes the stale buffer completion and
releases the buffer lock. However, if the final unpin from journal
IO completion (or CIL push abort) does not hold the last reference
to the BLI, there -must- still be a transaction context that
references the BLI, and so that context must perform the stale
buffer completion processing before the buffer is unlocked and the
BLI torn down.
The fix for this is to rework the xfs_buf_item_relse() path to run
stale buffer completion processing if it drops the last reference to
the BLI. We still hold the buffer locked, so the buffer owner and
lock context is the same as if we passed the BLI and buffer to the
->iop_unpin() context to finish stale process on journal commit.
However, we have to be careful here. In a shutdown state, we can be
freeing dirty BLIs from xfs_buf_item_put() via xfs_trans_brelse()
and xfs_trans_bdetach(). The existing code handles this case by
considering shutdown state as "aborted", but in doing so
largely masks the failure to clean up stale BLI state from the
xfs_buf_item_relse() path. i.e regardless of the shutdown state and
whether the item is in the AIL, we must finish the stale buffer
cleanup if we are are dropping the last BLI reference from the
->iop_relse path in transaction commit context.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The stale buffer item completion handling is currently only done
from BLI unpinning. We need to perform this function from where-ever
the last reference to the BLI is dropped, so first we need to
factor this code out into a helper.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The code to initialise, release and free items is all the way down
the bottom of the file. Upcoming fixes need to these functions
earlier in the file, so move them to the top.
There is one code change in this move - the parameter to
xfs_buf_item_relse() is changed from the xfs_buf to the
xfs_buf_log_item - the thing that the function is releasing.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
I needed more insight into how stale inodes were getting stuck on
the AIL after a forced shutdown when running fsstress. These are the
tracepoints I added for that purpose.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
On shutdown when quotas are enabled, the shutdown can deadlock
trying to unpin the dquot buffer buf_log_item like so:
[ 3319.483590] task:kworker/20:0H state:D stack:14360 pid:1962230 tgid:1962230 ppid:2 task_flags:0x4208060 flags:0x00004000
[ 3319.493966] Workqueue: xfs-log/dm-6 xlog_ioend_work
[ 3319.498458] Call Trace:
[ 3319.500800] <TASK>
[ 3319.502809] __schedule+0x699/0xb70
[ 3319.512672] schedule+0x64/0xd0
[ 3319.515573] schedule_timeout+0x30/0xf0
[ 3319.528125] __down_common+0xc3/0x200
[ 3319.531488] __down+0x1d/0x30
[ 3319.534186] down+0x48/0x50
[ 3319.540501] xfs_buf_lock+0x3d/0xe0
[ 3319.543609] xfs_buf_item_unpin+0x85/0x1b0
[ 3319.547248] xlog_cil_committed+0x289/0x570
[ 3319.571411] xlog_cil_process_committed+0x6d/0x90
[ 3319.575590] xlog_state_shutdown_callbacks+0x52/0x110
[ 3319.580017] xlog_force_shutdown+0x169/0x1a0
[ 3319.583780] xlog_ioend_work+0x7c/0xb0
[ 3319.587049] process_scheduled_works+0x1d6/0x400
[ 3319.591127] worker_thread+0x202/0x2e0
[ 3319.594452] kthread+0x20c/0x240
The CIL push has seen the deadlock, so it has aborted the push and
is running CIL checkpoint completion to abort all the items in the
checkpoint. This calls ->iop_unpin(remove = true) to clean up the
log items in the checkpoint.
When a buffer log item is unpined like this, it needs to lock the
buffer to run io completion to correctly fail the buffer and run all
the required completions to fail attached log items as well. In this
case, the attempt to lock the buffer on unpin is hanging because the
buffer is already locked.
I suspected a leaked XFS_BLI_HOLD state because of XFS_BLI_STALE
handling changes I was testing, so I went looking for
pin events on HOLD buffers and unpin events on locked buffer. That
isolated this one buffer with these two events:
xfs_buf_item_pin: dev 251:6 daddr 0xa910 bbcount 0x2 hold 2 pincount 0 lock 0 flags DONE|KMEM recur 0 refcount 1 bliflags HOLD|DIRTY|LOGGED liflags DIRTY
....
xfs_buf_item_unpin: dev 251:6 daddr 0xa910 bbcount 0x2 hold 4 pincount 1 lock 0 flags DONE|KMEM recur 0 refcount 1 bliflags DIRTY liflags ABORTED
Firstly, bbcount = 0x2, which means it is not a single sector
structure. That rules out every xfs_trans_bhold() case except one:
dquot buffers.
Then hung task dumping gave this trace:
[ 3197.312078] task:fsync-tester state:D stack:12080 pid:2051125 tgid:2051125 ppid:1643233 task_flags:0x400000 flags:0x00004002
[ 3197.323007] Call Trace:
[ 3197.325581] <TASK>
[ 3197.327727] __schedule+0x699/0xb70
[ 3197.334582] schedule+0x64/0xd0
[ 3197.337672] schedule_timeout+0x30/0xf0
[ 3197.350139] wait_for_completion+0xbd/0x180
[ 3197.354235] __flush_workqueue+0xef/0x4e0
[ 3197.362229] xlog_cil_force_seq+0xa0/0x300
[ 3197.374447] xfs_log_force+0x77/0x230
[ 3197.378015] xfs_qm_dqunpin_wait+0x49/0xf0
[ 3197.382010] xfs_qm_dqflush+0x55/0x460
[ 3197.385663] xfs_qm_dquot_isolate+0x29e/0x4d0
[ 3197.389977] __list_lru_walk_one+0x141/0x220
[ 3197.398867] list_lru_walk_one+0x10/0x20
[ 3197.402713] xfs_qm_shrink_scan+0x6a/0x100
[ 3197.406699] do_shrink_slab+0x18a/0x350
[ 3197.410512] shrink_slab+0xf7/0x430
[ 3197.413967] drop_slab+0x97/0xf0
[ 3197.417121] drop_caches_sysctl_handler+0x59/0xc0
[ 3197.421654] proc_sys_call_handler+0x18b/0x280
[ 3197.426050] proc_sys_write+0x13/0x20
[ 3197.429750] vfs_write+0x2b8/0x3e0
[ 3197.438532] ksys_write+0x7e/0xf0
[ 3197.441742] __x64_sys_write+0x1b/0x30
[ 3197.445363] x64_sys_call+0x2c72/0x2f60
[ 3197.449044] do_syscall_64+0x6c/0x140
[ 3197.456341] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Yup, another test run by check-parallel is running drop_caches
concurrently and the dquot shrinker for the hung filesystem is
running. That's trying to flush a dirty dquot from reclaim context,
and it waiting on a log force to complete. xfs_qm_dqflush is called
with the dquot buffer held locked, and so we've called
xfs_log_force() with that buffer locked.
Now the log force is waiting for a workqueue flush to complete, and
that workqueue flush is waiting of CIL checkpoint processing to
finish.
The CIL checkpoint processing is aborting all the log items it has,
and that requires locking aborted buffers to cancel them.
Now, normally this isn't a problem if we are issuing a log force
to unpin an object, because the ->iop_unpin() method wakes pin
waiters first. That results in the pin waiter finishing off whatever
it was doing, dropping the lock and then xfs_buf_item_unpin() can
lock the buffer and fail it.
However, xfs_qm_dqflush() is waiting on the -dquot- unpin event, not
the dquot buffer unpin event, and so it never gets woken and so does
not drop the buffer lock.
Inodes do not have this problem, as they can only be written from
one spot (->iop_push) whilst dquots can be written from multiple
places (memory reclaim, ->iop_push, xfs_dq_dqpurge, and quotacheck).
The reason that the dquot buffer has an attached buffer log item is
that it has been recently allocated. Initialisation of the dquot
buffer logs the buffer directly, thereby pinning it in memory. We
then modify the dquot in a separate operation, and have memory
reclaim racing with a shutdown and we trigger this deadlock.
check-parallel reproduces this reliably on 1kB FSB filesystems with
quota enabled because it does all of these things concurrently
without having to explicitly write tests to exercise these corner
case conditions.
xfs_qm_dquot_logitem_push() doesn't have this deadlock because it
checks if the dquot is pinned before locking the dquot buffer and
skipping it if it is pinned. This means the xfs_qm_dqunpin_wait()
log force in xfs_qm_dqflush() never triggers and we unlock the
buffer safely allowing a concurrent shutdown to fail the buffer
appropriately.
xfs_qm_dqpurge() could have this problem as it is called from
quotacheck and we might have allocated dquot buffers when recording
the quota updates. This can be fixed by calling
xfs_qm_dqunpin_wait() before we lock the dquot buffer. Because we
hold the dquot locked, nothing will be able to add to the pin count
between the unpin_wait and the dqflush callout, so this now makes
xfs_qm_dqpurge() safe against this race.
xfs_qm_dquot_isolate() can also be fixed this same way but, quite
frankly, we shouldn't be doing IO in memory reclaim context. If the
dquot is pinned or dirty, simply rotate it and let memory reclaim
come back to it later, same as we do for inodes.
This then gets rid of the nasty issue in xfs_qm_flush_one() where
quotacheck writeback races with memory reclaim flushing the dquots.
We can lift xfs_qm_dqunpin_wait() up into this code, then get rid of
the "can't get the dqflush lock" buffer write to cycle the dqlfush
lock and enable it to be flushed again. checking if the dquot is
pinned and returning -EAGAIN so that the dquot walk will revisit the
dquot again later.
Finally, with xfs_qm_dqunpin_wait() lifted into all the callers,
we can remove it from the xfs_qm_dqflush() code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
There is a race condition that can trigger in dmflakey fstests that
can result in asserts in xfs_ialloc_read_agi() and
xfs_alloc_read_agf() firing. The asserts look like this:
XFS: Assertion failed: pag->pagf_freeblks == be32_to_cpu(agf->agf_freeblks), file: fs/xfs/libxfs/xfs_alloc.c, line: 3440
.....
Call Trace:
<TASK>
xfs_alloc_read_agf+0x2ad/0x3a0
xfs_alloc_fix_freelist+0x280/0x720
xfs_alloc_vextent_prepare_ag+0x42/0x120
xfs_alloc_vextent_iterate_ags+0x67/0x260
xfs_alloc_vextent_start_ag+0xe4/0x1c0
xfs_bmapi_allocate+0x6fe/0xc90
xfs_bmapi_convert_delalloc+0x338/0x560
xfs_map_blocks+0x354/0x580
iomap_writepages+0x52b/0xa70
xfs_vm_writepages+0xd7/0x100
do_writepages+0xe1/0x2c0
__writeback_single_inode+0x44/0x340
writeback_sb_inodes+0x2d0/0x570
__writeback_inodes_wb+0x9c/0xf0
wb_writeback+0x139/0x2d0
wb_workfn+0x23e/0x4c0
process_scheduled_works+0x1d4/0x400
worker_thread+0x234/0x2e0
kthread+0x147/0x170
ret_from_fork+0x3e/0x50
ret_from_fork_asm+0x1a/0x30
I've seen the AGI variant from scrub running on the filesysetm
after unmount failed due to systemd interference:
XFS: Assertion failed: pag->pagi_freecount == be32_to_cpu(agi->agi_freecount) || xfs_is_shutdown(pag->pag_mount), file: fs/xfs/libxfs/xfs_ialloc.c, line: 2804
.....
Call Trace:
<TASK>
xfs_ialloc_read_agi+0xee/0x150
xchk_perag_drain_and_lock+0x7d/0x240
xchk_ag_init+0x34/0x90
xchk_inode_xref+0x7b/0x220
xchk_inode+0x14d/0x180
xfs_scrub_metadata+0x2e2/0x510
xfs_ioc_scrub_metadata+0x62/0xb0
xfs_file_ioctl+0x446/0xbf0
__se_sys_ioctl+0x6f/0xc0
__x64_sys_ioctl+0x1d/0x30
x64_sys_call+0x1879/0x2ee0
do_syscall_64+0x68/0x130
? exc_page_fault+0x62/0xc0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Essentially, it is the same problem. When _flakey_drop_and_remount()
loads the drop-writes table, it makes all writes silently fail. Writes
are reported to the fs as completed successfully, but they are not
issued to the backing store. The filesystem sees the successful
write completion and marks the metadata buffer clean and removes it
from the AIL.
If this happens at the same time as memory pressure is occuring,
the now-clean AGF and/or AGI buffers can be reclaimed from memory.
Shortly afterwards, but before _flakey_drop_and_remount() runs
unmount, background writeback is kicked and it tries to allocate
blocks for the dirty pages in memory. This then tries to access the
AGF buffer we just turfed out of memory. It's not found, so it gets
read in from disk.
This is all fine, except for the fact that the last writeback of the
AGF did not actually reach disk. The AGF on disk is stale compared
to the in-memory state held by the perag, and so they don't match
and the assert fires.
Then other operations on that inode hang because the task was killed
whilst holding inode locks. e.g:
Workqueue: xfs-conv/dm-12 xfs_end_io
Call Trace:
<TASK>
__schedule+0x650/0xb10
schedule+0x6d/0xf0
schedule_preempt_disabled+0x15/0x30
rwsem_down_write_slowpath+0x31a/0x5f0
down_write+0x43/0x60
xfs_ilock+0x1a8/0x210
xfs_trans_alloc_inode+0x9c/0x240
xfs_iomap_write_unwritten+0xe3/0x300
xfs_end_ioend+0x90/0x130
xfs_end_io+0xce/0x100
process_scheduled_works+0x1d4/0x400
worker_thread+0x234/0x2e0
kthread+0x147/0x170
ret_from_fork+0x3e/0x50
ret_from_fork_asm+0x1a/0x30
</TASK>
and it's all down hill from there.
Memory pressure is one way to trigger this, another is to run "echo
3 > /proc/sys/vm/drop_caches" randomly while tests are running.
Regardless of how it is triggered, this effectively takes down the
system once umount hangs because it's holding a sb->s_umount lock
exclusive and now every sync(1) call gets stuck on it.
Fix this by replacing the asserts with a corruption detection check
and a shutdown.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Lock order of xfs_ifree_cluster() is cluster buffer -> try ILOCK
-> IFLUSHING, except for the last inode in the cluster that is
triggering the free. In that case, the lock order is ILOCK ->
cluster buffer -> IFLUSHING.
xfs_iflush_cluster() uses cluster buffer -> try ILOCK -> IFLUSHING,
so this can safely run concurrently with xfs_ifree_cluster().
xfs_inode_item_precommit() uses ILOCK -> cluster buffer, but this
cannot race with xfs_ifree_cluster() so being in a different order
will not trigger a deadlock.
xfs_reclaim_inode() during a filesystem shutdown uses ILOCK ->
IFLUSHING -> cluster buffer via xfs_iflush_shutdown_abort(), and
this deadlocks against xfs_ifree_cluster() like so:
sysrq: Show Blocked State
task:kworker/10:37 state:D stack:12560 pid:276182 tgid:276182 ppid:2 flags:0x00004000
Workqueue: xfs-inodegc/dm-3 xfs_inodegc_worker
Call Trace:
<TASK>
__schedule+0x650/0xb10
schedule+0x6d/0xf0
schedule_timeout+0x8b/0x180
schedule_timeout_uninterruptible+0x1e/0x30
xfs_ifree+0x326/0x730
xfs_inactive_ifree+0xcb/0x230
xfs_inactive+0x2c8/0x380
xfs_inodegc_worker+0xaa/0x180
process_scheduled_works+0x1d4/0x400
worker_thread+0x234/0x2e0
kthread+0x147/0x170
ret_from_fork+0x3e/0x50
ret_from_fork_asm+0x1a/0x30
</TASK>
task:fsync-tester state:D stack:12160 pid:2255943 tgid:2255943 ppid:3988702 flags:0x00004006
Call Trace:
<TASK>
__schedule+0x650/0xb10
schedule+0x6d/0xf0
schedule_timeout+0x31/0x180
__down_common+0xbe/0x1f0
__down+0x1d/0x30
down+0x48/0x50
xfs_buf_lock+0x3d/0xe0
xfs_iflush_shutdown_abort+0x51/0x1e0
xfs_icwalk_ag+0x386/0x690
xfs_reclaim_inodes_nr+0x114/0x160
xfs_fs_free_cached_objects+0x19/0x20
super_cache_scan+0x17b/0x1a0
do_shrink_slab+0x180/0x350
shrink_slab+0xf8/0x430
drop_slab+0x97/0xf0
drop_caches_sysctl_handler+0x59/0xc0
proc_sys_call_handler+0x189/0x280
proc_sys_write+0x13/0x20
vfs_write+0x33d/0x3f0
ksys_write+0x7c/0xf0
__x64_sys_write+0x1b/0x30
x64_sys_call+0x271d/0x2ee0
do_syscall_64+0x68/0x130
entry_SYSCALL_64_after_hwframe+0x76/0x7e
We can't change the lock order of xfs_ifree_cluster() - XFS_ISTALE
and XFS_IFLUSHING are serialised through to journal IO completion
by the cluster buffer lock being held.
There's quite a few asserts in the code that check that XFS_ISTALE
does not occur out of sync with buffer locking (e.g. in
xfs_iflush_cluster). There's also a dependency on the inode log item
being removed from the buffer before XFS_IFLUSHING is cleared, also
with asserts that trigger on this.
Further, we don't have a requirement for the inode to be locked when
completing or aborting inode flushing because all the inode state
updates are serialised by holding the cluster buffer lock across the
IO to completion.
We can't check for XFS_IRECLAIM in xfs_ifree_mark_inode_stale() and
skip the inode, because there is no guarantee that the inode will be
reclaimed. Hence it *must* be marked XFS_ISTALE regardless of
whether reclaim is preparing to free that inode. Similarly, we can't
check for IFLUSHING before locking the inode because that would
result in dirty inodes not being marked with ISTALE in the event of
racing with XFS_IRECLAIM.
Hence we have to address this issue from the xfs_reclaim_inode()
side. It is clear that we cannot hold the inode locked here when
calling xfs_iflush_shutdown_abort() because it is the inode->buffer
lock order that causes the deadlock against xfs_ifree_cluster().
Hence we need to drop the ILOCK before aborting the inode in the
shutdown case. Once we've aborted the inode, we can grab the ILOCK
again and then immediately reclaim it as it is now guaranteed to be
clean.
Note that dropping the ILOCK in xfs_reclaim_inode() means that it
can now be locked by xfs_ifree_mark_inode_stale() and seen whilst in
this state. This is safe because we have left the XFS_IFLUSHING flag
on the inode and so xfs_ifree_mark_inode_stale() will simply set
XFS_ISTALE and move to the next inode. An ASSERT check in this path
needs to be tweaked to take into account this new shutdown
interaction.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmhd/noACgkQE6szbY3K
bnbZrg/+LSNkiwkPI/RExVcwG9I2hsGOcUUUuRiCEDnI6TI+uMt7hOmRJUt8ahQL
jArd2m0mdTus10T8Z7SGRIGgGnszrY5RzYK8I6dOP3SKPo0Wngksp1yv19jaxNGr
t106/I9IP9UJlN4EkmXMKbHEMn9vpYqxkbptYFlvjhBeEKvFFArVb+OOo7O1/mBH
/Q94Z3wsoJqFN5QedoGSqFoG+LmzCi95GZBEbV1vLld37e6uXhJvmLKDEiLinTEJ
BiVOYJwPd20lbHjGXHsdAh0CHX3sYQgm2RpTcwFl/aiPLFAbmgSE4JT7ySmZQuPM
KcxAMbS/Z5vpSmXS4yOivSUoT9J/VwFXGNjzhCTJtixBFTDYRbnFTG4CDsBvJpox
eZOrpGVyx8yg+CaEcVbgMhqrirt3ySEgnLYDj0HoXMB/tf/ACmOfhFft4/NKc7Or
mtF5WI0lrkAuPshMDXr/BRfPHxdB18GNuxWOzgMS3dyyvQzeMZsnvg7m9VfeK6OM
VgthGJEiThtM/ZUS16ZI3YRT72cjcwgxpw8mxSa6A3GM8+mKuJ3Ib3tPl7h7x5wW
BbNtZOgIiSwgxdbfqfdZQ0upz9iceGL0lr8C82Q3DoP5/NYE5ans3kBcSH5dNbEU
AULtqLk6850fNW7/foBW9lw3DDgBTcp8ROF/xt3qiPe4MywCt7s=
=mnLI
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2025-06-26' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
- Lots of small check/repair fixes, primarily in subvol loop and
directory structure loop (when involving snapshots).
- Fix a few 6.16 regressions: rare UAF in the foreground allocator path
when taking a transaction restart from the transaction bump
allocator, and some small fallout from the change to log the error
being corrected in the journal when repairing errors, also some
fallout from the btree node read error logging improvements.
(Alan, Bharadwaj)
- New option: journal_rewind
This lets the entire filesystem be reset to an earlier point in time.
Note that this is only a disaster recovery tool, and right now there
are major caveats to using it (discards should be disabled, in
particular), but it successfully restored the filesystem of one of
the users who was bit by the subvolume deletion bug and didn't have
backups. I'll likely be making some changes to the discard path in
the future to make this a reliable recovery tool.
- Some new btree iterator tracepoints, for tracking down some
livelock-ish behaviour we've been seeing in the main data write path.
* tag 'bcachefs-2025-06-26' of git://evilpiepirate.org/bcachefs: (51 commits)
bcachefs: Plumb correct ip to trans_relock_fail tracepoint
bcachefs: Ensure we rewind to run recovery passes
bcachefs: Ensure btree node scan runs before checking for scanned nodes
bcachefs: btree_root_unreadable_and_scan_found_nothing should not be autofix
bcachefs: fix bch2_journal_keys_peek_prev_min() underflow
bcachefs: Use wait_on_allocator() when allocating journal
bcachefs: Check for bad write buffer key when moving from journal
bcachefs: Don't unlock the trans if ret doesn't match BCH_ERR_operation_blocked
bcachefs: Fix range in bch2_lookup_indirect_extent() error path
bcachefs: fix spurious error_throw
bcachefs: Add missing bch2_err_class() to fileattr_set()
bcachefs: Add missing key type checks to check_snapshot_exists()
bcachefs: Don't log fsck err in the journal if doing repair elsewhere
bcachefs: Fix *__bch2_trans_subbuf_alloc() error path
bcachefs: Fix missing newlines before ero
bcachefs: fix spurious error in read_btree_roots()
bcachefs: fsck: Fix oops in key_visible_in_snapshot()
bcachefs: fsck: fix unhandled restart in topology repair
bcachefs: fsck: Fix check_directory_structure when no check_dirents
bcachefs: Fix restart handling in btree_node_scrub_work()
...
Allow the flexfiles error handling to recognise NFS level errors (as
opposed to RPC level errors) and handle them separately. The main
motivator is the NFSERR_PERM errors that get returned if the NFS client
connects to the data server through a port number that is lower than
1024. In that case, the client should disconnect and retry a READ on a
different data server, or it should retry a WRITE after reconnecting.
Reviewed-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
When performing a file read from RDMA, smbd_recv() prints an "Invalid msg
type 4" error and fails the I/O. This is due to the switch-statement there
not handling the ITER_FOLIOQ handed down from netfslib.
Fix this by collapsing smbd_recv_buf() and smbd_recv_page() into
smbd_recv() and just using copy_to_iter() instead of memcpy(). This
future-proofs the function too, in case more ITER_* types are added.
Fixes: ee4cdf7ba8 ("netfs: Speed up buffered reading")
Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Tom Talpey <tom@talpey.com>
cc: Paulo Alcantara (Red Hat) <pc@manguebit.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
The handling of received data in the smbdirect client code involves using
copy_to_iter() to copy data from the smbd_reponse struct's packet trailer
to a folioq buffer provided by netfslib that encapsulates a chunk of
pagecache.
If, however, CONFIG_HARDENED_USERCOPY=y, this will result in the checks
then performed in copy_to_iter() oopsing with something like the following:
CIFS: Attempting to mount //172.31.9.1/test
CIFS: VFS: RDMA transport established
usercopy: Kernel memory exposure attempt detected from SLUB object 'smbd_response_0000000091e24ea1' (offset 81, size 63)!
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:102!
...
RIP: 0010:usercopy_abort+0x6c/0x80
...
Call Trace:
<TASK>
__check_heap_object+0xe3/0x120
__check_object_size+0x4dc/0x6d0
smbd_recv+0x77f/0xfe0 [cifs]
cifs_readv_from_socket+0x276/0x8f0 [cifs]
cifs_read_from_socket+0xcd/0x120 [cifs]
cifs_demultiplex_thread+0x7e9/0x2d50 [cifs]
kthread+0x396/0x830
ret_from_fork+0x2b8/0x3b0
ret_from_fork_asm+0x1a/0x30
The problem is that the smbd_response slab's packet field isn't marked as
being permitted for usercopy.
Fix this by passing parameters to kmem_slab_create() to indicate that
copy_to_iter() is permitted from the packet region of the smbd_response
slab objects, less the header space.
Fixes: ee4cdf7ba8 ("netfs: Speed up buffered reading")
Reported-by: Stefan Metzmacher <metze@samba.org>
Link: https://lore.kernel.org/r/acb7f612-df26-4e2a-a35d-7cd040f513e1@samba.org/
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Tested-by: Stefan Metzmacher <metze@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix cifs_signal_cifsd_for_reconnect() to take the correct lock order
and prevent the following deadlock from happening
======================================================
WARNING: possible circular locking dependency detected
6.16.0-rc3-build2+ #1301 Tainted: G S W
------------------------------------------------------
cifsd/6055 is trying to acquire lock:
ffff88810ad56038 (&tcp_ses->srv_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x134/0x200
but task is already holding lock:
ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&ret_buf->chan_lock){+.+.}-{3:3}:
validate_chain+0x1cf/0x270
__lock_acquire+0x60e/0x780
lock_acquire.part.0+0xb4/0x1f0
_raw_spin_lock+0x2f/0x40
cifs_setup_session+0x81/0x4b0
cifs_get_smb_ses+0x771/0x900
cifs_mount_get_session+0x7e/0x170
cifs_mount+0x92/0x2d0
cifs_smb3_do_mount+0x161/0x460
smb3_get_tree+0x55/0x90
vfs_get_tree+0x46/0x180
do_new_mount+0x1b0/0x2e0
path_mount+0x6ee/0x740
do_mount+0x98/0xe0
__do_sys_mount+0x148/0x180
do_syscall_64+0xa4/0x260
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #1 (&ret_buf->ses_lock){+.+.}-{3:3}:
validate_chain+0x1cf/0x270
__lock_acquire+0x60e/0x780
lock_acquire.part.0+0xb4/0x1f0
_raw_spin_lock+0x2f/0x40
cifs_match_super+0x101/0x320
sget+0xab/0x270
cifs_smb3_do_mount+0x1e0/0x460
smb3_get_tree+0x55/0x90
vfs_get_tree+0x46/0x180
do_new_mount+0x1b0/0x2e0
path_mount+0x6ee/0x740
do_mount+0x98/0xe0
__do_sys_mount+0x148/0x180
do_syscall_64+0xa4/0x260
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #0 (&tcp_ses->srv_lock){+.+.}-{3:3}:
check_noncircular+0x95/0xc0
check_prev_add+0x115/0x2f0
validate_chain+0x1cf/0x270
__lock_acquire+0x60e/0x780
lock_acquire.part.0+0xb4/0x1f0
_raw_spin_lock+0x2f/0x40
cifs_signal_cifsd_for_reconnect+0x134/0x200
__cifs_reconnect+0x8f/0x500
cifs_handle_standard+0x112/0x280
cifs_demultiplex_thread+0x64d/0xbc0
kthread+0x2f7/0x310
ret_from_fork+0x2a/0x230
ret_from_fork_asm+0x1a/0x30
other info that might help us debug this:
Chain exists of:
&tcp_ses->srv_lock --> &ret_buf->ses_lock --> &ret_buf->chan_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&ret_buf->chan_lock);
lock(&ret_buf->ses_lock);
lock(&ret_buf->chan_lock);
lock(&tcp_ses->srv_lock);
*** DEADLOCK ***
3 locks held by cifsd/6055:
#0: ffffffff857de398 (&cifs_tcp_ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x7b/0x200
#1: ffff888119c64060 (&ret_buf->ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x9c/0x200
#2: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200
Cc: linux-cifs@vger.kernel.org
Reported-by: David Howells <dhowells@redhat.com>
Fixes: d7d7a66aac ("cifs: avoid use of global locks for high contention data")
Reviewed-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix a 6.16 regression from the recovery pass rework, which introduced a
bug where calling bch2_run_explicit_recovery_pass() would only return
the error code to rewind recovery for the first call that scheduled that
recovery pass.
If the error code from the first call was swallowed (because it was
called by an asynchronous codepath), subsequent calls would go "ok, this
pass is already marked as needing to run" and return 0.
Fixing this ensures that check_topology bails out to run btree_node_scan
before doing any repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, calling bch2_btree_has_scanned_nodes() when btree node
scan hadn't actually run would erroniously return false - causing us to
think a btree was entirely gone.
This fixes a 6.16 regression from moving the scheduling of btree node
scan out of bch2_btree_lost_data() (fixing the bug where we'd schedule
it persistently in the superblock) and only scheduling it when
check_toploogy() is asking for scanned btree nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaFx0bQAKCRBZ7Krx/gZQ
63yTAQC4NS7qopT8BQGn3aM+t8YjYo36BTeSRcSy4hVEAFrEJAD/WyW5Dcy1lWZR
S8g8rqRimsCepwxqTinYJlS7H8S56ws=
=CmGc
-----END PGP SIGNATURE-----
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount fixes from Al Viro:
"Several mount-related fixes"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
userns and mnt_idmap leak in open_tree_attr(2)
attach_recursive_mnt(): do not lock the covering tree when sliding something under it
replace collect_mounts()/drop_collected_mounts() with a safer variant
is_zero_pfn() does not work for the huge zero folio. Fix it by using
is_huge_zero_pmd().
This can cause the PAGEMAP_SCAN ioctl against /proc/pid/pagemap to
present pages as PAGE_IS_PRESENT rather than as PAGE_IS_PFNZERO.
Found by code inspection.
Link: https://lkml.kernel.org/r/20250617143532.2375383-1-david@redhat.com
Fixes: 52526ca7fd ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We should not send smbdirect_data_transfer messages larger than
the negotiated max_send_size, typically 1364 bytes, which means
24 bytes of the smbdirect_data_transfer header + 1340 payload bytes.
This happened when doing an SMB2 write with more than 1340 bytes
(which is done inline as it's below rdma_readwrite_threshold).
It means the peer resets the connection.
When testing between cifs.ko and ksmbd.ko something like this
is logged:
client:
CIFS: VFS: RDMA transport re-established
siw: got TERMINATE. layer 1, type 2, code 2
siw: got TERMINATE. layer 1, type 2, code 2
siw: got TERMINATE. layer 1, type 2, code 2
siw: got TERMINATE. layer 1, type 2, code 2
siw: got TERMINATE. layer 1, type 2, code 2
siw: got TERMINATE. layer 1, type 2, code 2
siw: got TERMINATE. layer 1, type 2, code 2
siw: got TERMINATE. layer 1, type 2, code 2
siw: got TERMINATE. layer 1, type 2, code 2
CIFS: VFS: \\carina Send error in SessSetup = -11
smb2_reconnect: 12 callbacks suppressed
CIFS: VFS: reconnect tcon failed rc = -11
CIFS: VFS: reconnect tcon failed rc = -11
CIFS: VFS: reconnect tcon failed rc = -11
CIFS: VFS: SMB: Zero rsize calculated, using minimum value 65536
and:
CIFS: VFS: RDMA transport re-established
siw: got TERMINATE. layer 1, type 2, code 2
CIFS: VFS: smbd_recv:1894 disconnected
siw: got TERMINATE. layer 1, type 2, code 2
The ksmbd dmesg is showing things like:
smb_direct: Recv error. status='local length error (1)' opcode=128
smb_direct: disconnected
smb_direct: Recv error. status='local length error (1)' opcode=128
ksmbd: smb_direct: disconnected
ksmbd: sock_read failed: -107
As smbd_post_send_iter() limits the transmitted number of bytes
we need loop over it in order to transmit the whole iter.
Reviewed-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Meetakshi Setiya <msetiya@microsoft.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: <stable+noautosel@kernel.org> # sp->max_send_size should be info->max_send_size in backports
Fixes: 3d78fe73fa ("cifs: Build the RDMA SGE list directly from an iterator")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Once want_mount_setattr() has returned a positive, it does require
finish_mount_kattr() to release ->mnt_userns. Failing do_mount_setattr()
does not change that.
As the result, we can end up leaking userns and possibly mnt_idmap as
well.
Fixes: c4a16820d9 ("fs: add open_tree_attr()")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This fixes a bug in commit 63c69ad3d1 ("fuse: refactor
fuse_fill_write_pages()") where max_pages << PAGE_SHIFT is mistakenly
used as the calculation for the max_pages upper limit but there's the
possibility that copy_folio_from_iter_atomic() may copy over bytes
from the iov_iter that are less than the full length of the folio,
which would lead to exceeding max_pages.
This commit fixes it by adding a 'ap->num_folios < max_folios' check.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/20250614000114.910380-1-joannelkoong@gmail.com
Fixes: 63c69ad3d1 ("fuse: refactor fuse_fill_write_pages()")
Tested-by: Brian Foster <bfoster@redhat.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Closes: https://lore.kernel.org/linux-fsdevel/aEq4haEQScwHIWK6@bfoster/
Signed-off-by: Christian Brauner <brauner@kernel.org>
Followup of commit 285975dd67 ("net: annotate data-races around
sk->sk_{rcv|snd}timeo").
Remove lock_sock()/release_sock() from ksmbd_tcp_rcv_timeout()
and add READ_ONCE()/WRITE_ONCE() where it is needed.
Also SO_RCVTIMEO_OLD and SO_RCVTIMEO_NEW can call sock_set_timeout()
without holding the socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250620155536.335520-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are two bug fixes: 1) double-unlock introduced by the recent folio
conversion, 2) stale page content beyond eof complained by xfstests/generic/363.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmhZmFYACgkQQBSofoJI
UNJUTA/+IvlKf5v7HQXo+1DIwvpwb2dQpYxdXNk1dtTmx3EbCuUazce/ETK5IwLV
25zcLL0EQpY2cyvDksZabtRy8Skfc4Oy7YVNA8XTiNVMtuxzYnB8m0fOGiosRyHO
lzLpmbfVYcv+YEVKMe88Ld1VK2laED2hYRQw4hLLtz+mt+mbBFRV9E2y3cxZt8Aq
pbTglIQ8y1ksc2IKWmk57/ccsB6rV97xBuG9xvIi/D6ve+0568SS3MLI8OoD3/P2
7GDJjHbnDRqi76q+4/LySj9JLs5rN9LDRPTjVIC8ap0gWv4kqZuAU1u+Si3YRglz
YWKn25poMyGNecPytl3k9a6xgMT3LTCSLtE7t4NYIy4v4hCoxj+AqxwqmQiigWGg
/gtIs+BS41KOX/+AxMDJipXoL0qt3ArcwByaumc6IgpF4E01wj2+MCONHjMoTGIB
In+QgDy/utS12N+zGcb5EwUBPWMGPAJmXVzF6DoKWlMdjlcmcYd6X0QvUZb0SGdS
LMjD6PzlAlTBE6qwAz2LQ6zqb/bdW6wsH8YX1AlaV0m46J2sgprNwbHMZi8poRU8
gSibfRitCC/F71cC8SPXTkZiedCbivuvitYyM6oAcR4ZB76Dlty+hGMiv29M2kxw
KAFe6OD/cAgsDkv7fMPshd5148YCb/hybu2LrcyeqZMNnptrhf8=
=suF5
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fixes from Jaegeuk Kim:
- fix double-unlock introduced by the recent folio conversion
- fix stale page content beyond EOF complained by xfstests/generic/363
* tag 'f2fs-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: fix to zero post-eof page
f2fs: Fix __write_node_folio() conversion
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmhZQ9MACgkQxWXV+ddt
WDsyvw/+K5N4zbig9D5QL5SdsQwMe/ZUk1KF0LLu6H3hFetdICeM/Z4K46EBh40X
c9Sxb13gLnIAm8DR/IFTTlOZVrrbJ3CTazZuJbncCpaZchH863aYb/1KboxjJnpW
KqOen20KdUh8HdevrJFhkFc7rOjp7KupfIHsbWqIxaWYPf8ORvUyK55lKxQz0HES
E5tFXLNr6z/8Ws5pc2HnRLgnRcCHuRUNJUb1PEaTfPKxoFvTwjda6cDsYnXOJEO9
NOnh6lluurqja+3FUEFig2f292/CbKGtByYUDgfhHO21P//IHSDhlouvwipzI/kh
6WUoH1K+DWCxxNbIVFFbUYLxrDGu7R7/aWFHH2q0dNjqQeiQBbUnbn4WIjAAwDWf
k9cmE+WgVqwQI+vpfG3eENUafG5MpcQQo2wKrxG0whWaC2fiA6QtI+3DfKyMj4XJ
JI1jUhfCwHrqzoGQ4XBE3UYENqQw9RICNC+Z3UfZx+5sQMWcb+ac5qIGygvCfU8N
Gtfx4ladZshpQUSuRneiLozxdxLyXX3LzCt2Ls1s5fPPikZft/+2QRu5rzSbb/Cp
50TDSn/pE1N/TEMVZaP5M2PxquBVDOZ4TFSsSm3IvceqFInm0UerAGaJ7+T2eZhM
3XHhIp6xTecHfwukvGqs+XSxB9PMLfF5M0gc+9PR+3oxzFRpowI=
=XLWR
-----END PGP SIGNATURE-----
Merge tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fixes:
- fix invalid inode pointer dereferences during log replay
- fix a race between renames and directory logging
- fix shutting down delayed iput worker
- fix device byte accounting when dropping chunk
- in zoned mode, fix offset calculations for DUP profile when
conventional and sequential zones are used together
Regression fixes:
- fix possible double unlock of extent buffer tree (xarray
conversion)
- in zoned mode, fix extent buffer refcount when writing out extents
(xarray conversion)
Error handling fixes and updates:
- handle unexpected extent type when replaying log
- check and warn if there are remaining delayed inodes when putting a
root
- fix assertion when building free space tree
- handle csum tree error with mount option 'rescue=ibadroot'
Other:
- error message updates: add prefix to all scrub related messages,
include other information in messages"
* tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: fix alloc_offset calculation for partly conventional block groups
btrfs: handle csum tree error with rescue=ibadroots correctly
btrfs: fix race between async reclaim worker and close_ctree()
btrfs: fix assertion when building free space tree
btrfs: don't silently ignore unexpected extent type when replaying log
btrfs: fix invalid inode pointer dereferences during log replay
btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb
btrfs: update superblock's device bytes_used when dropping chunk
btrfs: fix a race between renames and directory logging
btrfs: scrub: add prefix for the error messages
btrfs: warn if leaking delayed_nodes in btrfs_put_root()
btrfs: fix delayed ref refcount leak in debug assertion
btrfs: include root in error message when unlinking inode
btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails
If we are propagating across the userns boundary, we need to lock the
mounts added there. However, in case when something has already
been mounted there and we end up sliding a new tree under that,
the stuff that had been there before should not get locked.
IOW, lock_mnt_tree() should be called before we reparent the
preexisting tree on top of what we are adding.
Fixes: 3bd045cc9c ("separate copying and locking mount tree on cross-userns copies")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
collect_mounts() has several problems - one can't iterate over the results
directly, so it has to be done with callback passed to iterate_mounts();
it has an oopsable race with d_invalidate(); it creates temporary clones
of mounts invisibly for sync umount (IOW, you can have non-lazy umount
succeed leaving filesystem not mounted anywhere and yet still busy).
A saner approach is to give caller an array of struct path that would pin
every mount in a subtree, without cloning any mounts.
* collect_mounts()/drop_collected_mounts()/iterate_mounts() is gone
* collect_paths(where, preallocated, size) gives either ERR_PTR(-E...) or
a pointer to array of struct path, one for each chunk of tree visible under
'where' (i.e. the first element is a copy of where, followed by (mount,root)
for everything mounted under it - the same set collect_mounts() would give).
Unlike collect_mounts(), the mounts are *not* cloned - we just get pinning
references to the roots of subtrees in the caller's namespace.
Array is terminated by {NULL, NULL} struct path. If it fits into
preallocated array (on-stack, normally), that's where it goes; otherwise
it's allocated by kmalloc_array(). Passing 0 as size means that 'preallocated'
is ignored (and expected to be NULL).
* drop_collected_paths(paths, preallocated) is given the array returned
by an earlier call of collect_paths() and the preallocated array passed to that
call. All mount/dentry references are dropped and array is kfree'd if it's not
equal to 'preallocated'.
* instead of iterate_mounts(), users should just iterate over array
of struct path - nothing exotic is needed for that. Existing users (all in
audit_tree.c) are converted.
[folded a fix for braino reported by Venkat Rao Bagalkote <venkat88@linux.ibm.com>]
Fixes: 80b5dce8c5 ("vfs: Add a function to lazily unmount all mounts from any dentry")
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We found a few different systems hung up in writeback waiting on the same
page lock, and one task waiting on the NFS_LAYOUT_DRAIN bit in
pnfs_update_layout(), however the pnfs_layout_hdr's plh_outstanding count
was zero.
It seems most likely that this is another race between the waiter and waker
similar to commit ed0172af5d ("SUNRPC: Fix a race to wake a sync task").
Fix it up by applying the advised barrier.
Fixes: 880265c77a ("pNFS: Avoid a live lock condition in pnfs_update_layout()")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Some users and customers reported that their backup/copy tools started
to fail when the directory being copied contained symlink targets that
the client couldn't parse - even when those symlinks weren't followed.
Fix this by allowing lstat(2) and readlink(2) to succeed even when the
client can't resolve the symlink target, restoring old behavior.
Cc: linux-cifs@vger.kernel.org
Cc: stable@vger.kernel.org
Reported-by: Remy Monsen <monsen@monsen.cc>
Closes: https://lore.kernel.org/r/CAN+tdP7y=jqw3pBndZAGjQv0ObFq8Q=+PUDHgB36HdEz9QA6FQ@mail.gmail.com
Reported-by: Pierguido Lambri <plambri@redhat.com>
Fixes: 12b466eb52 ("cifs: Fix creating and resolving absolute NT-style symlinks")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Export anon_inode_make_secure_inode() to allow KVM guest_memfd to create
anonymous inodes with proper security context. This replaces the current
pattern of calling alloc_anon_inode() followed by
inode_init_security_anon() for creating security context manually.
This change also fixes a security regression in secretmem where the
S_PRIVATE flag was not cleared after alloc_anon_inode(), causing
LSM/SELinux checks to be bypassed for secretmem file descriptors.
As guest_memfd currently resides in the KVM module, we need to export this
symbol for use outside the core kernel. In the future, guest_memfd might be
moved to core-mm, at which point the symbols no longer would have to be
exported. When/if that happens is still unclear.
Fixes: 2bfe15c526 ("mm: create security context for memfd_secret inodes")
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Shivank Garg <shivankg@amd.com>
Link: https://lore.kernel.org/20250620070328.803704-3-shivankg@amd.com
Acked-by: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
alternatives-patched doesn't get mishandled by out-of-order modifications,
leading to it overflowing and causing page faults when patching
- Avoid an infinite loop when early code does a ranged TLB invalidation before
the broadcast TLB invalidation count of how many pages it can flush, has
been read from CPUID
- Fix a CONFIG_MODULES typo
- Disable broadcast TLB invalidation when PTI is enabled to avoid an overflow
of the bitmap tracking dynamic ASIDs which need to be flushed when the
kernel switches between the user and kernel address space
- Handle the case of a CPU going offline and thus reporting zeroes when
reading top-level events in the resctrl code
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhXwaMACgkQEsHwGGHe
VUr6OQ//ThzgO6GdONGcLgTdiQ/c3Ug9tUf3k3uUeZhGp7MypEZgwBzHcU6jbkcX
XYOdV5wJw+eetyj1/ueJqXBvQRWox6lOT064RZDwmmtJCRcSSAOxa5lbrHfnlgcd
y9j7ml5cS/NzIfldiSynlE523SiYz2EOV0x0kY3xGkHdOT98VJHOpahgaaWmizvQ
MEnv1AosYFGlwonq5kGu0vV8V4N0VRIe7/FjAxSuSNLUHwVdACqX9oyAdYbyw09p
oNZjpozBJBNe70Wkg35ijQKlNGKkngwnPILmRgjZwHDH2FIRzrr6CqmHx6z336b+
FkThRmVzJZfhtQVFM9V/prBB3GW8j/zIm61j67iz2yJW1Biz3xJKnUT+s7jd4ARr
gGBAB+41UoaPsVKrrSQYlCsrFkUWn6b4AFT+l1eQDgnuc+ajX+OyHHXC+qVrBUbK
lL80EoYyPrEMh8pP9wx7hdtVCvOTOSSSdFPe7RrnaT7satoBI/Aemz9tZqVzmkCs
48iLZ4Xbwswo8U9mKIwap3IksG43LvDLWK1ydw+U6OzBuYGmu148JcSBKO3G1uhm
YkRvmykNTxKO62mki8Wc86VP5IB5KxBcS6nO6GAGT3vsKPrkzi2GkqvaDi6Z0bEL
1kjIaN/XKG4OQglba/DWoRLBXmd9LtsVrCJ1QGuY+ejYUJYLLMA=
=NT0q
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Make sure the array tracking which kernel text positions need to be
alternatives-patched doesn't get mishandled by out-of-order
modifications, leading to it overflowing and causing page faults when
patching
- Avoid an infinite loop when early code does a ranged TLB invalidation
before the broadcast TLB invalidation count of how many pages it can
flush, has been read from CPUID
- Fix a CONFIG_MODULES typo
- Disable broadcast TLB invalidation when PTI is enabled to avoid an
overflow of the bitmap tracking dynamic ASIDs which need to be
flushed when the kernel switches between the user and kernel address
space
- Handle the case of a CPU going offline and thus reporting zeroes when
reading top-level events in the resctrl code
* tag 'x86_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternatives: Fix int3 handling failure from broken text_poke array
x86/mm: Fix early boot use of INVPLGB
x86/its: Fix an ifdef typo in its_alloc()
x86/mm: Disable INVLPGB when PTI is enabled
x86,fs/resctrl: Remove inappropriate references to cacheinfo in the resctrl subsystem
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmhW2HgACgkQiiy9cAdy
T1GOKQwAtXGEBIPsMMIDUISCFRVMFShbDdROFZHEUqGLM3Ere/tzahj1hF+zFC+I
um0h6swEyYod72Ay4Ltkfupk6++6loonSQibZiXZMJRQzJywhPd/d+W6TyE512bf
uyCYd0/DvC7hvkDXlxSrukDdJ/qbBggdoIMYI/SU9sZepURc+nl5Dud9sT3dXXxr
1+tr6hRorOUr6Tot7oPnJNHIUvFmRSo+aYLQbHzheYLwBYn3b7yq8Ep9txDr6z+v
WYaxMsy9ukxCgaaPKsZmGBrD0D4wE+MzOgPA+f5zZhxOGpYQF+GLbovabyynqPff
XyUY2NPMECQTyZaCpiq0ydORCkusgqyCEK+GazHjebuKosutHOzUyOvsZBvKS1ww
dQW4Ukjn3ge1eXeRuS1BwvBRD2okB2ODyelG4QDzw93Jbh/jwuhCxyw0Cc76Umps
mOaeCujd93X/hpBwA+KYPxlpzUnf1TzhOkP2RUWyKKuibE8WDcUyJVEnMa3mC3NR
aPmK/F07
=vZhQ
-----END PGP SIGNATURE-----
Merge tag 'v6.16-rc2-smb3-client-fixes-v2' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Multichannel channel allocation fix for Kerberos mounts
- Two reconnect fixes
- Fix netfs_writepages crash with smbdirect/RDMA
- Directory caching fix
- Three minor cleanup fixes
- Log error when close cached dirs fails
* tag 'v6.16-rc2-smb3-client-fixes-v2' of git://git.samba.org/sfrench/cifs-2.6:
smb: minor fix to use SMB2_NTLMV2_SESSKEY_SIZE for auth_key size
smb: minor fix to use sizeof to initialize flags_string buffer
smb: Use loff_t for directory position in cached_dirents
smb: Log an error when close_all_cached_dirs fails
cifs: Fix prepare_write to negotiate wsize if needed
smb: client: fix max_sge overflow in smb_extract_folioq_to_rdma()
smb: client: fix first command failure during re-negotiation
cifs: Remove duplicate fattr->cf_dtype assignment from wsl_to_fattr() function
smb: fix secondary channel creation issue with kerberos by populating hostname when adding channels
Before calling bch2_indirect_extent_missing_error(), we have to
calculate the missing range, which is the intersection of the reflink
pointer and the non-indirect-extent we found.
The calculation didn't take into account that the returned extent may
span the iter position, leading to an infinite loop when we
(unnecessarily) resized the extent we were returning to one that didn't
extend past the offset we were looking up.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Two fixes for commits in the nfsd-6.16 merge
- One fix for the recently-added NFSD netlink facility
- One fix for a remote SunRPC crasher
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmhWykQACgkQM2qzM29m
f5cfIA/+MtqpF8iHCNWBk2mmgQHtfS78qhQexEsQ69F1k90+bzCJ9Mu7AZT67xRZ
2E6Fm3gyJuFPaJ9YOGewyD6OV/eE0SioPQteZIIwUVclKcoyVtqJpc57A/aQfMJm
Te1bPWYrc5jfxoF/QzkOjzd4KWXOej2nIqsybFIDK5uTe7gn9QpPiFL8ePRuf7bJ
5oWBTZPpmuWpDcebQrz56OT0bvvrCNH9EV9TZNYwVa/B7yl3UNHwM2YB5O9m2Rl7
GUeWytc2Y7WgdDJe9poPwNrYyXFb8JX4pCbS54q5wCRLYLDgo8PsgsFCAFfgjx0l
ArATKDhzfYK0azASDtJma2eCJkrm4/nXRNFON7xucN79pYRh9GChcmhhfubb8Yo3
HopUaCSLOgy7PNsbbz3lNREn5uBpomeK1Mb88/UkJ/d8fmKGIO1ITTlarazYv2w5
GgmzZSBS43VdUmipqNyddIMg2uy7uaeAQZpr4jHVFVctPypjXCn/zIrhG2mmjRhW
M7PCJqtFj0a7DM48LA6r9K8N4Yi7EoSQKNAKfwduE++XXYWgBk9blaF/7om/fIRJ
VJm/RfiuFuyGaxCLd7WpPpASeS/nq410BMV5/kxol+UgiCxk8xjsvU7y8OWLOl7S
rUend5Yei7/h0NGoSmQPnBoYnb1YuTeMMzjfaRWSsV5gbvVDejQ=
=1C8u
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Two fixes for commits in the nfsd-6.16 merge
- One fix for the recently-added NFSD netlink facility
- One fix for a remote SunRPC crasher
* tag 'nfsd-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
sunrpc: handle SVC_GARBAGE during svc auth processing as auth error
nfsd: use threads array as-is in netlink interface
SUNRPC: Cleanup/fix initial rq_pages allocation
NFSD: Avoid corruption of a referring call list
Replaced hardcoded value 16 with SMB2_NTLMV2_SESSKEY_SIZE
in the auth_key definition and memcpy call.
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Replaced hardcoded length with sizeof(flags_string).
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Change the pos field in struct cached_dirents from int to loff_t
to support large directory offsets. This avoids overflow and
matches kernel conventions for directory positions.
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Under low-memory conditions, close_all_cached_dirs() can't move the
dentries to a separate list to dput() them once the locks are dropped.
This will result in a "Dentry still in use" error, so add an error
message that makes it clear this is what happened:
[ 495.281119] CIFS: VFS: \\otters.example.com\share Out of memory while dropping dentries
[ 495.281595] ------------[ cut here ]------------
[ 495.281887] BUG: Dentry ffff888115531138{i=78,n=/} still in use (2) [unmount of cifs cifs]
[ 495.282391] WARNING: CPU: 1 PID: 2329 at fs/dcache.c:1536 umount_check+0xc8/0xf0
Also, bail out of looping through all tcons as soon as a single
allocation fails, since we're already in trouble, and kmalloc() attempts
for subseqeuent tcons are likely to fail just like the first one did.
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Acked-by: Bharath SM <bharathsm@microsoft.com>
Suggested-by: Ruben Devos <rdevos@oxya.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix cifs_prepare_write() to negotiate the wsize if it is unset.
Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
after fabc4ed200f9, server_unresponsive add a condition to check whether client
need to reconnect depending on server->lstrp. When client failed to reconnect
for some time and abort connection, server->lstrp is updated for the last time.
In the following scene, server->lstrp is too old. This cause next command
failure in re-negotiation rather than waiting for re-negotiation done.
1. mount -t cifs -o username=Everyone,echo_internal=10 //$server_ip/export /mnt
2. ssh $server_ip "echo b > /proc/sysrq-trigger &"
3. ls /mnt
4. sleep 21s
5. ssh $server_ip "service firewalld stop"
6. ls # return EHOSTDOWN
If the interval between 5 and 6 is too small, 6 may trigger sending negotiation
request. Before backgrounding cifsd thread try to receive negotiation response
from server in cifs_readv_from_socket, server_unresponsive may trigger
cifs_reconnect which cause 6 to be failed:
ls thread
----------------
smb2_negotiate
server->tcpStatus = CifsInNegotiate
compound_send_recv
wait_for_compound_request
cifsd thread
----------------
cifs_readv_from_socket
server_unresponsive
server->tcpStatus == CifsInNegotiate && jiffies > server->lstrp + 20s
cifs_reconnect
cifs_abort_connection: mid_state = MID_RETRY_NEEDED
ls thread
----------------
cifs_sync_mid_result return EAGAIN
smb2_negotiate return EHOSTDOWN
Though server->lstrp means last server response time, it is updated in
cifs_abort_connection and cifs_get_tcp_session. We can also update server->lstrp
before switching into CifsInNegotiate state to avoid failure in 6.
Fixes: 7ccc146546 ("smb: client: fix hang in wait_for_response() for negproto")
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Acked-by: Meetakshi Setiya <msetiya@microsoft.com>
Signed-off-by: zhangjian <zhangjian496@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
For now we only have one key type in these btrees, but forward
compatibility means we do have to check.
Reported-by: syzbot+b4cb4a6988aced0cec4b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes exceeding the bump allocator limit when the allocator finds
many buckets that need repair - they're repaired asynchronously, which
means that every error logged a message in the bump allocator, without
committing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Don't change buf->size on error - this would usually be a transaction
restart, but it could also be -ENOMEM - when we've exceeded the bump
allocator max).
Fixes: 247abee6ae ("bcachefs: btree_trans_subbuf")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The old nfsdfs interface for starting a server with multiple pools
handles the special case of a single entry array passed down from
userland by distributing the threads over every NUMA node.
The netlink control interface however constructs an array of length
nfsd_nrpools() and fills any unprovided slots with 0's. This behavior
defeats the special casing that the old interface relies on.
Change nfsd_nl_threads_set_doit() to pass down the array from userland
as-is.
Fixes: 7f5c330b26 ("nfsd: allow passing in array of thread counts via netlink")
Cc: stable@vger.kernel.org
Reported-by: Mike Snitzer <snitzer@kernel.org>
Closes: https://lore.kernel.org/linux-nfs/aDC-ftnzhJAlwqwh@kernel.org/
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When one of two zones composing a DUP block group is a conventional zone,
we have the zone_info[i]->alloc_offset = WP_CONVENTIONAL. That will, of
course, not match the write pointer of the other zone, and fails that
block group.
This commit solves that issue by properly recovering the emulated write
pointer from the last allocated extent. The offset for the SINGLE, DUP,
and RAID1 are straight-forward: it is same as the end of last allocated
extent. The RAID0 and RAID10 are a bit tricky that we need to do the math
of striping.
This is the kernel equivalent of Naohiro's user-space commit:
"btrfs-progs: zoned: fix alloc_offset calculation for partly
conventional block groups".
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is syzbot based reproducer that can crash the kernel, with the
following call trace: (With some debug output added)
DEBUG: rescue=ibadroots parsed
BTRFS: device fsid 14d642db-7b15-43e4-81e6-4b8fac6a25f8 devid 1 transid 8 /dev/loop0 (7:0) scanned by repro (1010)
BTRFS info (device loop0): first mount of filesystem 14d642db-7b15-43e4-81e6-4b8fac6a25f8
BTRFS info (device loop0): using blake2b (blake2b-256-generic) checksum algorithm
BTRFS info (device loop0): using free-space-tree
BTRFS warning (device loop0): checksum verify failed on logical 5312512 mirror 1 wanted 0xb043382657aede36608fd3386d6b001692ff406164733d94e2d9a180412c6003 found 0x810ceb2bacb7f0f9eb2bf3b2b15c02af867cb35ad450898169f3b1f0bd818651 level 0
DEBUG: read tree root path failed for tree csum, ret=-5
BTRFS warning (device loop0): checksum verify failed on logical 5328896 mirror 1 wanted 0x51be4e8b303da58e6340226815b70e3a93592dac3f30dd510c7517454de8567a found 0x51be4e8b303da58e634022a315b70e3a93592dac3f30dd510c7517454de8567a level 0
BTRFS warning (device loop0): checksum verify failed on logical 5292032 mirror 1 wanted 0x1924ccd683be9efc2fa98582ef58760e3848e9043db8649ee382681e220cdee4 found 0x0cb6184f6e8799d9f8cb335dccd1d1832da1071d12290dab3b85b587ecacca6e level 0
process 'repro' launched './file2' with NULL argv: empty string added
DEBUG: no csum root, idatacsums=0 ibadroots=134217728
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000041: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000208-0x000000000000020f]
CPU: 5 UID: 0 PID: 1010 Comm: repro Tainted: G OE 6.15.0-custom+ #249 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:btrfs_lookup_csum+0x93/0x3d0 [btrfs]
Call Trace:
<TASK>
btrfs_lookup_bio_sums+0x47a/0xdf0 [btrfs]
btrfs_submit_bbio+0x43e/0x1a80 [btrfs]
submit_one_bio+0xde/0x160 [btrfs]
btrfs_readahead+0x498/0x6a0 [btrfs]
read_pages+0x1c3/0xb20
page_cache_ra_order+0x4b5/0xc20
filemap_get_pages+0x2d3/0x19e0
filemap_read+0x314/0xde0
__kernel_read+0x35b/0x900
bprm_execve+0x62e/0x1140
do_execveat_common.isra.0+0x3fc/0x520
__x64_sys_execveat+0xdc/0x130
do_syscall_64+0x54/0x1d0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
---[ end trace 0000000000000000 ]---
[CAUSE]
Firstly the fs has a corrupted csum tree root, thus to mount the fs we
have to go "ro,rescue=ibadroots" mount option.
Normally with that mount option, a bad csum tree root should set
BTRFS_FS_STATE_NO_DATA_CSUMS flag, so that any future data read will
ignore csum search.
But in this particular case, we have the following call trace that
caused NULL csum root, but not setting BTRFS_FS_STATE_NO_DATA_CSUMS:
load_global_roots_objectid():
ret = btrfs_search_slot();
/* Succeeded */
btrfs_item_key_to_cpu()
found = true;
/* We found the root item for csum tree. */
root = read_tree_root_path();
if (IS_ERR(root)) {
if (!btrfs_test_opt(fs_info, IGNOREBADROOTS))
/*
* Since we have rescue=ibadroots mount option,
* @ret is still 0.
*/
break;
if (!found || ret) {
/* @found is true, @ret is 0, error handling for csum
* tree is skipped.
*/
}
This means we completely skipped to set BTRFS_FS_STATE_NO_DATA_CSUMS if
the csum tree is corrupted, which results unexpected later csum lookup.
[FIX]
If read_tree_root_path() failed, always populate @ret to the error
number.
As at the end of the function, we need @ret to determine if we need to
do the extra error handling for csum tree.
Fixes: abed4aaae4 ("btrfs: track the csum, extent, and free space trees in a rb tree")
Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com>
Reported-by: Longxing Li <coregee2000@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Syzbot reported an assertion failure due to an attempt to add a delayed
iput after we have set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info
state:
WARNING: CPU: 0 PID: 65 at fs/btrfs/inode.c:3420 btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420
Modules linked in:
CPU: 0 UID: 0 PID: 65 Comm: kworker/u8:4 Not tainted 6.15.0-next-20250530-syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
Workqueue: btrfs-endio-write btrfs_work_helper
RIP: 0010:btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420
Code: 4e ad 5d (...)
RSP: 0018:ffffc9000213f780 EFLAGS: 00010293
RAX: ffffffff83c635b7 RBX: ffff888058920000 RCX: ffff88801c769e00
RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000
RBP: 0000000000000001 R08: ffff888058921b67 R09: 1ffff1100b12436c
R10: dffffc0000000000 R11: ffffed100b12436d R12: 0000000000000001
R13: dffffc0000000000 R14: ffff88807d748000 R15: 0000000000000100
FS: 0000000000000000(0000) GS:ffff888125c53000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002000000bd038 CR3: 000000006a142000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
btrfs_put_ordered_extent+0x19f/0x470 fs/btrfs/ordered-data.c:635
btrfs_finish_one_ordered+0x11d8/0x1b10 fs/btrfs/inode.c:3312
btrfs_work_helper+0x399/0xc20 fs/btrfs/async-thread.c:312
process_one_work kernel/workqueue.c:3238 [inline]
process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402
kthread+0x70e/0x8a0 kernel/kthread.c:464
ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
This can happen due to a race with the async reclaim worker like this:
1) The async metadata reclaim worker enters shrink_delalloc(), which calls
btrfs_start_delalloc_roots() with an nr_pages argument that has a value
less than LONG_MAX, and that in turn enters start_delalloc_inodes(),
which sets the local variable 'full_flush' to false because
wbc->nr_to_write is less than LONG_MAX;
2) There it finds inode X in a root's delalloc list, grabs a reference for
inode X (with igrab()), and triggers writeback for it with
filemap_fdatawrite_wbc(), which creates an ordered extent for inode X;
3) The unmount sequence starts from another task, we enter close_ctree()
and we flush the workqueue fs_info->endio_write_workers, which waits
for the ordered extent for inode X to complete and when dropping the
last reference of the ordered extent, with btrfs_put_ordered_extent(),
when we call btrfs_add_delayed_iput() we don't add the inode to the
list of delayed iputs because it has a refcount of 2, so we decrement
it to 1 and return;
4) Shortly after at close_ctree() we call btrfs_run_delayed_iputs() which
runs all delayed iputs, and then we set BTRFS_FS_STATE_NO_DELAYED_IPUT
in the fs_info state;
5) The async reclaim worker, after calling filemap_fdatawrite_wbc(), now
calls btrfs_add_delayed_iput() for inode X and there we trigger an
assertion failure since the fs_info state has the flag
BTRFS_FS_STATE_NO_DELAYED_IPUT set.
Fix this by setting BTRFS_FS_STATE_NO_DELAYED_IPUT only after we wait for
the async reclaim workers to finish, after we call cancel_work_sync() for
them at close_ctree(), and by running delayed iputs after wait for the
reclaim workers to finish and before setting the bit.
This race was recently introduced by commit 19e60b2a95 ("btrfs: add
extra warning if delayed iput is added when it's not allowed"). Without
the new validation at btrfs_add_delayed_iput(), this described scenario
was safe because close_ctree() later calls btrfs_commit_super(). That
will run any final delayed iputs added by reclaim workers in the window
between the btrfs_run_delayed_iputs() and the the reclaim workers being
shut down.
Reported-by: syzbot+0ed30ad435bf6f5b7a42@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6840481c.a00a0220.d4325.000c.GAE@google.com/T/#u
Fixes: 19e60b2a95 ("btrfs: add extra warning if delayed iput is added when it's not allowed")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When building the free space tree with the block group tree feature
enabled, we can hit an assertion failure like this:
BTRFS info (device loop0 state M): rebuilding free space tree
assertion failed: ret == 0, in fs/btrfs/free-space-tree.c:1102
------------[ cut here ]------------
kernel BUG at fs/btrfs/free-space-tree.c:1102!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
Modules linked in:
CPU: 1 UID: 0 PID: 6592 Comm: syz-executor322 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102
lr : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102
sp : ffff8000a4ce7600
x29: ffff8000a4ce76e0 x28: ffff0000c9bc6000 x27: ffff0000ddfff3d8
x26: ffff0000ddfff378 x25: dfff800000000000 x24: 0000000000000001
x23: ffff8000a4ce7660 x22: ffff70001499cecc x21: ffff0000e1d8c160
x20: ffff0000e1cb7800 x19: ffff0000e1d8c0b0 x18: 00000000ffffffff
x17: ffff800092f39000 x16: ffff80008ad27e48 x15: ffff700011e740c0
x14: 1ffff00011e740c0 x13: 0000000000000004 x12: ffffffffffffffff
x11: ffff700011e740c0 x10: 0000000000ff0100 x9 : 94ef24f55d2dbc00
x8 : 94ef24f55d2dbc00 x7 : 0000000000000001 x6 : 0000000000000001
x5 : ffff8000a4ce6f98 x4 : ffff80008f415ba0 x3 : ffff800080548ef0
x2 : 0000000000000000 x1 : 0000000100000000 x0 : 000000000000003e
Call trace:
populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 (P)
btrfs_rebuild_free_space_tree+0x14c/0x54c fs/btrfs/free-space-tree.c:1337
btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074
btrfs_remount_rw fs/btrfs/super.c:1319 [inline]
btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543
reconfigure_super+0x1d4/0x6f0 fs/super.c:1083
do_remount fs/namespace.c:3365 [inline]
path_mount+0xb34/0xde0 fs/namespace.c:4200
do_mount fs/namespace.c:4221 [inline]
__do_sys_mount fs/namespace.c:4432 [inline]
__se_sys_mount fs/namespace.c:4409 [inline]
__arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767
el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600
Code: f0047182 91178042 528089c3 9771d47b (d4210000)
---[ end trace 0000000000000000 ]---
This happens because we are processing an empty block group, which has
no extents allocated from it, there are no items for this block group,
including the block group item since block group items are stored in a
dedicated tree when using the block group tree feature. It also means
this is the block group with the highest start offset, so there are no
higher keys in the extent root, hence btrfs_search_slot_for_read()
returns 1 (no higher key found).
Fix this by asserting 'ret' is 0 only if the block group tree feature
is not enabled, in which case we should find a block group item for
the block group since it's stored in the extent root and block group
item keys are greater than extent item keys (the value for
BTRFS_BLOCK_GROUP_ITEM_KEY is 192 and for BTRFS_EXTENT_ITEM_KEY and
BTRFS_METADATA_ITEM_KEY the values are 168 and 169 respectively).
In case 'ret' is 1, we just need to add a record to the free space
tree which spans the whole block group, and we can achieve this by
making 'ret == 0' as the while loop's condition.
Reported-by: syzbot+36fae25c35159a763a2a@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6841dca8.a00a0220.d4325.0020.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If there's an unexpected (invalid) extent type, we just silently ignore
it. This means a corruption or some bug somewhere, so instead return
-EUCLEAN to the caller, making log replay fail, and print an error message
with relevant information.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In a few places where we call read_one_inode(), if we get a NULL pointer
we end up jumping into an error path, or fallthrough in case of
__add_inode_ref(), where we then do something like this:
iput(&inode->vfs_inode);
which results in an invalid inode pointer that triggers an invalid memory
access, resulting in a crash.
Fix this by making sure we don't do such dereferences.
Fixes: b4c50cbb01 ("btrfs: return a btrfs_inode from read_one_inode()")
CC: stable@vger.kernel.org # 6.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we break out of the loop because an extent buffer doesn't have the bit
EXTENT_BUFFER_TREE_REF set, we end up unlocking the xarray twice, once
before we tested for the bit and break out of the loop, and once again
after the loop.
Fix this by testing the bit and exiting before unlocking the xarray.
The time spent testing the bit is negligible and it's not worth trying
to do that outside the critical section delimited by the xarray lock due
to the code complexity required to avoid it (like using a local boolean
variable to track whether the xarray is locked or not). The xarray unlock
only needs to be done before calling release_extent_buffer(), as that
needs to lock the xarray (through xa_cmpxchg_irq()) and does a more
significant amount of work.
Fixes: 19d7f65f03 ("btrfs: convert the buffer_radix to an xarray")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/linux-btrfs/aDRNDU0GM1_D4Xnw@stanley.mountain/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Each superblock contains a copy of the device item for that device. In a
transaction which drops a chunk but doesn't create any new ones, we were
correctly updating the device item in the chunk tree but not copying
over the new bytes_used value to the superblock.
This can be seen by doing the following:
# dd if=/dev/zero of=test bs=4096 count=2621440
# mkfs.btrfs test
# mount test /root/temp
# cd /root/temp
# for i in {00..10}; do dd if=/dev/zero of=$i bs=4096 count=32768; done
# sync
# rm *
# sync
# btrfs balance start -dusage=0 .
# sync
# cd
# umount /root/temp
# btrfs check test
For btrfs-check to detect this, you will also need my patch at
https://github.com/kdave/btrfs-progs/pull/991.
Change btrfs_remove_dev_extents() so that it adds the devices to the
fs_info->post_commit_list if they're not there already. This causes
btrfs_commit_device_sizes() to be called, which updates the bytes_used
value in the superblock.
Fixes: bbbf7243d6 ("btrfs: combine device update operations during transaction commit")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a race between a rename and directory inode logging that if it
happens and we crash/power fail before the rename completes, the next time
the filesystem is mounted, the log replay code will end up deleting the
file that was being renamed.
This is best explained following a step by step analysis of an interleaving
of steps that lead into this situation.
Consider the initial conditions:
1) We are at transaction N;
2) We have directories A and B created in a past transaction (< N);
3) We have inode X corresponding to a file that has 2 hardlinks, one in
directory A and the other in directory B, so we'll name them as
"A/foo_link1" and "B/foo_link2". Both hard links were persisted in a
past transaction (< N);
4) We have inode Y corresponding to a file that as a single hard link and
is located in directory A, we'll name it as "A/bar". This file was also
persisted in a past transaction (< N).
The steps leading to a file loss are the following and for all of them we
are under transaction N:
1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field
is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir();
2) Task A starts a rename for inode Y, with the goal of renaming from
"A/bar" to "A/baz", so we enter btrfs_rename();
3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling
btrfs_insert_inode_ref();
4) Because the rename happens in the same directory, we don't set the
last_unlink_trans field of directoty A's inode to the current
transaction id, that is, we don't cal btrfs_record_unlink_dir();
5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY
and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode()
(actually the dir index item is added as a delayed item, but the
effect is the same);
6) Now before task A adds the new entry "A/baz" to directory A by
calling btrfs_add_link(), another task, task B is logging inode X;
7) Task B starts a fsync of inode X and after logging inode X, at
btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since
inode X has a last_unlink_trans value of N, set at in step 1;
8) At btrfs_log_all_parents() we search for all parent directories of
inode X using the commit root, so we find directories A and B and log
them. Bu when logging direct A, we don't have a dir index item for
inode Y anymore, neither the old name "A/bar" nor for the new name
"A/baz" since the rename has deleted the old name but has not yet
inserted the new name - task A hasn't called yet btrfs_add_link() to
do that.
Note that logging directory A doesn't fallback to a transaction
commit because its last_unlink_trans has a lower value than the
current transaction's id (see step 4);
9) Task B finishes logging directories A and B and gets back to
btrfs_sync_file() where it calls btrfs_sync_log() to persist the log
tree;
10) Task B successfully persisted the log tree, btrfs_sync_log() completed
with success, and a power failure happened.
We have a log tree without any directory entry for inode Y, so the
log replay code deletes the entry for inode Y, name "A/bar", from the
subvolume tree since it doesn't exist in the log tree and the log
tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY
item that covers the index range for the dentry that corresponds to
"A/bar").
Since there's no other hard link for inode Y and the log replay code
deletes the name "A/bar", the file is lost.
The issue wouldn't happen if task B synced the log only after task A
called btrfs_log_new_name(), which would update the log with the new name
for inode Y ("A/bar").
Fix this by pinning the log root during renames before removing the old
directory entry, and unpinning after btrfs_log_new_name() is called.
Fixes: 259c4b96d7 ("btrfs: stop doing unnecessary log updates during a rename")
CC: stable@vger.kernel.org # 5.18+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a "scrub: " prefix to all messages logged by scrub so that it's
easy to filter them from dmesg for analysis.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a warning for leaked delayed_nodes when putting a root. We currently
do this for inodes, but not delayed_nodes.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Remove the changelog from the commit message. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the delayed_root is not empty we are increasing the number of
references to a delayed_node without decreasing it, causing a leak. Fix
by decrementing the delayed_node reference count.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Remove the changelog from the commit message. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To help debugging include the root number in the error message, and since
this is a critical error that implies a metadata inconsistency and results
in a transaction abort change the log message level from "info" to
"critical", which is a much better fit.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmhR3jgACgkQiiy9cAdy
T1HWigv+JNzSJxgriJZMDHJcY3rJij01eyzewEkwzx04+vjZIRNpmnY//eC7u1MM
T8qxLpb7nsQO487H0L8Kf1MDWnpOXa0Ilh8eP8dSs3SlxF8DTAmIHWXGhtl01McJ
j0zJqpr9JPTuKQ6QkgweA7i98hdu27xl7qHLymB8co6BaoVSojBku2EZe/prGxBt
bCXrfmlB+BOfW90zJpD9wQAHKB/TZx6nL5PNsyvIBJFIe84T1VtGrr/G2h5T70pM
S7o+buoqXsiXoNbGneeBB3abvYhmEVlh0CKxCma54ZUQnMvnFHdXWQlk9m6pxuFJ
XD78/xjDIHGYLY+4uoAvJ6dJuleBO5YK70SSrz37E3LKHtezWT3Ibj8l27dzwch4
bOSb1tTGiQkPApOX0DwZ4IEtNFiNcJOY2q4NaTSN99JO3vryMT2dIEh3XNqCYcmU
1bCA7e91+nl8lo4W+8RF+NVjLjN83FJzpDR84Qml/3DVPBG6eO4eiheUGbHVXLiy
U/8czelb
=igfH
-----END PGP SIGNATURE-----
Merge tag '6.16-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- Fix alternate data streams bug
- Important fix for null pointer deref with Kerberos authentication
- Fix oops in smbdirect (RDMA) in free_transport
* tag '6.16-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: handle set/get info file for streamed file
ksmbd: fix null pointer dereference in destroy_previous_session
ksmbd: add free_transport ops in ksmbd connection
fstest reports a f2fs bug:
generic/363 42s ... [failed, exit status 1]- output mismatch (see /share/git/fstests/results//generic/363.out.bad)
--- tests/generic/363.out 2025-01-12 21:57:40.271440542 +0800
+++ /share/git/fstests/results//generic/363.out.bad 2025-05-19 19:55:58.000000000 +0800
@@ -1,2 +1,78 @@
QA output created by 363
fsx -q -S 0 -e 1 -N 100000
+READ BAD DATA: offset = 0xd6fb, size = 0xf044, fname = /mnt/f2fs/junk
+OFFSET GOOD BAD RANGE
+0x1540d 0x0000 0x2a25 0x0
+operation# (mod 256) for the bad data may be 37
+0x1540e 0x0000 0x2527 0x1
...
(Run 'diff -u /share/git/fstests/tests/generic/363.out /share/git/fstests/results//generic/363.out.bad' to see the entire diff)
Ran: generic/363
Failures: generic/363
Failed 1 of 1 tests
The root cause is user can update post-eof page via mmap [1], however, f2fs
missed to zero post-eof page in below operations, so, once it expands i_size,
then it will include dummy data locates previous post-eof page, so during
below operations, we need to zero post-eof page.
Operations which can include dummy data after previous i_size after expanding
i_size:
- write
- mapwrite [1]
- truncate
- fallocate
* preallocate
* zero_range
* insert_range
* collapse_range
- clone_range (doesn’t support in f2fs)
- copy_range (doesn’t support in f2fs)
[1] https://man7.org/linux/man-pages/man2/mmap.2.html 'BUG section'
Cc: stable@kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 8bd25b61c5 ("smb: client: set correct d_type for reparse DFS/DFSR
and mount point") deduplicated assignment of fattr->cf_dtype member from
all places to end of the function cifs_reparse_point_to_fattr(). The only
one missing place which was not deduplicated is wsl_to_fattr(). Fix it.
Fixes: 8bd25b61c5 ("smb: client: set correct d_type for reparse DFS/DFSR and mount point")
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When mounting a share with kerberos authentication with multichannel
support, share mounts correctly, but fails to create secondary
channels. This occurs because the hostname is not populated when
adding the channels. The hostname is necessary for the userspace
cifs.upcall program to retrieve the required credentials and pass
it back to kernel, without hostname secondary channels fails
establish.
Cc: stable@vger.kernel.org
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Reported-by: xfuren <xfuren@gmail.com>
Link: https://bugzilla.samba.org/show_bug.cgi?id=15824
Signed-off-by: Steve French <stfrench@microsoft.com>
Previously, file operations on a file-backed mount used the current
process' credentials to access the backing FD. Attempting to do so on
Android lead to SELinux denials, as ACL rules on the backing file (e.g.
/system/apex/foo.apex) is restricted to a small set of process.
Arguably, this error is redundant and leaking implementation details, as
access to files on a mount is already ACL'ed by path.
Instead, override to use the opener's cred when accessing the backing
file. This makes the behavior similar to a loop-backed mount, which
uses kworker cred when accessing the backing file and does not cause
SELinux denials.
Signed-off-by: Tatsuyuki Ishi <ishitatsuyuki@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://lore.kernel.org/r/20250612-b4-erofs-impersonate-v1-1-8ea7d6f65171@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The normal fsck code doesn't call key_visible_in_snapshot() with an
empty list of snapshot IDs seen (the current snapshot ID will always be
on the list), but str_hash_repair_key() ->
bch2_get_snapshot_overwrites() can, and that's totally fine as long as
we check for it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>