2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

8552 Commits

Author SHA1 Message Date
Linus Torvalds
155a3c003e - dm-bufio: fix scheduling in atomic
-----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCaHU0NhQcbXBhdG9ja2FA
 cmVkaGF0LmNvbQAKCRATAyx9YGnhbYuhAP9E3m1AlDYfwP1ZOwv0FGXBVtiGFlNw
 n9HMdwmNBbiMXQD+MxhLAPfly1oot4qUHy7akqK39ANkwlWLDZgpAcI2dA0=
 =kh23
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fix from Mikulas Patocka:

 - dm-bufio: fix scheduling in atomic

* tag 'for-6.16/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm-bufio: fix sched in atomic context
2025-07-14 19:25:28 -07:00
Linus Torvalds
40f92e79b0 block-6.16-20250710
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhwb8QQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuDTEAC5J4noilx4TRpKQ0gp3cF9KHvB2wAD0ry8
 1Y45lccZGrdUyWBnB7KyvIKUHt4MVk5Lw4d3vkv1Shx6XesW35hbCOI2W7UPsMsL
 nEBYJcroNNKlTlx9TJazVs0xmjF6G7JwaYXD6CVNLkjAQXxdeGst2Or15vhD4soz
 3nmwFAyP3sEU7ESRNZ53UaNaM2KW0BBNef+jcFn9MOdSZcilePY7ckh74JzCc9Oc
 GIcH0eTRDdfPi3TteLu/2VMNjpogX+9LY41r3laSKwSgEcYmj+pPFLuqjU6A82hg
 dT8FWJLR+GuUWTs9B7FuWWpmk7uwOPrIadSQo2DcTdiBSvBYuGv+0BPIxq1kfykn
 cUjresj49q2hNAjBK71iEDycZR+W+pn864r1mJg+8pASoKKyNX2/3iTvQj57RwFO
 phoICyxr37WxCYQMcTXYwPcYD8BnF7mTJIDYDFti4BY1w/dUwlSvsbfI9Zk1rxIH
 VumZzML0nhfTbEsq+QVOTZ6bq0hn71EmVONLamM1LdaoAh6PZU2CMduJuPjEgCNz
 I73xzc4MlshOZYBidiq++1yFnRX64pB6jPi2omu31PjXMd1ZZaZWENh0OGFp4zHX
 8yCmJoWTs8BXy5v74tJWxMTvShfMlqFBuTQlRexh1kala4IdngQrZjO6vBU2pw2C
 4orH43oFdA==
 =m0St
 -----END PGP SIGNATURE-----

Merge tag 'block-6.16-20250710' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - MD changes via Yu:
     - fix UAF due to stack memory used for bio mempool (Jinchao)
     - fix raid10/raid1 nowait IO error path (Nigel and Qixing)
     - fix kernel crash from reading bitmap sysfs entry (Håkon)

 - Fix for a UAF in the nbd connect error path

 - Fix for blocksize being bigger than pagesize, if THP isn't enabled

* tag 'block-6.16-20250710' of git://git.kernel.dk/linux:
  block: reject bs > ps block devices when THP is disabled
  nbd: fix uaf in nbd_genl_connect() error path
  md/md-bitmap: fix GPF in bitmap_get_stats()
  md/raid1,raid10: strip REQ_NOWAIT from member bios
  raid10: cleanup memleak at raid10_make_request
  md/raid1: Fix stack memory use after return in raid1_reshape
2025-07-11 10:35:54 -07:00
Sheng Yong
b1bf1a782f dm-bufio: fix sched in atomic context
If "try_verify_in_tasklet" is set for dm-verity, DM_BUFIO_CLIENT_NO_SLEEP
is enabled for dm-bufio. However, when bufio tries to evict buffers, there
is a chance to trigger scheduling in spin_lock_bh, the following warning
is hit:

BUG: sleeping function called from invalid context at drivers/md/dm-bufio.c:2745
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 123, name: kworker/2:2
preempt_count: 201, expected: 0
RCU nest depth: 0, expected: 0
4 locks held by kworker/2:2/123:
 #0: ffff88800a2d1548 ((wq_completion)dm_bufio_cache){....}-{0:0}, at: process_one_work+0xe46/0x1970
 #1: ffffc90000d97d20 ((work_completion)(&dm_bufio_replacement_work)){....}-{0:0}, at: process_one_work+0x763/0x1970
 #2: ffffffff8555b528 (dm_bufio_clients_lock){....}-{3:3}, at: do_global_cleanup+0x1ce/0x710
 #3: ffff88801d5820b8 (&c->spinlock){....}-{2:2}, at: do_global_cleanup+0x2a5/0x710
Preemption disabled at:
[<0000000000000000>] 0x0
CPU: 2 UID: 0 PID: 123 Comm: kworker/2:2 Not tainted 6.16.0-rc3-g90548c634bd0 #305 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: dm_bufio_cache do_global_cleanup
Call Trace:
 <TASK>
 dump_stack_lvl+0x53/0x70
 __might_resched+0x360/0x4e0
 do_global_cleanup+0x2f5/0x710
 process_one_work+0x7db/0x1970
 worker_thread+0x518/0xea0
 kthread+0x359/0x690
 ret_from_fork+0xf3/0x1b0
 ret_from_fork_asm+0x1a/0x30
 </TASK>

That can be reproduced by:

  veritysetup format --data-block-size=4096 --hash-block-size=4096 /dev/vda /dev/vdb
  SIZE=$(blockdev --getsz /dev/vda)
  dmsetup create myverity -r --table "0 $SIZE verity 1 /dev/vda /dev/vdb 4096 4096 <data_blocks> 1 sha256 <root_hash> <salt> 1 try_verify_in_tasklet"
  mount /dev/dm-0 /mnt -o ro
  echo 102400 > /sys/module/dm_bufio/parameters/max_cache_size_bytes
  [read files in /mnt]

Cc: stable@vger.kernel.org	# v6.4+
Fixes: 450e8dee51 ("dm bufio: improve concurrent IO performance")
Signed-off-by: Wang Shuai <wangshuai12@xiaomi.com>
Signed-off-by: Sheng Yong <shengyong1@xiaomi.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-07-10 16:48:50 +02:00
Håkon Bugge
c17fb542db md/md-bitmap: fix GPF in bitmap_get_stats()
The commit message of commit 6ec1f02394 ("md/md-bitmap: fix stats
collection for external bitmaps") states:

    Remove the external bitmap check as the statistics should be
    available regardless of bitmap storage location.

    Return -EINVAL only for invalid bitmap with no storage (neither in
    superblock nor in external file).

But, the code does not adhere to the above, as it does only check for
a valid super-block for "internal" bitmaps. Hence, we observe:

Oops: GPF, probably for non-canonical address 0x1cd66f1f40000028
RIP: 0010:bitmap_get_stats+0x45/0xd0
Call Trace:

 seq_read_iter+0x2b9/0x46a
 seq_read+0x12f/0x180
 proc_reg_read+0x57/0xb0
 vfs_read+0xf6/0x380
 ksys_read+0x6d/0xf0
 do_syscall_64+0x8c/0x1b0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

We fix this by checking the existence of a super-block for both the
internal and external case.

Fixes: 6ec1f02394 ("md/md-bitmap: fix stats collection for external bitmaps")
Cc: stable@vger.kernel.org
Reported-by: Gerald Gibson <gerald.gibson@oracle.com>
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Link: https://lore.kernel.org/linux-raid/20250702091035.2061312-1-haakon.bugge@oracle.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-07-05 19:36:50 +08:00
Zheng Qixing
5fa31c4992 md/raid1,raid10: strip REQ_NOWAIT from member bios
RAID layers don't implement proper non-blocking semantics for
REQ_NOWAIT, making the flag potentially misleading when propagated
to member disks.

This patch clear REQ_NOWAIT from cloned bios in raid1/raid10. Retain
original bio's REQ_NOWAIT flag for upper layer error handling.

Maybe we can implement non-blocking I/O handling mechanisms within
RAID in future work.

Fixes: 9f346f7d4e ("md/raid1,raid10: don't handle IO error for
REQ_RAHEAD and REQ_NOWAIT")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250702102341.1969154-1-zhengqixing@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-07-05 19:33:46 +08:00
Nigel Croxon
43806c3d5b raid10: cleanup memleak at raid10_make_request
If raid10_read_request or raid10_write_request registers a new
request and the REQ_NOWAIT flag is set, the code does not
free the malloc from the mempool.

unreferenced object 0xffff8884802c3200 (size 192):
   comm "fio", pid 9197, jiffies 4298078271
   hex dump (first 32 bytes):
     00 00 00 00 00 00 00 00 88 41 02 00 00 00 00 00  .........A......
     08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
   backtrace (crc c1a049a2):
     __kmalloc+0x2bb/0x450
     mempool_alloc+0x11b/0x320
     raid10_make_request+0x19e/0x650 [raid10]
     md_handle_request+0x3b3/0x9e0
     __submit_bio+0x394/0x560
     __submit_bio_noacct+0x145/0x530
     submit_bio_noacct_nocheck+0x682/0x830
     __blkdev_direct_IO_async+0x4dc/0x6b0
     blkdev_read_iter+0x1e5/0x3b0
     __io_read+0x230/0x1110
     io_read+0x13/0x30
     io_issue_sqe+0x134/0x1180
     io_submit_sqes+0x48c/0xe90
     __do_sys_io_uring_enter+0x574/0x8b0
     do_syscall_64+0x5c/0xe0
     entry_SYSCALL_64_after_hwframe+0x76/0x7e

V4: changing backing tree to see if CKI tests will pass.
The patch code has not changed between any versions.

Fixes: c9aa889b03 ("md: raid10 add nowait support")
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Link: https://lore.kernel.org/linux-raid/c0787379-9caa-42f3-b5fc-369aed784400@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-07-05 19:30:41 +08:00
Wang Jinchao
d67ed2ccd2 md/raid1: Fix stack memory use after return in raid1_reshape
In the raid1_reshape function, newpool is
allocated on the stack and assigned to conf->r1bio_pool.
This results in conf->r1bio_pool.wait.head pointing
to a stack address.
Accessing this address later can lead to a kernel panic.

Example access path:

raid1_reshape()
{
	// newpool is on the stack
	mempool_t newpool, oldpool;
	// initialize newpool.wait.head to stack address
	mempool_init(&newpool, ...);
	conf->r1bio_pool = newpool;
}

raid1_read_request() or raid1_write_request()
{
	alloc_r1bio()
	{
		mempool_alloc()
		{
			// if pool->alloc fails
			remove_element()
			{
				--pool->curr_nr;
			}
		}
	}
}

mempool_free()
{
	if (pool->curr_nr < pool->min_nr) {
		// pool->wait.head is a stack address
		// wake_up() will try to access this invalid address
		// which leads to a kernel panic
		return;
		wake_up(&pool->wait);
	}
}

Fix:
reinit conf->r1bio_pool.wait after assigning newpool.

Fixes: afeee514ce ("md: convert to bioset_init()/mempool_init()")
Signed-off-by: Wang Jinchao <wangjinchao600@gmail.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250612112901.3023950-1-wangjinchao600@gmail.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-07-05 19:17:37 +08:00
Linus Torvalds
78f4e737a5 - dm-crypt: fix a crash on 32-bit machines
- dm-raid: replace "rdev" with correct loop variable name "r"
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCaFl3iRQcbXBhdG9ja2FA
 cmVkaGF0LmNvbQAKCRATAyx9YGnhbcRMAP92ueTp0NFJr9dJne79HbhpJkBAS+b+
 25/qycKPv2XDfwD/c3/e3sBOhTIK8PohFR7lR62NepdfrOFVaaKubmNUlAU=
 =FD8P
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mikulas Patocka:

 - dm-crypt: fix a crash on 32-bit machines

 - dm-raid: replace "rdev" with correct loop variable name "r"

* tag 'for-6.16/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm-raid: fix variable in journal device check
  dm-crypt: Extend state buffer size in crypt_iv_lmk_one
2025-06-23 15:02:57 -07:00
Heinz Mauelshagen
db53805156 dm-raid: fix variable in journal device check
Replace "rdev" with correct loop variable name "r".

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 63c32ed4af ("dm raid: add raid4/5/6 journaling support")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-06-23 16:42:37 +02:00
Herbert Xu
b872f562c8 dm-crypt: Extend state buffer size in crypt_iv_lmk_one
Add a macro CRYPTO_MD5_STATESIZE for the Crypto API export state
size of md5 and use that in dm-crypt instead of relying on the
size of struct md5_state (the latter is currently undergoing a
transition and may shrink).

This commit fixes a crash on 32-bit machines:
Oops: Oops: 0000 [#1] SMP
CPU: 1 UID: 0 PID: 12 Comm: kworker/u16:0 Not tainted 6.16.0-rc2+ #993 PREEMPT(full)
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Workqueue: kcryptd-254:0-1 kcryptd_crypt [dm_crypt]
EIP: __crypto_shash_export+0xf/0x90
Code: 4a c1 c7 40 20 a0 b4 4a c1 81 cf 0e 00 04 08 89 78 50 e9 2b ff ff ff 8d 74 26 00 55 89 e5 57 56 53 89 c3 89 d6 8b 00 8b 40 14 <8b> 50 fc f6 40 13 01 74 04 4a 2b 50 14 85 c9 74 10 89 f2 89 d8 ff
EAX: 303a3435 EBX: c3007c90 ECX: 00000000 EDX: c3007c38
ESI: c3007c38 EDI: c3007c90 EBP: c3007bfc ESP: c3007bf0
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010216
CR0: 80050033 CR2: 303a3431 CR3: 04fbe000 CR4: 00350e90
Call Trace:
 crypto_shash_export+0x65/0xc0
 crypt_iv_lmk_one+0x106/0x1a0 [dm_crypt]

Fixes: efd62c8552 ("crypto: md5-generic - Use API partial block handling")
Reported-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Milan Broz <gmazyland@gmail.com>
Closes: https://lore.kernel.org/linux-crypto/f1625ddc-e82e-4b77-80c2-dc8e45b54848@gmail.com/T/
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-06-23 13:50:02 +02:00
Kuan-Wei Chiu
95b2e31e17 bcache: remove unnecessary select MIN_HEAP
After reverting the transition to the generic min heap library, bcache no
longer depends on MIN_HEAP.  The select entry can be removed to reduce
code size and shrink the kernel's attack surface.

This change effectively reverts the bcache-related part of commit
92a8b224b8 ("lib/min_heap: introduce non-inline versions of min heap API
functions").

This is part of a series of changes to address a performance regression
caused by the use of the generic min_heap implementation.

As reported by Robert, bcache now suffers from latency spikes, with P100
(max) latency increasing from 600 ms to 2.4 seconds every 5 minutes. 
These regressions degrade bcache's effectiveness as a low-latency cache
layer and lead to frequent timeouts and application stalls in production
environments.

Link: https://lore.kernel.org/lkml/CAJhEC05+0S69z+3+FB2Cd0hD+pCRyWTKLEOsc8BOmH73p1m+KQ@mail.gmail.com
Link: https://lkml.kernel.org/r/20250614202353.1632957-4-visitorckw@gmail.com
Fixes: 866898efbb ("bcache: remove heap-related macros and switch to generic min_heap")
Fixes: 92a8b224b8 ("lib/min_heap: introduce non-inline versions of min heap API functions")
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reported-by: Robert Pang <robertpang@google.com>
Closes: https://lore.kernel.org/linux-bcache/CAJhEC06F_AtrPgw2-7CvCqZgeStgCtitbD-ryuPpXQA-JG5XXw@mail.gmail.com
Acked-by: Coly Li <colyli@kernel.org>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19 20:48:03 -07:00
Kuan-Wei Chiu
48fd7ebe00 Revert "bcache: remove heap-related macros and switch to generic min_heap"
This reverts commit 866898efbb.

The generic bottom-up min_heap implementation causes performance
regression in invalidate_buckets_lru(), a hot path in bcache.  Before the
cache is fully populated, new_bucket_prio() often returns zero, leading to
many equal comparisons.  In such cases, bottom-up sift_down performs up to
2 * log2(n) comparisons, while the original top-down approach completes
with just O() comparisons, resulting in a measurable performance gap.

The performance degradation is further worsened by the non-inlined
min_heap API functions introduced in commit 92a8b224b8 ("lib/min_heap:
introduce non-inline versions of min heap API functions"), adding function
call overhead to this critical path.

As reported by Robert, bcache now suffers from latency spikes, with P100
(max) latency increasing from 600 ms to 2.4 seconds every 5 minutes. 
These regressions degrade bcache's effectiveness as a low-latency cache
layer and lead to frequent timeouts and application stalls in production
environments.

This revert aims to restore bcache's original low-latency behavior.

Link: https://lore.kernel.org/lkml/CAJhEC05+0S69z+3+FB2Cd0hD+pCRyWTKLEOsc8BOmH73p1m+KQ@mail.gmail.com
Link: https://lkml.kernel.org/r/20250614202353.1632957-3-visitorckw@gmail.com
Fixes: 866898efbb ("bcache: remove heap-related macros and switch to generic min_heap")
Fixes: 92a8b224b8 ("lib/min_heap: introduce non-inline versions of min heap API functions")
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reported-by: Robert Pang <robertpang@google.com>
Closes: https://lore.kernel.org/linux-bcache/CAJhEC06F_AtrPgw2-7CvCqZgeStgCtitbD-ryuPpXQA-JG5XXw@mail.gmail.com
Acked-by: Coly Li <colyli@kernel.org>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19 20:48:03 -07:00
Kuan-Wei Chiu
845f1f2d69 Revert "bcache: update min_heap_callbacks to use default builtin swap"
Patch series "bcache: Revert min_heap migration due to performance
regression".

This patch series reverts the migration of bcache from its original heap
implementation to the generic min_heap library.  While the original change
aimed to simplify the code and improve maintainability, it introduced a
severe performance regression in real-world scenarios.

As reported by Robert, systems using bcache now suffer from periodic
latency spikes, with P100 (max) latency increasing from 600 ms to 2.4
seconds every 5 minutes.  This degrades bcache's value as a low-latency
caching layer, and leads to frequent timeouts and application stalls in
production environments.

The primary cause of this regression is the behavior of the generic
min_heap implementation's bottom-up sift_down, which performs up to 2 *
log2(n) comparisons when many elements are equal.  The original top-down
variant used by bcache only required O(1) comparisons in such cases.  The
issue was further exacerbated by commit 92a8b224b8 ("lib/min_heap:
introduce non-inline versions of min heap API functions"), which
introduced non-inlined versions of the min_heap API, adding function call
overhead to a performance-critical hot path.


This patch (of 3):

This reverts commit 3d8a9a1c35.

Although removing the custom swap function simplified the code, this
change is part of a broader migration to the generic min_heap API that
introduced significant performance regressions in bcache.

As reported by Robert, bcache now suffers from latency spikes, with P100
(max) latency increasing from 600 ms to 2.4 seconds every 5 minutes. 
These regressions degrade bcache's effectiveness as a low-latency cache
layer and lead to frequent timeouts and application stalls in production
environments.

This revert is part of a series of changes to restore previous performance
by undoing the min_heap transition.

Link: https://lkml.kernel.org/r/20250614202353.1632957-1-visitorckw@gmail.com
Link: https://lore.kernel.org/lkml/CAJhEC05+0S69z+3+FB2Cd0hD+pCRyWTKLEOsc8BOmH73p1m+KQ@mail.gmail.com
Link: https://lkml.kernel.org/r/20250614202353.1632957-2-visitorckw@gmail.com
Fixes: 866898efbb ("bcache: remove heap-related macros and switch to generic min_heap")
Fixes: 92a8b224b8 ("lib/min_heap: introduce non-inline versions of min heap API functions")
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reported-by: Robert Pang <robertpang@google.com>
Closes: https://lore.kernel.org/linux-bcache/CAJhEC06F_AtrPgw2-7CvCqZgeStgCtitbD-ryuPpXQA-JG5XXw@mail.gmail.com
Acked-by: Coly Li <colyli@kernel.org>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19 20:48:02 -07:00
Ingo Molnar
41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
Linus Torvalds
6d8854216e block-6.16-20250606
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhC7/UQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgps+6D/9BOhkMyMkUF9LAev4PBNE+x3aftjl7Y1AY
 EHv2vozb4nDwXIaalG4qGUhprz2+z+hqxYjmnlOAsqbixhcSzKK5z9rjxyDka776
 x03vfvKXaXZUG7XN7ENY8sJnLx4QJ0nh4+0gzT9yDyq2vKvPFLEKweNOxKDKCSbE
 31vGoLFwjltp74hX+Qrnj1KMaTLgvAaV0eXKWlbX7Iiw6GFVm200zb27gth6U8bV
 WQAmjSkFQ0daHtAWmXIVy7hrXiCqe8D6YPKvBXnQ4cfKVbgG0HHDuTmQLpKGzfMi
 rr24MU5vZjt6OsYalceiTtifSUcf/I2+iFV7HswOk9kpOY5A2ylsWawRP2mm4PDI
 nJE3LaSTRpEvs5kzPJ2kr8Zp4/uvF6ehSq8Y9w52JekmOzxusLcRcswezaO00EI0
 32uuK+P505EGTcCBTrEdtaI6k7zzQEeVoIpxqvMhNRG/s5vzvIV3eVrALu2HSDma
 P3paEdx7PwJla3ndmdChfh1vUR3TW3gWoZvoNCVmJzNCnLEAScTS2NsiQeEjy8zs
 20IGsrRgIqt9KR8GZ2zj1ZOM47Cg0dIU3pbbA2Ja71wx4TYXJCSFFRK7mzDtXYlY
 BWOix/Dks8tk118cwuxnT+IiwmWDMbDZKnygh+4tiSyrs0IszeekRADLUu03C0Ve
 Dhpljqf3zA==
 =gs32
 -----END PGP SIGNATURE-----

Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - NVMe pull request via Christoph:
      - TCP error handling fix (Shin'ichiro Kawasaki)
      - TCP I/O stall handling fixes (Hannes Reinecke)
      - fix command limits status code (Keith Busch)
      - support vectored buffers also for passthrough (Pavel Begunkov)
      - spelling fixes (Yi Zhang)

 - MD pull request via Yu:
      - fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10
      - fix max_write_behind setting for dm-raid
      - some minor cleanups

 - Integrity data direction fix and cleanup

 - bcache NULL pointer fix

 - Fix for loop missing write start/end handling

 - Decouple hardware queues and IO threads in ublk

 - Slew of ublk selftests additions and updates

* tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits)
  nvme: spelling fixes
  nvme-tcp: fix I/O stalls on congested sockets
  nvme-tcp: sanitize request list handling
  nvme-tcp: remove tag set when second admin queue config fails
  nvme: enable vectored registered bufs for passthrough cmds
  nvme: fix implicit bool to flags conversion
  nvme: fix command limits status code
  selftests: ublk: kublk: improve behavior on init failure
  block: flip iter directions in blk_rq_integrity_map_user()
  block: drop direction param from bio_integrity_copy_user()
  selftests: ublk: cover PER_IO_DAEMON in more stress tests
  Documentation: ublk: document UBLK_F_PER_IO_DAEMON
  selftests: ublk: add stress test for per io daemons
  selftests: ublk: add functional test for per io daemons
  selftests: ublk: kublk: decouple ublk_queues from ublk server threads
  selftests: ublk: kublk: move per-thread data out of ublk_queue
  selftests: ublk: kublk: lift queue initialization out of thread
  selftests: ublk: kublk: tie sqe allocation to io instead of queue
  selftests: ublk: kublk: plumb q_id in io_uring user_data
  ublk: have a per-io daemon instead of a per-queue daemon
  ...
2025-06-06 13:12:50 -07:00
Linus Torvalds
3c727285f1 - dm: better error handling when reloading a table
- dm-delay: don't busy-wait in kthread
 
 - dm: use use generic disable_* functions instead of open coding them
 
 - dm: lock queue limits when reading them
 
 - dm-verity: use softirq context only when !need_resched()
 
 - dm-bufio: remove maximum age based eviction
 
 - dm: remove unneeded kvfree from alloc_targets
 
 - dm-flakey: various fixes
 
 - dm-mpath: interface for explicit probing of active paths
 
 - dm: fix BLK_FEAT_ATOMIC_WRITES
 
 - dm: pass through operations on wrapped inline crypto keys
 
 - dm vdo indexer: don't read request structure after enqueuing
 
 - dm-zone: Use bdev_*() helper functions where applicable
 
 - dm-mpath: replace spin_lock_irqsave with spin_lock_irq
 
 - dm-mirror: fix a tiny race condition
 
 - dm-verity: fix a memory leak if some arguments are specified multiple times
 
 - dm-stripe: small code cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCaD8vFBQcbXBhdG9ja2FA
 cmVkaGF0LmNvbQAKCRATAyx9YGnhbYGFAQC/V62PzDUa326WSdvwhtYe6jphInlW
 ZSmh37L4MIcV2wD/S8UqIaC9GSakee6jEWBTRiDqNZNEOIWhbd7f6gnTOQ0=
 =eYUe
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mikulas Patocka:

 - better error handling when reloading a table

 - use use generic disable_* functions instead of open coding them

 - lock queue limits when reading them

 - remove unneeded kvfree from alloc_targets

 - fix BLK_FEAT_ATOMIC_WRITES

 - pass through operations on wrapped inline crypto keys

 - dm-verity:
     - use softirq context only when !need_resched()
     - fix a memory leak if some arguments are specified multiple times

 - dm-mpath:
    - interface for explicit probing of active paths
    - replace spin_lock_irqsave with spin_lock_irq

 - dm-delay: don't busy-wait in kthread

 - dm-bufio: remove maximum age based eviction

 - dm-flakey: various fixes

 - vdo indexer: don't read request structure after enqueuing

 - dm-zone: Use bdev_*() helper functions where applicable

 - dm-mirror: fix a tiny race condition

 - dm-stripe: small code cleanup

* tag 'for-6.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits)
  dm-stripe: small code cleanup
  dm-verity: fix a memory leak if some arguments are specified multiple times
  dm-mirror: fix a tiny race condition
  dm-table: check BLK_FEAT_ATOMIC_WRITES inside limits_lock
  dm mpath: replace spin_lock_irqsave with spin_lock_irq
  dm-mpath: Don't grab work_mutex while probing paths
  dm-zone: Use bdev_*() helper functions where applicable
  dm vdo indexer: don't read request structure after enqueuing
  dm: pass through operations on wrapped inline crypto keys
  blk-crypto: export wrapped key functions
  dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits
  dm mpath: Interface for explicit probing of active paths
  dm: Allow .prepare_ioctl to handle ioctls directly
  dm-flakey: make corrupting read bios work
  dm-flakey: remove useless ERROR_READS check in flakey_end_io
  dm-flakey: error all IOs when num_features is absent
  dm-flakey: Clean up parsing messages
  dm: remove unneeded kvfree from alloc_targets
  dm-bufio: remove maximum age based eviction
  dm-verity: use softirq context only when !need_resched()
  ...
2025-06-03 15:54:46 -07:00
Mikulas Patocka
9f2f6316d7 dm-stripe: small code cleanup
This commit doesn't fix any bug, it is just code cleanup. Use the
function format_dev_t instead of sprintf, because format_dev_t does the
same thing.

Remove the useless memset call.

An unsigned integer can take at most 10 digits, so extend the array size
to 22. (note that because the range of minor and major numbers is limited,
the size 16 could not be exceeded, thus this function couldn't write
beyond string end)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-06-03 19:06:32 +02:00
Mikulas Patocka
66be40a14e dm-verity: fix a memory leak if some arguments are specified multiple times
If some of the arguments "check_at_most_once", "ignore_zero_blocks",
"use_fec_from_device", "root_hash_sig_key_desc" were specified more than
once on the target line, a memory leak would happen.

This commit fixes the memory leak. It also fixes error handling in
verity_verify_sig_parse_opt_args.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
2025-06-03 19:01:42 +02:00
Mikulas Patocka
829451beae dm-mirror: fix a tiny race condition
There's a tiny race condition in dm-mirror. The functions queue_bio and
write_callback grab a spinlock, add a bio to the list, drop the spinlock
and wake up the mirrord thread that processes bios in the list.

It may be possible that the mirrord thread processes the bio just after
spin_unlock_irqrestore is called, before wakeup_mirrord. This spurious
wake-up is normally harmless, however if the device mapper device is
unloaded just after the bio was processed, it may be possible that
wakeup_mirrord(ms) uses invalid "ms" pointer.

Fix this bug by moving wakeup_mirrord inside the spinlock.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
2025-06-03 19:01:23 +02:00
Benjamin Marzinski
85f6d5b729 dm-table: check BLK_FEAT_ATOMIC_WRITES inside limits_lock
dm_set_device_limits() should check q->limits.features for
BLK_FEAT_ATOMIC_WRITES while holding q->limits_lock, like it does for
the rest of the queue limits.

Fixes: b7c18b17a1 ("dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits")
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-06-02 14:24:51 +02:00
Linus Torvalds
7d4e49a77d - The 3 patch series "hung_task: extend blocking task stacktrace dump to
semaphore" from Lance Yang enhances the hung task detector.  The
   detector presently dumps the blocking tasks's stack when it is blocked
   on a mutex.  Lance's series extends this to semaphores.
 
 - The 2 patch series "nilfs2: improve sanity checks in dirty state
   propagation" from Wentao Liang addresses a couple of minor flaws in
   nilfs2.
 
 - The 2 patch series "scripts/gdb: Fixes related to lx_per_cpu()" from
   Illia Ostapyshyn fixes a couple of issues in the gdb scripts.
 
 - The 9 patch series "Support kdump with LUKS encryption by reusing LUKS
   volume keys" from Coiby Xu addresses a usability problem with kdump.
   When the dump device is LUKS-encrypted, the kdump kernel may not have
   the keys to the encrypted filesystem.  A full writeup of this is in the
   series [0/N] cover letter.
 
 - The 2 patch series "sysfs: add counters for lockups and stalls" from
   Max Kellermann adds /sys/kernel/hardlockup_count and
   /sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count.
 
 - The 3 patch series "fork: Page operation cleanups in the fork code"
   from Pasha Tatashin implements a number of code cleanups in fork.c.
 
 - The 3 patch series "scripts/gdb/symbols: determine KASLR offset on
   s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in
   the gdb scripts.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDuCvQAKCRDdBJ7gKXxA
 jrkxAQCnFAp/uK9ckkbN4nfpJ0+OMY36C+A+dawSDtuRsIkXBAEAq3e6MNAUdg5W
 Ca0cXdgSIq1Op7ZKEA+66Km6Rfvfow8=
 =g45L
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - "hung_task: extend blocking task stacktrace dump to semaphore" from
   Lance Yang enhances the hung task detector.

   The detector presently dumps the blocking tasks's stack when it is
   blocked on a mutex. Lance's series extends this to semaphores

 - "nilfs2: improve sanity checks in dirty state propagation" from
   Wentao Liang addresses a couple of minor flaws in nilfs2

 - "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn
   fixes a couple of issues in the gdb scripts

 - "Support kdump with LUKS encryption by reusing LUKS volume keys" from
   Coiby Xu addresses a usability problem with kdump.

   When the dump device is LUKS-encrypted, the kdump kernel may not have
   the keys to the encrypted filesystem. A full writeup of this is in
   the series [0/N] cover letter

 - "sysfs: add counters for lockups and stalls" from Max Kellermann adds
   /sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and
   /sys/kernel/rcu_stall_count

 - "fork: Page operation cleanups in the fork code" from Pasha Tatashin
   implements a number of code cleanups in fork.c

 - "scripts/gdb/symbols: determine KASLR offset on s390 during early
   boot" from Ilya Leoshkevich fixes some s390 issues in the gdb
   scripts

* tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits)
  llist: make llist_add_batch() a static inline
  delayacct: remove redundant code and adjust indentation
  squashfs: add optional full compressed block caching
  crash_dump, nvme: select CONFIGFS_FS as built-in
  scripts/gdb/symbols: determine KASLR offset on s390 during early boot
  scripts/gdb/symbols: factor out pagination_off()
  scripts/gdb/symbols: factor out get_vmlinux()
  kernel/panic.c: format kernel-doc comments
  mailmap: update and consolidate Casey Connolly's name and email
  nilfs2: remove wbc->for_reclaim handling
  fork: define a local GFP_VMAP_STACK
  fork: check charging success before zeroing stack
  fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code
  fork: clean-up ifdef logic around stack allocation
  kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
  kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
  x86/crash: make the page that stores the dm crypt keys inaccessible
  x86/crash: pass dm crypt keys to kdump kernel
  Revert "x86/mm: Remove unused __set_memory_prot()"
  crash_dump: retrieve dm crypt keys in kdump kernel
  ...
2025-05-31 19:12:53 -07:00
Yu Kuai
01bf468c4e md/md-bitmap: remove parameter slot from bitmap_create()
All callers pass in '-1' for 'slot', hence it can be removed.

Link: https://lore.kernel.org/linux-raid/20250524061320.370630-6-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-30 15:47:23 +08:00
Yu Kuai
38f520a37d md/md-bitmap: cleanup bitmap_ops->startwrite()
bitmap_startwrite() always return 0, and the caller doesn't check return
value as well, hence change the method to void.

Also rename startwrite/endwrite to start_write/end_write, which is more in
line with the usual naming convention.

Link: https://lore.kernel.org/linux-raid/20250524061320.370630-4-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-30 15:47:23 +08:00
Yu Kuai
b886475804 md/dm-raid: remove max_write_behind setting limit
The comments said 'vaule in kB', while the value actually means the
number of write_behind IOs. And since md-bitmap will automatically
adjust the value to max COUNTER_MAX / 2, there is no need to fail
early.

Also move some macros that is only used md-bitmap.c.

Link: https://lore.kernel.org/linux-raid/20250524061320.370630-15-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-30 15:47:23 +08:00
Yu Kuai
2afe17794c md/md-bitmap: fix dm-raid max_write_behind setting
It's supposed to be COUNTER_MAX / 2, not COUNTER_MAX.

Link: https://lore.kernel.org/linux-raid/20250524061320.370630-14-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-30 15:47:23 +08:00
Yu Kuai
9f346f7d4e md/raid1,raid10: don't handle IO error for REQ_RAHEAD and REQ_NOWAIT
IO with REQ_RAHEAD or REQ_NOWAIT can fail early, even if the storage medium
is fine, hence record badblocks or remove the disk from array does not
make sense.

This problem if found by lvm2 test lvcreate-large-raid, where dm-zero
will fail read ahead IO directly.

Fixes: e879a0d9cb ("md/raid1,raid10: don't ignore IO flags")
Reported-and-tested-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/all/34fa755d-62c8-4588-8ee1-33cb1249bdf2@redhat.com/
Link: https://lore.kernel.org/linux-raid/20250527081407.3004055-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-05-30 15:46:45 +08:00
Linus Torvalds
48cfc5791d hardening updates for v6.16-rc1
- Update overflow helpers to ease refactoring of on-stack flex array
   instances (Gustavo A. R. Silva, Kees Cook)
 
 - lkdtm: Use SLAB_NO_MERGE instead of constructors (Harry Yoo)
 
 - Simplify CONFIG_CC_HAS_COUNTED_BY (Jan Hendrik Farr)
 
 - Disable u64 usercopy KUnit test on 32-bit SPARC (Thomas Weißschuh)
 
 - Add missed designated initializers now exposed by fixed randstruct
   (Nathan Chancellor, Kees Cook)
 
 - Document compilers versions for __builtin_dynamic_object_size
 
 - Remove ARM_SSP_PER_TASK GCC plugin
 
 - Fix GCC plugin randstruct, add selftests, and restore COMPILE_TEST
   builds
 
 - Kbuild: induce full rebuilds when dependencies change with GCC plugins,
   the Clang sanitizer .scl file, or the randstruct seed.
 
 - Kbuild: Switch from -Wvla to -Wvla-larger-than=1
 
 - Correct several __nonstring uses for -Wunterminated-string-initialization
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCaDUq9gAKCRA2KwveOeQk
 u+ZCAQDhqpOE/yn5gfjyplIvaTtzj9CaW6g11AmPYrimJCuj3QD9G+0o35kzlXOw
 f0ZIj2U7LFNgbLos+20hQwhMFf1Zhgg=
 =OYzD
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:

 - Update overflow helpers to ease refactoring of on-stack flex array
   instances (Gustavo A. R. Silva, Kees Cook)

 - lkdtm: Use SLAB_NO_MERGE instead of constructors (Harry Yoo)

 - Simplify CONFIG_CC_HAS_COUNTED_BY (Jan Hendrik Farr)

 - Disable u64 usercopy KUnit test on 32-bit SPARC (Thomas Weißschuh)

 - Add missed designated initializers now exposed by fixed randstruct
   (Nathan Chancellor, Kees Cook)

 - Document compilers versions for __builtin_dynamic_object_size

 - Remove ARM_SSP_PER_TASK GCC plugin

 - Fix GCC plugin randstruct, add selftests, and restore COMPILE_TEST
   builds

 - Kbuild: induce full rebuilds when dependencies change with GCC
   plugins, the Clang sanitizer .scl file, or the randstruct seed.

 - Kbuild: Switch from -Wvla to -Wvla-larger-than=1

 - Correct several __nonstring uses for -Wunterminated-string-initialization

* tag 'hardening-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits)
  Revert "hardening: Disable GCC randstruct for COMPILE_TEST"
  lib/tests: randstruct: Add deep function pointer layout test
  lib/tests: Add randstruct KUnit test
  randstruct: gcc-plugin: Remove bogus void member
  net: qede: Initialize qede_ll_ops with designated initializer
  scsi: qedf: Use designated initializer for struct qed_fcoe_cb_ops
  md/bcache: Mark __nonstring look-up table
  integer-wrap: Force full rebuild when .scl file changes
  randstruct: Force full rebuild when seed changes
  gcc-plugins: Force full rebuild when plugins change
  kbuild: Switch from -Wvla to -Wvla-larger-than=1
  hardening: simplify CONFIG_CC_HAS_COUNTED_BY
  overflow: Fix direct struct member initialization in _DEFINE_FLEX()
  kunit/overflow: Add tests for STACK_FLEX_ARRAY_SIZE() helper
  overflow: Add STACK_FLEX_ARRAY_SIZE() helper
  input/joystick: magellan: Mark __nonstring look-up table const
  watchdog: exar: Shorten identity name to fit correctly
  mod_devicetable: Enlarge the maximum platform_device_id name length
  overflow: Clarify expectations for getting DEFINE_FLEX variable sizes
  compiler_types: Identify compiler versions for __builtin_dynamic_object_size
  ...
2025-05-28 07:47:10 -07:00
Mingzhe Zou
208c1559c5 bcache: reserve more RESERVE_BTREE buckets to prevent allocator hang
Reported an IO hang and unrecoverable error in our testing environment.

After careful research, we found that bch_allocator_thread is stuck,
the call stack is as follows:
[<0>] __switch_to+0xbc/0x108
[<0>] __closure_sync+0x7c/0xbc [bcache]
[<0>] bch_prio_write+0x430/0x448 [bcache]
[<0>] bch_allocator_thread+0xb44/0xb70 [bcache]
[<0>] kthread+0x124/0x130
[<0>] ret_from_fork+0x10/0x18

Moreover, the RESERVE_BTREE type bucket slot are empty and journal_full
occurs at the same time.

When the cache disk is first used, the sb.nJournal_buckets defaults to 0.
So, only 8 RESERVE_BTREE type buckets are reserved. If RESERVE_BTREE type
buckets used up or btree_check_reserve() failed when request handle btree
split, the request will be repeatedly retried and wait for alloc thread to
fill in.

After the alloc thread fills the buckets, it will call bch_prio_write().
If journal_full occurs simultaneously at this time, journal_reclaim() and
btree_flush_write() will be called sequentially, journal_write cannot be
completed.

This is a low probability event, we believe that reserve more RESERVE_BTREE
buckets can avoid the worst situation.

Fixes: 682811b3ce ("bcache: fix for allocator and register thread race")
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250527051601.74407-4-colyli@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-27 07:38:19 -06:00
Robert Pang
5a08e49f23 bcache: remove unused constants
Remove constants MAX_NEED_GC and MAX_SAVE_PRIO in btree.c that have been unused
since initial commit.

Signed-off-by: Robert Pang <robertpang@google.com>
Signed-off-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250527051601.74407-3-colyli@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-27 07:38:19 -06:00
Linggang Zeng
1e46ed947e bcache: fix NULL pointer in cache_set_flush()
1. LINE#1794 - LINE#1887 is some codes about function of
   bch_cache_set_alloc().
2. LINE#2078 - LINE#2142 is some codes about function of
   register_cache_set().
3. register_cache_set() will call bch_cache_set_alloc() in LINE#2098.

 1794 struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 1795 {
 ...
 1860         if (!(c->devices = kcalloc(c->nr_uuids, sizeof(void *), GFP_KERNEL)) ||
 1861             mempool_init_slab_pool(&c->search, 32, bch_search_cache) ||
 1862             mempool_init_kmalloc_pool(&c->bio_meta, 2,
 1863                                 sizeof(struct bbio) + sizeof(struct bio_vec) *
 1864                                 bucket_pages(c)) ||
 1865             mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size) ||
 1866             bioset_init(&c->bio_split, 4, offsetof(struct bbio, bio),
 1867                         BIOSET_NEED_BVECS|BIOSET_NEED_RESCUER) ||
 1868             !(c->uuids = alloc_bucket_pages(GFP_KERNEL, c)) ||
 1869             !(c->moving_gc_wq = alloc_workqueue("bcache_gc",
 1870                                                 WQ_MEM_RECLAIM, 0)) ||
 1871             bch_journal_alloc(c) ||
 1872             bch_btree_cache_alloc(c) ||
 1873             bch_open_buckets_alloc(c) ||
 1874             bch_bset_sort_state_init(&c->sort, ilog2(c->btree_pages)))
 1875                 goto err;
                      ^^^^^^^^
 1876
 ...
 1883         return c;
 1884 err:
 1885         bch_cache_set_unregister(c);
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 1886         return NULL;
 1887 }
 ...
 2078 static const char *register_cache_set(struct cache *ca)
 2079 {
 ...
 2098         c = bch_cache_set_alloc(&ca->sb);
 2099         if (!c)
 2100                 return err;
                      ^^^^^^^^^^
 ...
 2128         ca->set = c;
 2129         ca->set->cache[ca->sb.nr_this_dev] = ca;
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 ...
 2138         return NULL;
 2139 err:
 2140         bch_cache_set_unregister(c);
 2141         return err;
 2142 }

(1) If LINE#1860 - LINE#1874 is true, then do 'goto err'(LINE#1875) and
    call bch_cache_set_unregister()(LINE#1885).
(2) As (1) return NULL(LINE#1886), LINE#2098 - LINE#2100 would return.
(3) As (2) has returned, LINE#2128 - LINE#2129 would do *not* give the
    value to c->cache[], it means that c->cache[] is NULL.

LINE#1624 - LINE#1665 is some codes about function of cache_set_flush().
As (1), in LINE#1885 call
bch_cache_set_unregister()
---> bch_cache_set_stop()
     ---> closure_queue()
          -.-> cache_set_flush() (as below LINE#1624)

 1624 static void cache_set_flush(struct closure *cl)
 1625 {
 ...
 1654         for_each_cache(ca, c, i)
 1655                 if (ca->alloc_thread)
                          ^^
 1656                         kthread_stop(ca->alloc_thread);
 ...
 1665 }

(4) In LINE#1655 ca is NULL(see (3)) in cache_set_flush() then the
    kernel crash occurred as below:
[  846.712887] bcache: register_cache() error drbd6: cannot allocate memory
[  846.713242] bcache: register_bcache() error : failed to register device
[  846.713336] bcache: cache_set_free() Cache set 2f84bdc1-498a-4f2f-98a7-01946bf54287 unregistered
[  846.713768] BUG: unable to handle kernel NULL pointer dereference at 00000000000009f8
[  846.714790] PGD 0 P4D 0
[  846.715129] Oops: 0000 [#1] SMP PTI
[  846.715472] CPU: 19 PID: 5057 Comm: kworker/19:16 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-147.5.1.el8_1.5es.3.x86_64 #1
[  846.716082] Hardware name: ESPAN GI-25212/X11DPL-i, BIOS 2.1 06/15/2018
[  846.716451] Workqueue: events cache_set_flush [bcache]
[  846.716808] RIP: 0010:cache_set_flush+0xc9/0x1b0 [bcache]
[  846.717155] Code: 00 4c 89 a5 b0 03 00 00 48 8b 85 68 f6 ff ff a8 08 0f 84 88 00 00 00 31 db 66 83 bd 3c f7 ff ff 00 48 8b 85 48 ff ff ff 74 28 <48> 8b b8 f8 09 00 00 48 85 ff 74 05 e8 b6 58 a2 e1 0f b7 95 3c f7
[  846.718026] RSP: 0018:ffffb56dcf85fe70 EFLAGS: 00010202
[  846.718372] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  846.718725] RDX: 0000000000000001 RSI: 0000000040000001 RDI: 0000000000000000
[  846.719076] RBP: ffffa0ccc0f20df8 R08: ffffa0ce1fedb118 R09: 000073746e657665
[  846.719428] R10: 8080808080808080 R11: 0000000000000000 R12: ffffa0ce1fee8700
[  846.719779] R13: ffffa0ccc0f211a8 R14: ffffa0cd1b902840 R15: ffffa0ccc0f20e00
[  846.720132] FS:  0000000000000000(0000) GS:ffffa0ce1fec0000(0000) knlGS:0000000000000000
[  846.720726] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  846.721073] CR2: 00000000000009f8 CR3: 00000008ba00a005 CR4: 00000000007606e0
[  846.721426] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  846.721778] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  846.722131] PKRU: 55555554
[  846.722467] Call Trace:
[  846.722814]  process_one_work+0x1a7/0x3b0
[  846.723157]  worker_thread+0x30/0x390
[  846.723501]  ? create_worker+0x1a0/0x1a0
[  846.723844]  kthread+0x112/0x130
[  846.724184]  ? kthread_flush_work_fn+0x10/0x10
[  846.724535]  ret_from_fork+0x35/0x40

Now, check whether that ca is NULL in LINE#1655 to fix the issue.

Signed-off-by: Linggang Zeng <linggang.zeng@easystack.cn>
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250527051601.74407-2-colyli@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-27 07:38:19 -06:00
Linus Torvalds
6f59de9bc0 for-6.16/block-20250523
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgwnGYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq9aD/4iqOts77xhWWLrOJWkkhOcV5rREeyppq8X
 MKYul9S4cc4Uin9Xou9a+nab31QBQEk3nsN3kX9o3yAXvkh6yUm36HD8qYNW/46q
 IUkwRQQJ0COyTnexMZQNTbZPQDIYcenXmQxOcrEJ5jC1Jcz0sOKHsgekL+ab3kCy
 fLnuz2ozvjGDMala/NmE8fN5qSlj4qQABHgbamwlwfo4aWu07cwfqn5G/FCYJgDO
 xUvsnTVclom2g4G+7eSSvGQI1QyAxl5QpviPnj/TEgfFBFnhbCSoBTEY6ecqhlfW
 6u59MF/Uw8E+weiuGY4L87kDtBhjQs3UMSLxCuwH7MxXb25ff7qB4AIkcFD0kKFH
 3V5NtwqlU7aQT0xOjGxaHhfPwjLD+FVss4ARmuHS09/Kn8egOW9yROPyetnuH84R
 Oz0Ctnt1IPLFjvGeg3+rt9fjjS9jWOXLITb9Q6nX9gnCt7orCwIYke8YCpmnJyhn
 i+fV4CWYIQBBRKxIT0E/GhJxZOmL0JKpomnbpP2dH8npemnsTCuvtfdrK9gfhH2X
 chBVqCPY8MNU5zKfzdEiavPqcm9392lMzOoOXW2pSC1eAKqnAQ86ZT3r7rLntqE8
 75LxHcvaQIsnpyG+YuJVHvoiJ83TbqZNpyHwNaQTYhDmdYpp2d/wTtTQywX4DuXb
 Y6NDJw5+kQ==
 =1PNK
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - ublk updates:
      - Add support for updating the size of a ublk instance
      - Zero-copy improvements
      - Auto-registering of buffers for zero-copy
      - Series simplifying and improving GET_DATA and request lookup
      - Series adding quiesce support
      - Lots of selftests additions
      - Various cleanups

 - NVMe updates via Christoph:
      - add per-node DMA pools and use them for PRP/SGL allocations
        (Caleb Sander Mateos, Keith Busch)
      - nvme-fcloop refcounting fixes (Daniel Wagner)
      - support delayed removal of the multipath node and optionally
        support the multipath node for private namespaces (Nilay Shroff)
      - support shared CQs in the PCI endpoint target code (Wilfred
        Mallawa)
      - support admin-queue only authentication (Hannes Reinecke)
      - use the crc32c library instead of the crypto API (Eric Biggers)
      - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
        Reinecke, Leon Romanovsky, Gustavo A. R. Silva)

 - MD updates via Yu:
      - Fix that normal IO can be starved by sync IO, found by mkfs on
        newly created large raid5, with some clean up patches for bdev
        inflight counters

 - Clean up brd, getting rid of atomic kmaps and bvec poking

 - Add loop driver specifically for zoned IO testing

 - Eliminate blk-rq-qos calls with a static key, if not enabled

 - Improve hctx locking for when a plug has IO for multiple queues
   pending

 - Remove block layer bouncing support, which in turn means we can
   remove the per-node bounce stat as well

 - Improve blk-throttle support

 - Improve delay support for blk-throttle

 - Improve brd discard support

 - Unify IO scheduler switching. This should also fix a bunch of lockdep
   warnings we've been seeing, after enabling lockdep support for queue
   freezing/unfreezeing

 - Add support for block write streams via FDP (flexible data placement)
   on NVMe

 - Add a bunch of block helpers, facilitating the removal of a bunch of
   duplicated boilerplate code

 - Remove obsolete BLK_MQ pci and virtio Kconfig options

 - Add atomic/untorn write support to blktrace

 - Various little cleanups and fixes

* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
  selftests: ublk: add test for UBLK_F_QUIESCE
  ublk: add feature UBLK_F_QUIESCE
  selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
  traceevent/block: Add REQ_ATOMIC flag to block trace events
  ublk: run auto buf unregisgering in same io_ring_ctx with registering
  io_uring: add helper io_uring_cmd_ctx_handle()
  ublk: remove io argument from ublk_auto_buf_reg_fallback()
  ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
  selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
  selftests: ublk: support UBLK_F_AUTO_BUF_REG
  ublk: support UBLK_AUTO_BUF_REG_FALLBACK
  ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
  ublk: prepare for supporting to register request buffer automatically
  ublk: convert to refcount_t
  selftests: ublk: make IO & device removal test more stressful
  nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  nvme: introduce multipath_always_on module param
  nvme-multipath: introduce delayed removal of the multipath head node
  nvme-pci: derive and better document max segments limits
  nvme-pci: use struct_size for allocation struct nvme_dev
  ...
2025-05-26 11:39:36 -07:00
Mikulas Patocka
050a3e71ce dm mpath: replace spin_lock_irqsave with spin_lock_irq
Replace spin_lock_irqsave/spin_unlock_irqrestore with
spin_lock_irq/spin_unlock_irq at places where it is known that interrupts
are enabled.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
2025-05-22 15:56:42 +02:00
Benjamin Marzinski
5c977f1023 dm-mpath: Don't grab work_mutex while probing paths
Grabbing the work_mutex keeps probe_active_paths() from running at the
same time as multipath_message(). The only messages that could interfere
with probing the paths are "disable_group", "enable_group", and
"switch_group". These messages could force multipath to pick a new
pathgroup while probe_active_paths() was probing the current pathgroup.
If the multipath device has a hardware handler, and it switches active
pathgroups while there is outstanding IO to a path device, it's possible
that IO to the path will fail, even if the path would be usable if it
was in the active pathgroup. To avoid this, do not clear the current
pathgroup for the *_group messages while probe_active_paths() is
running. Instead set a flag, and probe_active_paths() will clear the
current pathgroup when it finishes probing the paths. For this to work
correctly, multipath needs to check current_pg before next_pg in
choose_pgpath(), but before this patch next_pg was only ever set when
current_pg was cleared, so this doesn't change the current behavior when
paths aren't being probed. Even with this change, it is still possible
to switch pathgroups while the probe is running, but only if all the
paths have failed, and the probe function will skip them as well in this
case.

If multiple DM_MPATH_PROBE_PATHS requests are received at once, there is
no point in repeatedly issuing test IOs. Instead, the later probes
should wait for the current probe to complete. If current pathgroup is
still the same as the one that was just checked, the other probes should
skip probing and just check the number of valid paths.  Finally, probing
the paths should quit early if the multipath device is trying to
suspend, instead of continuing to issue test IOs, delaying the suspend.

While this patch will not change the behavior of existing multipath
users which don't use the DM_MPATH_PROBE_PATHS ioctl, when that ioctl
is used, the behavior of the "disable_group", "enable_group", and
"switch_group" messages can change subtly. When these messages return,
the next IO to the multipath device will no longer be guaranteed to
choose a new pathgroup. Instead, choosing a new pathgroup could be
delayed by an in-progress DM_MPATH_PROBE_PATHS ioctl. The userspace
multipath tools make no assumptions about what will happen to IOs after
sending these messages, so this change will not effect already released
versions of them, even if the DM_MPATH_PROBE_PATHS ioctl is run
alongside them.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-16 13:23:45 +02:00
Bart Van Assche
241b9b584d dm-zone: Use bdev_*() helper functions where applicable
Improve code readability by using bdev_is_zone_aligned() and
bdev_offset_from_zone_start() where applicable. No functionality
has been changed.

This patch is a reworked version of a patch from Pankaj Raghav.

See also https://lore.kernel.org/linux-block/20220923173618.6899-11-p.raghav@samsung.com/.

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-15 15:55:07 +02:00
Matthew Sakai
3da732687d dm vdo indexer: don't read request structure after enqueuing
The function get_volume_page_protected may place a request on
a queue for another thread to process asynchronously. When this
happens, the volume should not read the request from the original
thread. This can not currently cause problems, due to the way
request processing is handled, but it is not safe in general.

Reviewed-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-15 15:54:47 +02:00
Fedor Pchelkin
f3def8270c sort.h: hoist cmp_int() into generic header file
Deduplicate the same functionality implemented in several places by
moving the cmp_int() helper macro into linux/sort.h.

The macro performs a three-way comparison of the arguments mostly useful
in different sorting strategies and algorithms.

Link: https://lkml.kernel.org/r/20250427201451.900730-1-pchelkin@ispras.ru
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: Coly Li <colyli@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Coly Li <colyli@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11 17:54:12 -07:00
Yu Kuai
752d0464b7 md: clean up accounting for issued sync IO
It's no longer used and can be removed, also remove the field
'gendisk->sync_io'.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-10-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-10 16:14:22 +08:00
Yu Kuai
e5797ae703 md: fix is_mddev_idle()
If sync_speed is above speed_min, then is_mddev_idle() will be called
for each sync IO to check if the array is idle, and inflight sync_io
will be limited if the array is not idle.

However, while mkfs.ext4 for a large raid5 array while recovery is in
progress, it's found that sync_speed is already above speed_min while
lots of stripes are used for sync IO, causing long delay for mkfs.ext4.

Root cause is the following checking from is_mddev_idle():

t1: submit sync IO: events1 = completed IO - issued sync IO
t2: submit next sync IO: events2  = completed IO - issued sync IO
if (events2 - events1 > 64)

For consequence, the more sync IO issued, the less likely checking will
pass. And when completed normal IO is more than issued sync IO, the
condition will finally pass and is_mddev_idle() will return false,
however, last_events will be updated hence is_mddev_idle() can only
return false once in a while.

Fix this problem by changing the checking as following:

1) mddev doesn't have normal IO completed;
2) mddev doesn't have normal IO inflight;
3) if any member disks is partition, and all other partitions doesn't
   have IO completed.

Also change rdev->last_events to unsigned long to cleanup type casting.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-9-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-10 16:13:31 +08:00
Yu Kuai
03720d82d7 md: add a new api sync_io_depth
Currently if sync speed is above speed_min and below speed_max,
md_do_sync() will wait for all sync IOs to be done before issuing new
sync IO, means sync IO depth is limited to just 1.

This limit is too low, in order to prevent sync speed drop conspicuously
after fixing is_mddev_idle() in the next patch, add a new api for
limiting sync IO depth, the default value is 32.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-8-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-10 16:12:52 +08:00
Yu Kuai
7168be3c8a md: record dm-raid gendisk in mddev
Following patch will use gendisk to check if there are normal IO
completed or inflight, to fix a problem in mdraid that foreground IO
can be starved by background sync IO in later patches.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-7-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-10 16:12:19 +08:00
Kees Cook
82d76bf938 md/bcache: Mark __nonstring look-up table
GCC 15's new -Wunterminated-string-initialization notices that the 16
character lookup table "zero_uuid" (which is not used as a C-String)
needs to be marked as "nonstring":

drivers/md/bcache/super.c: In function 'uuid_find_empty':
drivers/md/bcache/super.c:549:43: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (17 chars into 16 available) [-Wunterminated-string-initialization]
  549 |         static const char zero_uuid[16] = "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0";
      |                                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Add the annotation (since it is not used as a C-String), and switch the
initializer to an array of bytes rather than an empty initializer,
as preferred by Coly Li.

Suggested-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/lkml/389A9925-0990-422C-A1B3-0195FAA73288@coly.li/
Signed-off-by: Kees Cook <kees@kernel.org>
2025-05-08 09:42:06 -07:00
Christoph Hellwig
bd4e709b32 dm-integrity: use bio_add_virt_nofail
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the
bio_add_virt_nofail helper implementing it, and do the same for the
similar pattern using bio_add_page for adding the first segment after
a bio allocation as that can't fail either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:31:07 -06:00
Christoph Hellwig
9134124ce1 dm-bufio: use bio_add_virt_nofail
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the
bio_add_virt_nofail helper implementing it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:31:07 -06:00
Christoph Hellwig
23f5d69dfa bcache: use bio_add_virt_nofail
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the
bio_add_virt_nofail helper implementing it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:31:07 -06:00
Eric Biggers
e93912786e dm: pass through operations on wrapped inline crypto keys
Make the device-mapper layer pass through the derive_sw_secret,
import_key, generate_key, and prepare_key blk-crypto operations when all
underlying devices support hardware-wrapped inline crypto keys and are
passing through inline crypto support.

Commit ebc4176551 ("blk-crypto: add basic hardware-wrapped key
support") already made BLK_CRYPTO_KEY_TYPE_HW_WRAPPED be passed through
in the same way that the other crypto capabilities are.  But the wrapped
key support also includes additional operations in blk_crypto_ll_ops,
and the dm layer needs to implement those to pass them through.
derive_sw_secret is needed by fscrypt, while the other operations are
needed for the new blk-crypto ioctls to work on device-mapper devices
and not just the raw partitions.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-06 19:08:20 +02:00
Linus Torvalds
cccd033714 - dm: fix reading past the end of allocated memory
- dm: fix missing dm_put_live_table() in dm_keyslot_evict()
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCaBn8PhQcbXBhdG9ja2FA
 cmVkaGF0LmNvbQAKCRATAyx9YGnhbVCUAQDDMCRu68hiL5SWai9YXhw40rPTuC7k
 e/zHIwRsObItgAD/YvRH1d85XBhQY5x3PCHa3j1u9q+S3uF4naG1n1afqw8=
 =L5lI
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mikulas Patocka:

 - fix reading past the end of allocated memory

 - fix missing dm_put_live_table() in dm_keyslot_evict()

* tag 'for-6.15/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: fix copying after src array boundaries
  dm: add missing unlock on in dm_keyslot_evict()
2025-05-06 08:14:20 -07:00
Tudor Ambarus
f1aff4bc19 dm: fix copying after src array boundaries
The blammed commit copied to argv the size of the reallocated argv,
instead of the size of the old_argv, thus reading and copying from
past the old_argv allocated memory.

Following BUG_ON was hit:
[    3.038929][    T1] kernel BUG at lib/string_helpers.c:1040!
[    3.039147][    T1] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
...
[    3.056489][    T1] Call trace:
[    3.056591][    T1]  __fortify_panic+0x10/0x18 (P)
[    3.056773][    T1]  dm_split_args+0x20c/0x210
[    3.056942][    T1]  dm_table_add_target+0x13c/0x360
[    3.057132][    T1]  table_load+0x110/0x3ac
[    3.057292][    T1]  dm_ctl_ioctl+0x424/0x56c
[    3.057457][    T1]  __arm64_sys_ioctl+0xa8/0xec
[    3.057634][    T1]  invoke_syscall+0x58/0x10c
[    3.057804][    T1]  el0_svc_common+0xa8/0xdc
[    3.057970][    T1]  do_el0_svc+0x1c/0x28
[    3.058123][    T1]  el0_svc+0x50/0xac
[    3.058266][    T1]  el0t_64_sync_handler+0x60/0xc4
[    3.058452][    T1]  el0t_64_sync+0x1b0/0x1b4
[    3.058620][    T1] Code: f800865e a9bf7bfd 910003fd 941f48aa (d4210000)
[    3.058897][    T1] ---[ end trace 0000000000000000 ]---
[    3.059083][    T1] Kernel panic - not syncing: Oops - BUG: Fatal exception

Fix it by copying the size of src, and not the size of dst, as it was.

Fixes: 5a2a6c4281 ("dm: always update the array size in realloc_argv on success")
Cc: stable@vger.kernel.org
Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-06 14:06:59 +02:00
John Garry
b7c18b17a1 dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits
Feature flag BLK_FEAT_ATOMIC_WRITES is not being properly set for the
target queue limits, and this means that atomic writes are not being
enabled for any dm personalities.

When calling dm_set_device_limits() -> blk_stack_limits() ->
... -> blk_stack_atomic_writes_limits(), the bottom device limits
(which corresponds to intermediate target queue limits) does not have
BLK_FEAT_ATOMIC_WRITES set, and so atomic writes can never be enabled.

Typically such a flag would be inherited from the stacked device in
dm_set_device_limits() -> blk_stack_limits() via BLK_FEAT_INHERIT_MASK,
but BLK_FEAT_ATOMIC_WRITES is not inherited as it's preferred to manually
enable on a per-personality basis.

Set BLK_FEAT_ATOMIC_WRITES manually for the intermediate target queue
limits from the stacked device to get atomic writes working.

Fixes: 3194e36488 ("dm-table: atomic writes support")
Cc: stable@vger.kernel.org	# v6.14
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 12:02:39 +02:00
Kevin Wolf
7734fb4ad9 dm mpath: Interface for explicit probing of active paths
Multipath cannot directly provide failover for ioctls in the kernel
because it doesn't know what each ioctl means and which result could
indicate a path error. Userspace generally knows what the ioctl it
issued means and if it might be a path error, but neither does it know
which path the ioctl took nor does it necessarily have the privileges to
fail a path using the control device.

In order to allow userspace to address this situation, implement a
DM_MPATH_PROBE_PATHS ioctl that prompts the dm-mpath driver to probe all
active paths in the current path group to see whether they still work,
and fail them if not. If this returns success, userspace can retry the
ioctl and expect that the previously hit bad path is now failed (or
working again).

The immediate motivation for this is the use of SG_IO in QEMU for SCSI
passthrough. Following a failed SG_IO ioctl, QEMU will trigger probing
to ensure that all active paths are actually alive, so that retrying
SG_IO at least has a lower chance of failing due to a path error.
However, the problem is broader than just SG_IO (it affects any ioctl),
and if applications need failover support for other ioctls, the same
probing can be used.

This is not implemented on the DM control device, but on the DM mpath
block devices, to allow all users who have access to such a block device
to make use of this interface, specifically to implement failover for
ioctls. For the same reason, it is also unprivileged. Its implementation
is effectively just a bunch of reads, which could already be issued by
userspace, just without any guarantee that all the rights paths are
selected.

The probing implemented here is done fully synchronously path by path;
probing all paths concurrently is left as an improvement for the future.

Co-developed-by: Hanna Czenczek <hreitz@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:06 +02:00
Kevin Wolf
4862c8861d dm: Allow .prepare_ioctl to handle ioctls directly
This adds a 'bool *forward' parameter to .prepare_ioctl, which allows
device mapper targets to accept ioctls to themselves instead of the
underlying device. If the target already fully handled the ioctl, it
sets *forward to false and device mapper won't forward it to the
underlying device any more.

In order for targets to actually know what the ioctl is about and how to
handle it, pass also cmd and arg.

As long as targets restrict themselves to interpreting ioctls of type
DM_IOCTL, this is a backwards compatible change because previously, any
such ioctl would have been passed down through all device mapper layers
until it reached a device that can't understand the ioctl and would
return an error.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:05 +02:00