Commit Graph

1412587 Commits

Author SHA1 Message Date
Caleb Sander Mateos
261b67f4e3 selftests: ublk: add utility to get block device metadata size
Some block device integrity parameters are available in sysfs, but
others are only accessible using the FS_IOC_GETLBMD_CAP ioctl. Add a
metadata_size utility program to print out the logical block metadata
size, PI offset, and PI size within the metadata. Example output:
$ metadata_size /dev/ublkb0
metadata_size: 64
pi_offset: 56
pi_tuple_size: 8

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:16:38 -07:00
Caleb Sander Mateos
c1d7c0f9cd selftests: ublk: display UBLK_F_INTEGRITY support
Add support for printing the UBLK_F_INTEGRITY feature flag in the
human-readable kublk features output.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:16:38 -07:00
Caleb Sander Mateos
bfe1255712 ublk: optimize ublk_user_copy() on daemon task
ublk user copy syscalls may be issued from any task, so they take a
reference count on the struct ublk_io to check whether it is owned by
the ublk server and prevent a concurrent UBLK_IO_COMMIT_AND_FETCH_REQ
from completing the request. However, if the user copy syscall is issued
on the io's daemon task, a concurrent UBLK_IO_COMMIT_AND_FETCH_REQ isn't
possible, so the atomic reference count dance is unnecessary. Check for
UBLK_IO_FLAG_OWNED_BY_SRV to ensure the request is dispatched to the
server and obtain the request from ublk_io's req field instead of looking
it up on the tagset. Skip the reference count increment and decrement.
Commit 8a8fe42d76 ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon
task") made an analogous optimization for ublk zero copy buffer
registration.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:16:38 -07:00
Stanley Zhang
b2503e936b ublk: support UBLK_F_INTEGRITY
Now that all the components of the ublk integrity feature have been
implemented, add UBLK_F_INTEGRITY to UBLK_F_ALL, conditional on block
layer integrity support (CONFIG_BLK_DEV_INTEGRITY). This allows ublk
servers to create ublk devices with UBLK_F_INTEGRITY set and
UBLK_U_CMD_GET_FEATURES to report the feature as supported.

Signed-off-by: Stanley Zhang <stazhang@purestorage.com>
[csander: make feature conditional on CONFIG_BLK_DEV_INTEGRITY]
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:16:38 -07:00
Stanley Zhang
be82a89066 ublk: implement integrity user copy
Add a function ublk_copy_user_integrity() to copy integrity information
between a request and a user iov_iter. This mirrors the existing
ublk_copy_user_pages() but operates on request integrity data instead of
regular data. Check UBLKSRV_IO_INTEGRITY_FLAG in iocb->ki_pos in
ublk_user_copy() to choose between copying data or integrity data.

[csander: change offset units from data bytes to integrity data bytes,
 fix CONFIG_BLK_DEV_INTEGRITY=n build, rebase on user copy refactor]

Signed-off-by: Stanley Zhang <stazhang@purestorage.com>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:16:38 -07:00
Caleb Sander Mateos
fd5a005fa6 ublk: move offset check out of __ublk_check_and_get_req()
__ublk_check_and_get_req() checks that the passed in offset is within
the data length of the specified ublk request. However, only user copy
(ublk_check_and_get_req()) supports accessing ublk request data at a
nonzero offset. Zero-copy buffer registration (ublk_register_io_buf())
always passes 0 for the offset, so the check is unnecessary. Move the
check from __ublk_check_and_get_req() to ublk_check_and_get_req().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:16:33 -07:00
Caleb Sander Mateos
ca80afd870 ublk: inline ublk_check_and_get_req() into ublk_user_copy()
ublk_check_and_get_req() has a single callsite in ublk_user_copy(). It
takes a ton of arguments in order to pass local variables from
ublk_user_copy() to ublk_check_and_get_req() and vice versa. And more
are about to be added. Combine the functions to reduce the argument
passing noise.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:15:05 -07:00
Caleb Sander Mateos
5bfbbc9938 ublk: split out ublk_user_copy() helper
ublk_ch_read_iter() and ublk_ch_write_iter() are nearly identical except
for the iter direction. Split out a helper function ublk_user_copy() to
reduce the code duplication as these functions are about to get larger.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:15:05 -07:00
Caleb Sander Mateos
fc652d415c ublk: split out ublk_copy_user_bvec() helper
Factor a helper function ublk_copy_user_bvec() out of
ublk_copy_user_pages(). It will be used for copying integrity data too.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:15:05 -07:00
Caleb Sander Mateos
f82f0a16a8 ublk: set UBLK_IO_F_INTEGRITY in ublksrv_io_desc
Indicate to the ublk server when an incoming request has integrity data
by setting UBLK_IO_F_INTEGRITY in the ublksrv_io_desc's op_flags field.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:15:05 -07:00
Stanley Zhang
98bf225685 ublk: support UBLK_PARAM_TYPE_INTEGRITY in device creation
Add a feature flag UBLK_F_INTEGRITY for a ublk server to request
integrity/metadata support when creating a ublk device. The ublk server
can also check for the feature flag on the created device or the result
of UBLK_U_CMD_GET_FEATURES to tell if the ublk driver supports it.
UBLK_F_INTEGRITY requires UBLK_F_USER_COPY, as user copy is the only
data copy mode initially supported for integrity data.
Add UBLK_PARAM_TYPE_INTEGRITY and struct ublk_param_integrity to struct
ublk_params to specify the integrity params of a ublk device.
UBLK_PARAM_TYPE_INTEGRITY requires UBLK_F_INTEGRITY and a nonzero
metadata_size. The LBMD_PI_CAP_* and LBMD_PI_CSUM_* values from the
linux/fs.h UAPI header are used for the flags and csum_type fields.
If the UBLK_PARAM_TYPE_INTEGRITY flag is set, validate the integrity
parameters and apply them to the blk_integrity limits.
The struct ublk_param_integrity validations are based on the checks in
blk_validate_integrity_limits(). Any invalid parameters should be
rejected before being applied to struct blk_integrity.

[csander: drop redundant pi_tuple_size field, use block metadata UAPI
 constants, add param validation]

Signed-off-by: Stanley Zhang <stazhang@purestorage.com>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:15:05 -07:00
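
The dependency rules stated in 98bf225685 can be summarized in a small standalone sketch. This is not the driver code: the EX_* bit values and the ex_params struct are hypothetical stand-ins for the real UAPI definitions, and only the three rules named in the commit message are modeled (UBLK_F_INTEGRITY requires UBLK_F_USER_COPY; UBLK_PARAM_TYPE_INTEGRITY requires UBLK_F_INTEGRITY and a nonzero metadata_size).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values for illustration only; the real constants are
 * defined in the ublk UAPI header and are not reproduced here. */
#define EX_F_USER_COPY          (1u << 0)
#define EX_F_INTEGRITY          (1u << 1)
#define EX_PARAM_TYPE_INTEGRITY (1u << 0)

/* Hypothetical, heavily reduced stand-in for struct ublk_params. */
struct ex_params {
	uint32_t types;         /* which parameter groups are present */
	uint32_t metadata_size; /* metadata bytes per logical block */
};

/* The integrity feature flag is only valid together with user copy. */
bool ex_dev_flags_valid(uint32_t dev_flags)
{
	if ((dev_flags & EX_F_INTEGRITY) && !(dev_flags & EX_F_USER_COPY))
		return false;
	return true;
}

/* Integrity params need the feature flag and a nonzero metadata size. */
bool ex_integrity_params_valid(uint32_t dev_flags, const struct ex_params *p)
{
	assert(p != NULL);
	if (!(p->types & EX_PARAM_TYPE_INTEGRITY))
		return true; /* no integrity params supplied, nothing to check */
	if (!(dev_flags & EX_F_INTEGRITY))
		return false;
	return p->metadata_size != 0;
}
```

The actual validation described in the commit also mirrors the checks in blk_validate_integrity_limits(), which this sketch leaves out.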
Caleb Sander Mateos
e859e7c26a ublk: move ublk flag check functions earlier
ublk_dev_support_user_copy() will be used in ublk_validate_params().
Move these functions next to ublk_{dev,queue}_is_zoned() to avoid
needing to forward-declare them.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:15:05 -07:00
Caleb Sander Mateos
835042fb19 blk-integrity: take const pointer in blk_integrity_rq()
blk_integrity_rq() doesn't modify the struct request passed in, so allow
a const pointer to be passed. Use a matching signature for the
!CONFIG_BLK_DEV_INTEGRITY version.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 09:15:05 -07:00
Jens Axboe
5df832ba5f Merge branch 'block-6.19' into for-7.0/block
Merge in fixes that went to 6.19 after for-7.0/block was branched.
Pending ublk changes depend in particular on the async scan work.

* block-6.19:
  block: zero non-PI portion of auto integrity buffer
  ublk: fix use-after-free in ublk_partition_scan_work
  blk-mq: avoid stall during boot due to synchronize_rcu_expedited
  loop: add missing bd_abort_claiming in loop_set_status
  block: don't merge bios with different app_tags
  blk-rq-qos: Remove unlikely() hints from QoS checks
  loop: don't change loop device under exclusive opener in loop_set_status
  block, bfq: update outdated comment
  blk-mq: skip CPU offline notify on unmapped hctx
  selftests/ublk: fix Makefile to rebuild on header changes
  selftests/ublk: add test for async partition scan
  ublk: scan partition in async way
  block,bfq: fix aux stat accumulation destination
  md: Fix forward incompatibility from configurable logical block size
  md: Fix logical_block_size configuration being overwritten
  md: suspend array while updating raid_disks via sysfs
  md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()
  md: Fix static checker warning in analyze_sbs
2026-01-11 13:16:36 -07:00
Christoph Hellwig
bb8e2019ad blk-crypto: handle the fallback above the block layer
Add a blk_crypto_submit_bio helper that either submits the bio when
it is not encrypted or inline encryption is provided, but otherwise
handles the encryption before going down into the low-level driver.
This reduces the risk from bio reordering and keeps memory allocation
as high up in the stack as possible.

Note that if the submitter knows that inline encryption is supported by
the underlying driver, it can still use plain submit_bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 12:55:41 -07:00
Christoph Hellwig
66e5a11d2e blk-crypto: optimize data unit alignment checking
Avoid the relatively high overhead of constructing and walking per-page
segment bio_vecs for data unit alignment checking by merging the checks
into existing loops.

For hardware-supported crypto, perform the check in bio_split_io_at, which
already contains a similar alignment check applied for all I/O.  This
means bio-based drivers that do not call bio_split_to_limits, should they
ever grow blk-crypto support, need to implement the check themselves,
just like all other queue limits checks.

For blk-crypto-fallback do it in the encryption/decryption loops.  This
means alignment errors for decryption will only be detected after I/O
has completed, but that seems like a worthwhile trade off.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 12:55:41 -07:00
Christoph Hellwig
3d939695e6 blk-crypto: use mempool_alloc_bulk for encrypted bio page allocation
Calling mempool_alloc in a loop is not safe unless the maximum allocation
size times the maximum number of threads using it is less than the
minimum pool size.  Use the new mempool_alloc_bulk helper to allocate
all missing elements in one pass to remove this deadlock risk.  This
also means that non-pool allocations now use alloc_pages_bulk which can
be significantly faster than a loop over individual page allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 12:55:41 -07:00
Christoph Hellwig
2f655dcb2d blk-crypto: use on-stack skcipher requests for fallback en/decryption
Allocating a skcipher request dynamically can deadlock or cause
unexpected I/O failures when called from writeback context.  Avoid the
allocation entirely by using on-stack skciphers, similar to what the
non-blk-crypto fscrypt path already does.

This drops the incomplete support for asynchronous algorithms, which
previously could be used, but only synchronously.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 12:55:41 -07:00
Christoph Hellwig
b37fbce460 blk-crypto: optimize bio splitting in blk_crypto_fallback_encrypt_bio
The current code in blk_crypto_fallback_encrypt_bio is inefficient and
prone to deadlocks under memory pressure: It first walks the passed in
plaintext bio to see how much of it can fit into a single encrypted
bio using up to BIO_MAX_VEC PAGE_SIZE segments, and then allocates a
plaintext clone that fits the size, only to allocate another bio for
the ciphertext later.  While the plaintext clone uses a bioset to avoid
deadlocks when allocations could fail, the ciphertext one uses bio_kmalloc
which is a no-go in the file system I/O path.

Switch blk_crypto_fallback_encrypt_bio to walk the source plaintext bio
while consuming bi_iter without cloning it, and instead allocate a
ciphertext bio at the beginning and whenever we fill up the previous
one.  The existing bio_set for the plaintext clones is reused for the
ciphertext bios to remove the deadlock risk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 12:55:41 -07:00
Christoph Hellwig
aefc2a1fa2 blk-crypto: submit the encrypted bio in blk_crypto_fallback_bio_prep
Restructure blk_crypto_fallback_bio_prep so that it always submits the
encrypted bio instead of passing it back to the caller, which allows
simplifying the calling conventions for blk_crypto_fallback_bio_prep and
blk_crypto_bio_prep so that they never have to return a bio, and can
use a true return value to indicate that the caller should submit the
bio, and false that the blk-crypto code consumed it.

The submission is handled by the on-stack bio list in the current
task_struct by the block layer and does not cause additional stack
usage or major overhead.  It also prepares for the following optimization
and fixes for the blk-crypto fallback write path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 12:55:41 -07:00
Christoph Hellwig
a3cc978e61 blk-crypto: add a bio_crypt_ctx() helper
This returns the bio_crypt_ctx if CONFIG_BLK_INLINE_ENCRYPTION is enabled
and a crypto context is attached to the bio, else NULL.

The use case is to allow safely dereferencing the context in common code
without needing #ifdef CONFIG_BLK_INLINE_ENCRYPTION.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 12:55:41 -07:00
Christoph Hellwig
bc26e2efa2 fscrypt: keep multiple bios in flight in fscrypt_zeroout_range_inline_crypt
This should slightly improve performance for large zeroing operations,
but more importantly prepares for blk-crypto refactoring that requires
all fscrypt users to call submit_bio directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 12:55:41 -07:00
Christoph Hellwig
c22756a997 fscrypt: pass a real sector_t to fscrypt_zeroout_range_inline_crypt
While the pblk argument to fscrypt_zeroout_range_inline_crypt is
declared as a sector_t it actually is interpreted as a logical block
size unit, which is highly unusual.  Switch to passing the 512 byte
units that sector_t is defined for.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-11 12:55:41 -07:00
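
The unit change in c22756a997 comes down to one shift. The sketch below is a standalone illustration, not the fscrypt code: the function name and the blkbits parameter (log2 of the logical block size) are assumptions made for the example, while the 512-byte sector unit is the one the commit message refers to.

```c
#include <assert.h>
#include <stdint.h>

#define EX_SECTOR_SHIFT 9 /* a sector counts 512-byte units */

/* Convert a logical block number into a 512-byte sector number.
 * blkbits is log2 of the logical block size, e.g. 12 for 4096 bytes. */
uint64_t ex_block_to_sector(uint64_t block, unsigned int blkbits)
{
	assert(blkbits >= EX_SECTOR_SHIFT && blkbits < 64);
	return block << (blkbits - EX_SECTOR_SHIFT);
}
```

With 4096-byte logical blocks, block 10 starts at sector 80; with 512-byte blocks the two numbers coincide.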
Ming Lei
f7ba87dfa8 block: account for bi_bvec_done in bio_may_need_split()
When checking if a bio fits in a single segment, bio_may_need_split()
compares bi_size against the current bvec's bv_len. However, for
partially consumed bvecs (bi_bvec_done > 0), such as in cloned or
split bios, the number of bytes remaining in the current bvec is actually
(bv_len - bi_bvec_done), not bv_len.

This could cause bio_may_need_split() to incorrectly return false,
leading to nr_phys_segments being set to 1 when the bio actually
spans multiple segments. This triggers the WARN_ON in __blk_rq_map_sg()
when the actual mapped segments exceed the expected count.

Fix by subtracting bi_bvec_done from bv_len in the comparison.

Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Closes: https://lore.kernel.org/linux-block/9687cf2b-1f32-44e1-b58d-2492dc6e7185@linux.ibm.com/
Reported-and-bisected-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Tested-by: Christoph Hellwig <hch@infradead.org>
Fixes: ee623c892a ("block: use bvec iterator helper for bio_may_need_split()")
Cc: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-10 10:22:54 -07:00
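
The off-by-consumed-bytes comparison fixed in f7ba87dfa8 can be reproduced with a reduced model. The structs and function names below are hypothetical stand-ins (only the field names bv_len, bi_size and bi_bvec_done come from the commit message), and the real bio_may_need_split() performs other checks that are not modeled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-ins for the kernel's bio_vec and bvec_iter. */
struct ex_bvec {
	unsigned int bv_len;       /* total length of this bvec */
};

struct ex_iter {
	unsigned int bi_size;      /* bytes left in the whole bio */
	unsigned int bi_bvec_done; /* bytes already consumed in current bvec */
};

/* Pre-fix comparison: ignores the consumed part of the current bvec. */
bool ex_fits_current_bvec_old(const struct ex_bvec *bv,
			      const struct ex_iter *it)
{
	return it->bi_size <= bv->bv_len;
}

/* Post-fix comparison: only the unconsumed remainder can hold bio data. */
bool ex_fits_current_bvec_fixed(const struct ex_bvec *bv,
				const struct ex_iter *it)
{
	assert(it->bi_bvec_done <= bv->bv_len);
	return it->bi_size <= bv->bv_len - it->bi_bvec_done;
}
```

For a 4096-byte bvec with 1024 bytes already consumed and 4096 bytes left in the bio, the old comparison reports a single-segment fit, while the fixed one correctly reports that the bio continues into the next bvec.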
Caleb Sander Mateos
a31bde687b block: use pi_tuple_size in bi_offload_capable()
bi_offload_capable() returns whether a block device's metadata size
matches its PI tuple size. Use pi_tuple_size instead of switching on
csum_type. This makes the code considerably simpler and less branchy.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-10 10:22:54 -07:00
Caleb Sander Mateos
ca22c566b8 block: zero non-PI portion of auto integrity buffer
The auto-generated integrity buffer for writes needs to be fully
initialized before being passed to the underlying block device,
otherwise the uninitialized memory can be read back by userspace or
anyone with physical access to the storage device. If protection
information is generated, that portion of the integrity buffer is
already initialized. The integrity data is also zeroed if PI generation
is disabled via sysfs or the PI tuple size is 0. However, this misses
the case where PI is generated and the PI tuple size is nonzero, but the
metadata size is larger than the PI tuple. In this case, the remainder
("opaque") of the metadata is left uninitialized.
Generalize the BLK_INTEGRITY_CSUM_NONE check to cover any case when the
metadata is larger than just the PI tuple.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: c546d6f438 ("block: only zero non-PI metadata tuples in bio_integrity_prep")
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-10 10:22:07 -07:00
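
A minimal sketch of the generalized condition from ca22c566b8, assuming the decision can be reduced to three inputs. The function name and parameters are hypothetical; the block layer code works on struct blk_integrity and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Decide whether an auto-generated integrity buffer must be zeroed before
 * it is handed to the device. Hypothetical helper for illustration. */
bool ex_integrity_buf_needs_zeroing(unsigned int metadata_size,
				    unsigned int pi_tuple_size,
				    bool pi_generation_enabled)
{
	assert(pi_tuple_size <= metadata_size);

	/* Nothing will be generated, so every byte would stay uninitialized. */
	if (!pi_generation_enabled)
		return true;

	/* PI generation fills only pi_tuple_size bytes per block; any
	 * remaining "opaque" metadata bytes would otherwise leak memory.
	 * This also covers the pi_tuple_size == 0 case. */
	return metadata_size > pi_tuple_size;
}
```

With a 64-byte metadata area and an 8-byte PI tuple, the 56 opaque bytes per block are the ones the earlier check missed.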
Ming Lei
f0d385f668 ublk: fix use-after-free in ublk_partition_scan_work
A race condition exists between the async partition scan work and device
teardown that can lead to a use-after-free of ub->ub_disk:

1. ublk_ctrl_start_dev() schedules partition_scan_work after add_disk()
2. ublk_stop_dev() calls ublk_stop_dev_unlocked() which does:
   - del_gendisk(ub->ub_disk)
   - ublk_detach_disk() sets ub->ub_disk = NULL
   - put_disk() which may free the disk
3. The worker ublk_partition_scan_work() then dereferences ub->ub_disk
   leading to UAF

Fix this by using ublk_get_disk()/ublk_put_disk() in the worker to hold
a reference to the disk during the partition scan. The spinlock in
ublk_get_disk() synchronizes with ublk_detach_disk() ensuring the worker
either gets a valid reference or sees NULL and exits early.

Also change flush_work() to cancel_work_sync() to avoid running the
partition scan work unnecessarily when the disk is already detached.

Fixes: 7fc4da6a30 ("ublk: scan partition in async way")
Reported-by: Ruikai Peng <ruikai@pwno.io>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-09 06:55:30 -07:00
Mikulas Patocka
9670db22e7 blk-mq: avoid stall during boot due to synchronize_rcu_expedited
On kernel 6.19-rc, I am experiencing a 15-second boot stall in a
virtual machine when probing a virtio-scsi disk:
[    1.011641] SCSI subsystem initialized
[    1.013972] virtio_scsi virtio6: 16/0/0 default/read/poll queues
[    1.015983] scsi host0: Virtio SCSI HBA
[    1.019578] ACPI: \_SB_.GSIA: Enabled at IRQ 16
[    1.020225] ahci 0000:00:1f.2: AHCI vers 0001.0000, 32 command slots, 1.5 Gbps, SATA mode
[    1.020228] ahci 0000:00:1f.2: 6/6 ports implemented (port mask 0x3f)
[    1.020230] ahci 0000:00:1f.2: flags: 64bit ncq only
[    1.024688] scsi host1: ahci
[    1.025432] scsi host2: ahci
[    1.025966] scsi host3: ahci
[    1.026511] scsi host4: ahci
[    1.028371] scsi host5: ahci
[    1.028918] scsi host6: ahci
[    1.029266] ata1: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23100 irq 16 lpm-pol 1
[    1.029305] ata2: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23180 irq 16 lpm-pol 1
[    1.029316] ata3: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23200 irq 16 lpm-pol 1
[    1.029327] ata4: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23280 irq 16 lpm-pol 1
[    1.029341] ata5: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23300 irq 16 lpm-pol 1
[    1.029356] ata6: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23380 irq 16 lpm-pol 1
[    1.118111] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
[    1.348916] ata1: SATA link down (SStatus 0 SControl 300)
[    1.350713] ata2: SATA link down (SStatus 0 SControl 300)
[    1.351025] ata6: SATA link down (SStatus 0 SControl 300)
[    1.351160] ata5: SATA link down (SStatus 0 SControl 300)
[    1.351326] ata3: SATA link down (SStatus 0 SControl 300)
[    1.351536] ata4: SATA link down (SStatus 0 SControl 300)
[    1.449153] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input2
[   16.483477] sd 0:0:0:0: Power-on or device reset occurred
[   16.483691] sd 0:0:0:0: [sda] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
[   16.483762] sd 0:0:0:0: [sda] Write Protect is off
[   16.483877] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   16.569225] sd 0:0:0:0: [sda] Attached SCSI disk

I bisected it and it is caused by the commit 89e1fb7cef which
introduces calls to synchronize_rcu_expedited.

This commit replaces synchronize_rcu_expedited and kfree with a call to
kfree_rcu_mightsleep, avoiding the 15-second delay.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 89e1fb7cef ("blk-mq: fix potential uaf for 'queue_hw_ctx'")
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-07 08:10:42 -07:00
Ming Lei
15f506a77a io_uring: remove nr_segs recalculation in io_import_kbuf()
io_import_kbuf() recalculates iter->nr_segs to reflect only the bvecs
needed for the requested byte range. This was added to provide an
accurate segment count to bio_iov_bvec_set(), which copied nr_segs to
bio->bi_vcnt for use as a bio split hint.

The previous two patches eliminated this dependency:
 - bio_may_need_split() now uses bi_iter instead of bi_vcnt for split
   decisions
 - bio_iov_bvec_set() no longer copies nr_segs to bi_vcnt

Since nr_segs is no longer used for bio split decisions, the
recalculation loop is unnecessary. The iov_iter already has the correct
bi_size to cap iteration, so an oversized nr_segs is harmless.

Link: https://lkml.org/lkml/2025/4/16/351
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-07 08:06:33 -07:00
Ming Lei
6418643148 block: don't initialize bi_vcnt for cloned bio in bio_iov_bvec_set()
bio_iov_bvec_set() creates a cloned bio that borrows a bvec array from
an iov_iter. For cloned bios, bi_vcnt is meaningless because iteration
is controlled entirely by bi_iter (bi_idx, bi_size, bi_bvec_done), not
by bi_vcnt. Remove the incorrect bi_vcnt assignment.

Explicitly initialize bi_iter.bi_idx to 0 to ensure iteration starts
at the first bvec. While bi_idx is typically already zero from bio
initialization, making this explicit improves clarity and correctness.

This change also avoids accessing iter->nr_segs, which is an iov_iter
implementation detail that block code should not depend on.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-07 08:06:33 -07:00
Ming Lei
ee623c892a block: use bvec iterator helper for bio_may_need_split()
bio_may_need_split() uses bi_vcnt to determine if a bio has a single
segment, but bi_vcnt is unreliable for cloned bios. Cloned bios share
the parent's bi_io_vec array but iterate over a subset via bi_iter,
so bi_vcnt may not reflect the actual segment count being iterated.

Replace the bi_vcnt check with bvec iterator access via
__bvec_iter_bvec(), comparing bi_iter.bi_size against the current
bvec's length. This correctly handles both cloned and non-cloned bios.

Move bi_io_vec into the first cache line adjacent to bi_iter. This is
a sensible layout since bi_io_vec and bi_iter are commonly accessed
together throughout the block layer - every bvec iteration requires
both fields. This displaces bi_end_io to the second cache line, which
is acceptable since bi_end_io and bi_private are always fetched
together in bio_endio() anyway.

The struct layout change requires bio_reset() to preserve and restore
bi_io_vec across the memset, since it now falls within BIO_RESET_BYTES.

Nitesh verified that this patch doesn't regress NVMe 512-byte IO perf [1].

Link: https://lore.kernel.org/linux-block/20251220081607.tvnrltcngl3cc2fh@green245.gost/ [1]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-07 08:06:33 -07:00
Tetsuo Handa
2704024d83 loop: add missing bd_abort_claiming in loop_set_status
Commit 08e136ebd1 ("loop: don't change loop device under exclusive
opener in loop_set_status") forgot to call bd_abort_claiming() when
mutex_lock_killable() failed.

Fixes: 08e136ebd1 ("loop: don't change loop device under exclusive opener in loop_set_status")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-07 08:04:42 -07:00
Caleb Sander Mateos
6acd4ac5f8 block: don't merge bios with different app_tags
nvme_set_app_tag() uses the app_tag value from the bio_integrity_payload
of the struct request's first bio. This assumes all the request's bios
have the same app_tag. However, it is possible for bios with different
app_tag values to be merged into a single request.
Add a check in blk_integrity_merge_{bio,rq}() to prevent the merging of
bios/requests with different app_tag values if BIP_CHECK_APPTAG is set.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 3d8b5a22d4 ("block: add support to pass user meta buffer")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 19:10:08 -07:00
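
The merge rule added in 6acd4ac5f8 can be sketched as a predicate over two integrity payloads. The struct, the flag bit value, and the choice to honor the flag when either side sets it are assumptions made for this example; the real blk_integrity_merge_{bio,rq}() helpers check more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_BIP_CHECK_APPTAG (1u << 0) /* hypothetical bit value */

/* Hypothetical, reduced stand-in for struct bio_integrity_payload. */
struct ex_bip {
	unsigned int flags;
	uint16_t app_tag;
};

/* Two payloads may share a request only if the device will not be asked
 * to check an app_tag that differs between them. */
bool ex_app_tags_mergeable(const struct ex_bip *a, const struct ex_bip *b)
{
	assert(a != NULL && b != NULL);
	if (((a->flags | b->flags) & EX_BIP_CHECK_APPTAG) &&
	    a->app_tag != b->app_tag)
		return false;
	return true;
}
```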
Breno Leitao
7d121d701d blk-rq-qos: Remove unlikely() hints from QoS checks
The unlikely() annotations on QUEUE_FLAG_QOS_ENABLED checks are
counterproductive. Writeback throttling (WBT) might be enabled by
default, mainly because CONFIG_BLK_WBT_MQ defaults to 'y'.

Branch profiling on Meta servers, which have WBT enabled, confirms 100%
misprediction rates on these checks.

Remove the unlikely() annotations to let the CPU's branch predictor
learn the actual behavior, potentially improving I/O path performance.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 19:08:23 -07:00
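
For readers unfamiliar with the annotation removed in 7d121d701d: unlikely() is a compiler hint built on the GCC/Clang __builtin_expect builtin. The sketch below uses a hypothetical flag bit and function names; it shows that the hint never changes the result of the check, only how the compiler lays out the branch, which is why a wrong hint costs performance rather than correctness.

```c
#include <assert.h>
#include <stdbool.h>

/* Same shape as the kernel macro: tell the compiler the condition is
 * expected to be false. Requires GCC or Clang. */
#define ex_unlikely(x) __builtin_expect(!!(x), 0)

#define EX_QOS_ENABLED (1u << 0) /* hypothetical flag bit */

/* Hinted variant: the taken branch is treated as the cold path. */
bool ex_qos_enabled_hinted(unsigned int queue_flags)
{
	if (ex_unlikely(queue_flags & EX_QOS_ENABLED))
		return true;
	return false;
}

/* Plain variant: no static hint, the CPU's branch predictor decides. */
bool ex_qos_enabled_plain(unsigned int queue_flags)
{
	if (queue_flags & EX_QOS_ENABLED)
		return true;
	return false;
}

/* Both variants always agree on the answer. */
void ex_hint_demo(void)
{
	assert(ex_qos_enabled_hinted(EX_QOS_ENABLED) ==
	       ex_qos_enabled_plain(EX_QOS_ENABLED));
	assert(ex_qos_enabled_hinted(0) == ex_qos_enabled_plain(0));
}
```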
Raphael Pinsonneault-Thibeault
08e136ebd1 loop: don't change loop device under exclusive opener in loop_set_status
loop_set_status() is allowed to change the loop device while there
are other openers of the device, even exclusive ones.

In this case, it causes a KASAN: slab-out-of-bounds Read in
ext4_search_dir(), since when looking for an entry in an inlined
directory, e_value_offs is changed underneath the filesystem by
loop_set_status().

Fix the problem by forbidding loop_set_status() from modifying the loop
device while there are exclusive openers of the device. This is similar
to the fix in loop_configure() by commit 33ec3e53e7 ("loop: Don't
change loop device under exclusive opener") alongside commit ecbe6bc000
("block: use bd_prepare_to_claim directly in the loop driver").

Reported-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=3ee481e21fd75e14c397
Tested-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com
Tested-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Signed-off-by: Raphael Pinsonneault-Thibeault <rpthibeault@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 05:30:18 -07:00
Md Haris Iqbal
69d26698e4 rnbd-srv: Zero the rsp buffer before using it
Before using the data buffer to send back the response message, zero it
completely. This prevents stray bytes from being picked up by the client
side when the message is exchanged between different protocol
versions.

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 05:28:10 -07:00
Florian-Ewald Mueller
4ac9690d4b rnbd-srv: Fix server side setting of bi_size for special IOs
On rnbd-srv, the bi_size of the bio is set during the bio_add_page
function, to which datalen is passed. But for special IOs like DISCARD
and WRITE_ZEROES, datalen is 0, since there is no data to write. For
these special IOs, use the bi_size of the rnbd_msg_io.

Fixes: f6f84be089 ("block/rnbd-srv: Add sanity check and remove redundant assignment")
Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 05:28:10 -07:00
Jack Wang
e1384543e8 rnbd-srv: fix the trace format for flags
The __print_flags helper is meant for bitmasks, while rnbd_rw_flags
mixes a bitmask with an enum. To avoid confusion, just print the data
as it is.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 05:28:10 -07:00
Md Haris Iqbal
ef63e9ef76 block/rnbd-proto: Check and retain the NOUNMAP flag for requests
The NOUNMAP flag is used in combination with the WRITE_ZEROES flag to
indicate that the upper layers want the sectors zeroed, but do not want
them to be freed. This instruction is especially important for storage
stacks that involve a layer capable of thin provisioning.

This commit makes the RNBD block device transfer and retain this NOUNMAP
flag for requests, so it can be passed on to the backend device on the
server side.

Since it is a change in the wire protocol, bump the minor version of the
protocol.

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 05:28:10 -07:00
Zhu Yanjun
581cf833ca block: rnbd: add .release to rnbd_dev_ktype
Every ktype must provide a .release function that will be called after
the last kobject_put.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 05:28:10 -07:00
Md Haris Iqbal
483cbec342 block/rnbd-proto: Handle PREFLUSH flag properly for IOs
In RNBD client, for a WRITE request of size 0, with only the REQ_PREFLUSH
bit set, while converting from bio_opf to rnbd_opf, we do REQ_OP_WRITE to
RNBD_OP_WRITE, and then check if the rq is flush through function
op_is_flush. That function checks both REQ_PREFLUSH and REQ_FUA flag, and
if any of them is set, the RNBD_F_FUA is set.
On the RNBD server side, while converting the RNBD flags to req flags, if
the RNBD_F_FUA flag is set, we just set the REQ_FUA flag. This means we
have lost the PREFLUSH flag, and added the REQ_FUA flag in its place.

This commit adds a new RNBD_F_PREFLUSH flag, and also adds separate
handling for REQ_PREFLUSH flag. On the server side, if the RNBD_F_PREFLUSH
is present, the REQ_PREFLUSH is added to the bio.

Since it is a change in the wire protocol, bump the minor version of
protocol.
The change is backwards compatible, and does not change the functionality
if either the client or the server is running older/newer versions.
If the client side is running the older version, both REQ_PREFLUSH and
REQ_FUA are converted to RNBD_F_FUA. The server running the newer one
would still add only the REQ_FUA flag, which is what happens when both
client and server are running the older version.
If the client side is running the newer version, just like before a
RNBD_F_FUA is added, but now a RNBD_F_PREFLUSH is also added to the
rnbd_opf. In case the server is running the older version the
RNBD_F_PREFLUSH is ignored, and only the RNBD_F_FUA is processed.
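
The compatibility cases above can be modelled with a small stand-alone
sketch. This is not the RNBD code: the SKETCH_* bit values, the function
names and the new_client/new_server switches are all hypothetical and
only mirror the behaviour described in this message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values for illustration only; these are NOT the
 * kernel's REQ_* or RNBD's RNBD_F_* encodings. */
#define SKETCH_REQ_FUA          (1u << 0)
#define SKETCH_REQ_PREFLUSH     (1u << 1)
#define SKETCH_RNBD_F_FUA       (1u << 0)
#define SKETCH_RNBD_F_PREFLUSH  (1u << 1)

/* Client side: block-layer flags to wire flags. */
static uint32_t sketch_client_to_wire(uint32_t req_flags, bool new_client)
{
	uint32_t wire = 0;

	/* Old and new clients both set FUA if either flag is present. */
	if (req_flags & (SKETCH_REQ_FUA | SKETCH_REQ_PREFLUSH))
		wire |= SKETCH_RNBD_F_FUA;
	/* Only a new client also carries PREFLUSH over the wire. */
	if (new_client && (req_flags & SKETCH_REQ_PREFLUSH))
		wire |= SKETCH_RNBD_F_PREFLUSH;
	return wire;
}

/* Server side: wire flags back to block-layer flags. */
static uint32_t sketch_server_from_wire(uint32_t wire, bool new_server)
{
	uint32_t req = 0;

	if (wire & SKETCH_RNBD_F_FUA)
		req |= SKETCH_REQ_FUA;
	/* An old server does not know the PREFLUSH bit and ignores it. */
	if (new_server && (wire & SKETCH_RNBD_F_PREFLUSH))
		req |= SKETCH_REQ_PREFLUSH;
	return req;
}
```

In this model a PREFLUSH-only request reaches the backend with PREFLUSH
set only when both sides are new; every other combination yields FUA
alone, as before.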

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06 05:28:10 -07:00
Julia Lawall
69153e8b97 block, bfq: update outdated comment
The function bfq_bfqq_may_idle() was renamed as bfq_better_to_idle()
in commit 277a4a9b56 ("block, bfq: give a better name to
bfq_bfqq_may_idle").  Update the comment accordingly.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-01 08:57:37 -07:00
Jens Axboe
9e193a06e6 Merge tag 'md-6.19-20251231' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux into block-6.19
Pull MD fixes from Yu Kuai:

"- Fix null-pointer dereference in raid5 sysfs group_thread_cnt store
   (Tuo Li)
 - Fix possible mempool corruption during raid1 raid_disks update via
   sysfs (FengWei Shih)
 - Fix logical_block_size configuration being overwritten during
   super_1_validate() (Li Nan)
 - Fix forward incompatibility with configurable logical block size:
   arrays assembled on new kernels could not be assembled on kernels
   <=6.18 due to non-zero reserved pad rejection (Li Nan)
 - Fix static checker warning about iterator not incremented (Li Nan)"

* tag 'md-6.19-20251231' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux:
  md: Fix forward incompatibility from configurable logical block size
  md: Fix logical_block_size configuration being overwritten
  md: suspend array while updating raid_disks via sysfs
  md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()
  md: Fix static checker warning in analyze_sbs
2025-12-31 06:55:07 -07:00
Cong Zhang
10845a105b blk-mq: skip CPU offline notify on unmapped hctx
If an hctx has no software ctx mapped, blk_mq_map_swqueue() never
allocates tags and leaves hctx->tags NULL. The CPU hotplug offline
notifier can still run for that hctx, so return early, since such an
hctx cannot hold any requests.
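
A minimal sketch of the early-return pattern, assuming simplified
stand-in types: sketch_hctx, sketch_tags and sketch_offline_notify() are
hypothetical names, not the blk-mq structures or the real notifier.

```c
#include <stddef.h>

/* Simplified stand-ins for illustration; not the blk-mq definitions. */
struct sketch_tags {
	int nr_busy;			/* requests currently in flight */
};

struct sketch_hctx {
	struct sketch_tags *tags;	/* NULL when no software ctx is mapped */
};

/* Returns how many in-flight requests would have to be drained. */
static int sketch_offline_notify(const struct sketch_hctx *hctx)
{
	/* An unmapped hctx never had tags allocated, so it cannot hold
	 * requests: bail out before touching hctx->tags. */
	if (!hctx->tags)
		return 0;

	return hctx->tags->nr_busy;
}
```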

Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com>
Fixes: bf0beec060 ("blk-mq: drain I/O when all CPUs in a hctx are offline")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-30 09:02:22 -07:00
Christophe JAILLET
9e371032cb null_blk: Constify struct configfs_item_operations and configfs_group_operations
'struct configfs_item_operations' and 'configfs_group_operations' are not
modified in this driver.

Constifying these structures moves some data to a read-only section,
which increases overall security, especially when the structure holds
some function pointers.

On x86_64, with allmodconfig:
Before:
======
   text	   data	    bss	    dec	    hex	filename
 100263	  37808	   2752	 140823	  22617	drivers/block/null_blk/main.o

After:
=====
   text	   data	    bss	    dec	    hex	filename
 100423	  37648	   2752	 140823	  22617	drivers/block/null_blk/main.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-29 08:19:15 -07:00
Thorsten Blum
e1418af766 brd: replace simple_strtol with kstrtoul in ramdisk_size
Replace simple_strtol() with the recommended kstrtoul() for parsing the
'ramdisk_size=' boot parameter. Unlike simple_strtol(), which returns a
long, kstrtoul() converts the string directly to an unsigned long and
avoids implicit casting.

Check the return value of kstrtoul() and reject invalid values. This
adds error handling while preserving behavior for existing values, and
removes use of the deprecated simple_strtol() helper. The current code
silently sets 'rd_size = 0' if parsing fails, instead of leaving the
default value (CONFIG_BLK_DEV_RAM_SIZE) unchanged.
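
A userspace sketch of the "check the result, keep the default on
failure" pattern. kstrtoul() is kernel-only, so this analogue uses
strtoul() with explicit checks. The function name, the -1/0 return
convention and the default of 4096 are assumptions for illustration,
not the driver's code.

```c
#include <errno.h>
#include <stdlib.h>

/* Stand-in for the configured default; the value is arbitrary here. */
static unsigned long sketch_rd_size = 4096;

/* Returns 0 and updates sketch_rd_size on success; returns -1 and
 * leaves sketch_rd_size untouched if the string is not a clean number. */
static int sketch_parse_ramdisk_size(const char *str)
{
	char *end;
	unsigned long val;

	/* strtoul() would accept leading spaces and signs; reject them. */
	if (str[0] < '0' || str[0] > '9')
		return -1;

	errno = 0;
	val = strtoul(str, &end, 0);
	/* Reject overflow and any trailing garbage such as "12abc". */
	if (errno == ERANGE || *end != '\0')
		return -1;

	sketch_rd_size = val;
	return 0;
}
```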

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-28 15:54:38 -07:00
Tamir Duberstein
4cef2fcda3 rnull: replace kernel::c_str! with C-Strings
C-String literals were added in Rust 1.77. Replace instances of
`kernel::c_str!` with C-String literals where possible.

Signed-off-by: Tamir Duberstein <tamird@gmail.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-28 15:54:38 -07:00
Linus Torvalds
f8f9c1f4d0 Linux 6.19-rc3 (tag: v6.19-rc3)
2025-12-28 13:24:26 -08:00
Linus Torvalds
c875a6c324 Merge tag 'usb-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
 "Here are some small USB fixes, and a bunch of reverts for 6.19-rc3.
  Included in here are:

   - reverts of some typec ucsi driver changes that had a lot of
     regression reports after -rc1. Let's just revert it all for now and
     it will come back in a way that is better tested.

   - other typec bugfixes

   - usb-storage quirk fixups

   - dwc3 driver fix

   - other minor USB fixes for reported problems.

  All of these have passed 0-day testing and individual testing"

* tag 'usb-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (22 commits)
  Revert "usb: typec: ucsi: Update UCSI structure to have message in and message out fields"
  Revert "usb: typec: ucsi: Add support for message out data structure"
  Revert "usb: typec: ucsi: Enable debugfs for message_out data structure"
  Revert "usb: typec: ucsi: Add support for SET_PDOS command"
  Revert "usb: typec: ucsi: Fix null pointer dereference in ucsi_sync_control_common"
  Revert "usb: typec: ucsi: Get connector status after enable notifications"
  usb: ohci-nxp: clean up probe error labels
  usb: gadget: lpc32xx_udc: clean up probe error labels
  usb: ohci-nxp: fix device leak on probe failure
  usb: phy: isp1301: fix non-OF device reference imbalance
  usb: gadget: lpc32xx_udc: fix clock imbalance in error path
  usb: typec: ucsi: Get connector status after enable notifications
  usb: usb-storage: Maintain minimal modifications to the bcdDevice range.
  usb: dwc3: of-simple: fix clock resource leak in dwc3_of_simple_probe
  usb: typec: ucsi: Fix null pointer dereference in ucsi_sync_control_common
  USB: lpc32xx_udc: Fix error handling in probe
  usb: typec: altmodes/displayport: Drop the device reference in dp_altmode_probe()
  usb: phy: fsl-usb: Fix use-after-free in delayed work during device removal
  usb: renesas_usbhs: Fix a resource leak in usbhs_pipe_malloc()
  usb: typec: ucsi: huawei-gaokin: add DRM dependency
  ...
2025-12-28 10:21:47 -08:00
Linus Torvalds
15225b910c Merge tag 'tty-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull serial driver fixes from Greg KH:
 "Here are some small serial driver fixes for some reported issues.
  Included in here are:

   - serial sysfs fwnode fix that was much reported

   - sh-sci driver fix

   - serial device init bugfix

   - 8250 bugfix

   - xilinx_uartps bugfix

  All of these have passed 0-day testing and individual testing"

* tag 'tty-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  serial: xilinx_uartps: fix rs485 delay_rts_after_send
  serial: sh-sci: Check that the DMA cookie is valid
  serial: core: Fix serial device initialization
  serial: 8250: longson: Fix NULL vs IS_ERR() bug in probe
  serial: core: Restore sysfs fwnode information
2025-12-28 10:14:49 -08:00