bio->issue_time_ns is only used by blk-iolatency, which can only be
enabled for rq-based disk, hence it's not necessary to initialize
the time for bio-based disk.
Meanwhile, if bio is split by blk_crypto_fallback_split_bio_if_needed(),
the issue time is not initialized for new split bio, this can be fixed
as well.
Noted the next patch will optimize better that bio issue time will
only be used when blk-iolatency is really enabled by the disk.
Fixes: 488f6682c8 ("block: blk-crypto-fallback for Inline Encryption")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that bio->bi_issue is only used by blk-iolatency to get bio issue
time, replace bio_issue with u64 time directly and remove bio_issue to
make code cleaner.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull MD changes from Yu Kuai:
"Redundant data is used to enhance data fault tolerance, and the storage
method for redundant data vary depending on the RAID levels. And it's
important to maintain the consistency of redundant data.
Bitmap is used to record which data blocks have been synchronized and
which ones need to be resynchronized or recovered. Each bit in the
bitmap represents a segment of data in the array. When a bit is set,
it indicates that the multiple redundant copies of that data segment
may not be consistent. Data synchronization can be performed based on
the bitmap after power failure or readding a disk. If there is no
bitmap, a full disk synchronization is required.
Due to known performance issues with md-bitmap and the unreasonable
implementations:
- self-managed IO submitting like filemap_write_page();
- global spin_lock
I have decided not to continue optimizing based on the current bitmap
implementation, this new bitmap is invented without locking from IO fast
path and can be used with fast disks.
Key features for the new bitmap:
- IO fastpath is lockless, if user issues lots of write IO to the same
bitmap bit in a short time, only the first write has additional
overhead to update bitmap bit, no additional overhead for the
following writes;
- support only resync or recover written data, means in the case
creating new array or replacing with a new disk, there is no need to
do a full disk resync/recovery;"
* tag 'md-6.18-20250909' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux: (24 commits)
md/md-llbitmap: introduce new lockless bitmap
md/md-bitmap: make method bitmap_ops->daemon_work optional
md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
md/md-bitmap: add a new method blocks_synced() in bitmap_operations
md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
md/md-bitmap: delay registration of bitmap_ops until creating bitmap
md/md-bitmap: add a new sysfs api bitmap_type
md: add a new mddev field 'bitmap_id'
md/md-bitmap: support discard for bitmap ops
md: factor out a helper raid_is_456()
md: add a new parameter 'offset' to md_super_write()
md/md-bitmap: introduce CONFIG_MD_BITMAP
md: check before referencing mddev->bitmap_ops
md/dm-raid: check before referencing mddev->bitmap_ops
md/raid5: check before referencing mddev->bitmap_ops
md/raid10: check before referencing mddev->bitmap_ops
md/raid1: check before referencing mddev->bitmap_ops
md/raid1: check bitmap before behind write
md/md-bitmap: handle the case bitmap is not enabled before end_sync()
md/md-bitmap: handle the case bitmap is not enabled before start_sync()
...
We can now safely provide a block device when extracting user pages for
driver and user passthrough commands. Set the bdev so the caller doesn't
have to do that later. This has an additional benefit of being able to
extract P2P pages in the passthrough path.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We only need to consider data and metadata dma mapping types separately.
The request and bio integrity payload have enough flag bits to
internally track the mapping type for each. Use these so the caller
doesn't need to track them, and provide separete request and integrity
helpers to the common code. This will make it easier to scale new
mappings, like the proposed MMIO attribute, without burdening the caller
to track such things.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set the extraction flags to allow p2p pages for the metadata buffer if
the block device allows it. Similar to data payloads, ensure the bio
does not use merging if we see a p2p page.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're checking length and addresses against the same alignment value, so
use the more simple iterator check.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block layer checks all the segments for validity later, so no need
for an early check. Just reduce it to a simple position and total length
check, and defer the more invasive segment checks to the block layer.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block layer checks all the segments for validity later, so no need
for an early check. Just reduce it to a simple position and total length
check, and defer the more invasive segment checks to the block layer.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of ensuring each vector is block size aligned while constructing
the bio, just ensure the entire size is aligned after it's built. This
makes getting bio pages more flexible to accepting device valid io
vectors that would otherwise get rejected by alignment checks.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block layer tries to align bio vectors to the block device's logical
block size. Some cases don't have a block device, or we may need to
align to something larger, which we can't derive it from the queue
limits. Have the caller specify what they want, or allow any length
alignment if nothing was specified. Since the most common use case
relies on the block device's limits, a helper function is provided.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're already iterating every segment, so check these for a valid IO
lengths at the same time. Individual segment lengths will not be checked
on passthrough commands. The read/write command segments must be sized
to the dma alignment.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This patch adds a new WQ_PERCPU flag to explicitly request the use of
the per-CPU behavior. Both flags coexist for one release cycle to allow
callers to transition their calls.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
All existing users have been updated accordingly.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Adding system_dfl_wq to encourage its use when unbound work should be used.
queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq: whether the user still use the old wq a warn will be
printed along with a wq redirect to the new one.
The old system_unbound_wq will be kept for a few release cycles.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Adding system_dfl_wq to encourage its use when unbound work should be used.
queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq: whether the user still use the old wq a warn will be
printed along with a wq redirect to the new one.
The old system_unbound_wq will be kept for a few release cycles.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace kmalloc() followed by copy_from_user() with memdup_user() to
improve and simplify raw_cmd_copyin().
No functional changes intended.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add missing documentation for the tags_srcu member that was introduced
to defer freeing of tags page_list to prevent use-after-free when
iterating tags.
Fixes htmldocs warning:
WARNING: include/linux/blk-mq.h:536 struct member 'tags_srcu' not described in 'blk_mq_tag_set'
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bios are embedded into other structures, and at least spare is unhappy
about embedding structures with variable sized arrays. There's no
real need to the array anyway, we can replace it with a helper pointing
to the memory just behind the bio, and with the previous cleanups there
is very few site doing anything special with it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just a simpler wrapper around bio_init for callers that want to
initialize a bio with inline bvecs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On repeated cold boots we occasionally hit a NULL pointer crash in
blk_should_throtl() when throttling is consulted before the throttle
policy is fully enabled for the queue. Checking only q->td != NULL is
insufficient during early initialization, so blkg_to_pd() for the
throttle policy can still return NULL and blkg_to_tg() becomes NULL,
which later gets dereferenced.
Unable to handle kernel NULL pointer dereference
at virtual address 0000000000000156
...
pc : submit_bio_noacct+0x14c/0x4c8
lr : submit_bio_noacct+0x48/0x4c8
sp : ffff800087f0b690
x29: ffff800087f0b690 x28: 0000000000005f90 x27: ffff00068af393c0
x26: 0000000000080000 x25: 000000000002fbc0 x24: ffff000684ddcc70
x23: 0000000000000000 x22: 0000000000000000 x21: 0000000000000000
x20: 0000000000080000 x19: ffff000684ddcd08 x18: ffffffffffffffff
x17: 0000000000000000 x16: ffff80008132a550 x15: 0000ffff98020fff
x14: 0000000000000000 x13: 1fffe000d11d7021 x12: ffff000688eb810c
x11: ffff00077ec4bb80 x10: ffff000688dcb720 x9 : ffff80008068ef60
x8 : 00000a6fb8a86e85 x7 : 000000000000111e x6 : 0000000000000002
x5 : 0000000000000246 x4 : 0000000000015cff x3 : 0000000000394500
x2 : ffff000682e35e40 x1 : 0000000000364940 x0 : 000000000000001a
Call trace:
submit_bio_noacct+0x14c/0x4c8
verity_map+0x178/0x2c8
__map_bio+0x228/0x250
dm_submit_bio+0x1c4/0x678
__submit_bio+0x170/0x230
submit_bio_noacct_nocheck+0x16c/0x388
submit_bio_noacct+0x16c/0x4c8
submit_bio+0xb4/0x210
f2fs_submit_read_bio+0x4c/0xf0
f2fs_mpage_readpages+0x3b0/0x5f0
f2fs_readahead+0x90/0xe8
Tighten blk_throtl_activated() to also require that the throttle policy
bit is set on the queue:
return q->td != NULL &&
test_bit(blkcg_policy_throtl.plid, q->blkcg_pols);
This prevents blk_should_throtl() from accessing throttle group state
until policy data has been attached to blkgs.
Fixes: a3166c5170 ("blk-throttle: delay initialization until configuration")
Co-developed-by: Liang Jie <liangjie@lixiang.com>
Signed-off-by: Liang Jie <liangjie@lixiang.com>
Signed-off-by: Han Guangjiang <hanguangjiang@lixiang.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When executing modinfo null_blk, there is an error in the description
of module parameter mbps, and the output information of cache_size is
incomplete.The output of modinfo before and after applying this patch
is as follows:
Before:
[...]
parm: cache_size:ulong
[...]
parm: mbps:Cache size in MiB for memory-backed device.
Default: 0 (none) (uint)
[...]
After:
[...]
parm: cache_size:Cache size in MiB for memory-backed device.
Default: 0 (none) (ulong)
[...]
parm: mbps:Limit maximum bandwidth (in MiB/s).
Default: 0 (no limit) (uint)
[...]
Fixes: 058efe000b ("null_blk: add module parameters for 4 options")
Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
around the tag iterators.
This is done by:
- Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
- Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
This change fixes lockup issue in scsi_host_busy() in case of shost->host_blocked.
Also avoids big tags->lock when reading disk sysfs attribute `inflight`.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The freeing of the flush queue/request in blk_mq_exit_hctx() can race with
tag iterators that may still be accessing it. To prevent a potential
use-after-free, the deallocation should be deferred until after a grace
period. With this way, we can replace the big tags->lock in tags iterator
code path with srcu for solving the issue.
This patch introduces an SRCU-based deferred freeing mechanism for the
flush queue.
The changes include:
- Adding a `rcu_head` to `struct blk_flush_queue`.
- Creating a new callback function, `blk_free_flush_queue_callback`,
to handle the actual freeing.
- Replacing the direct call to `blk_free_flush_queue()` in
`blk_mq_exit_hctx()` with `call_srcu()`, using the `tags_srcu`
instance to ensure synchronization with tag iterators.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tag iterators can race with the freeing of the request pages(tags->page_list),
potentially leading to use-after-free issues.
Defer the freeing of the page list and the tags structure itself until
after an SRCU grace period has passed. This ensures that any concurrent
tag iterators have completed before the memory is released. With this
way, we can replace the big tags->lock in tags iterator code path with
srcu for solving the issue.
This is achieved by:
- Adding a new `srcu_struct tags_srcu` to `blk_mq_tag_set` to protect
tag map iteration.
- Adding an `rcu_head` to `struct blk_mq_tags` to be used with
`call_srcu`.
- Moving the page list freeing logic and the `kfree(tags)` call into a
new callback function, `blk_mq_free_tags_callback`.
- In `blk_mq_free_tags`, invoking `call_srcu` to schedule the new
callback for deferred execution.
The read-side protection for the tag iterators will be added in a
subsequent patch.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To prepare for converting the tag->rqs freeing to be SRCU-based, the
tag_set is needed in the freeing helper functions.
This patch adds 'struct blk_mq_tag_set *' as the first parameter to
blk_mq_free_rq_map() and blk_mq_free_tags(), and updates all their call
sites.
This allows access to the tag_set's SRCU structure in the next step,
which will be used to free the tag maps after a grace period.
No functional change is intended in this patch.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move flush queue allocation into blk_mq_init_hctx() and its release into
blk_mq_exit_hctx(), and prepare for replacing tags->lock with SRCU to
draining inflight request walking. blk_mq_exit_hctx() is the last chance
for us to get valid `tag_set` reference, and we need to add one SRCU to
`tag_set` for freeing flush request via call_srcu().
It is safe to move flush queue & request release into blk_mq_exit_hctx(),
because blk_mq_clear_flush_rq_mapping() clears the flush request
reference int driver tags inflight request table, meantime inflight
request walking is drained.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Redundant data is used to enhance data fault tolerance, and the storage
method for redundant data vary depending on the RAID levels. And it's
important to maintain the consistency of redundant data.
Bitmap is used to record which data blocks have been synchronized and which
ones need to be resynchronized or recovered. Each bit in the bitmap
represents a segment of data in the array. When a bit is set, it indicates
that the multiple redundant copies of that data segment may not be
consistent. Data synchronization can be performed based on the bitmap after
power failure or readding a disk. If there is no bitmap, a full disk
synchronization is required.
Due to known performance issues with md-bitmap and the unreasonable
implementations:
- self-managed IO submitting like filemap_write_page();
- global spin_lock
I have decided not to continue optimizing based on the current bitmap
implementation, this new bitmap is invented without locking from IO fast
path and can be used with fast disks.
For designs and details, see the comments in drivers/md-llbitmap.c.
Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-12-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Li Nan <linan122@huawei.com>
daemon_work() will be called by daemon thread, on the one hand, daemon
thread doesn't have strict wake-up time; on the other hand, too much
work are put to daemon thread, like handle sync IO, handle failed
or specail normal IO, handle recovery, and so on. Hence daemon thread
may be too busy to clear dirty bits in time.
Make bitmap_ops->daemon_work() optional and following patches will use
separate async work to clear dirty bits for the new bitmap.
Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-11-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Currently, raid456 must perform a whole array initial recovery to build
initail xor data, then IO to the array won't have to read all the blocks
in underlying disks.
This behavior will affect IO performance a lot, and nowadays there are
huge disks and the initial recovery can take a long time. Hence llbitmap
will support lazy initial recovery in following patches. This method is
used to check if data blocks is synced or not, if not then IO will still
have to read all blocks for raid456.
Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-9-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
behind write rely on bitmap, because the number of IO are recorded in
bitmap->behind_writes, and callers rely on bitmap_wait_behind_writes()
to wait for IO to be done.
However, currently callers doesn't check if bitmap is enabeld before
calling into behind methods. Hence if behind write start without bitmap,
readers will not wait for slow write IO to be done and old data can be
read in some corner cases.
Link: https://lore.kernel.org/linux-raid/20250707012711.376844-10-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>