linux/io_uring
Gabriel Krisman Bertazi 92835cebab io_uring/sqpoll: Increase task_work submission batch size
Our QA team reported a 10%-23% throughput reduction on an io_uring
sqpoll testcase doing IO to a null_blk device, which I traced back to
reduced utilization of the device submission queue depth. It turns out
that, after commit af5d68f889 ("io_uring/sqpoll: manage task_work
privately"), we capped the number of task_work entries that can be
completed from a single spin of sqpoll to only 8 entries, before the
sqpoll goes around to (potentially) sleep.  While this cap doesn't drive
the submission side directly, it impacts completion behavior, which in
turn affects how many IOs fio queues per sqpoll cycle, so io_uring ends
up seeing fewer IOs per cycle on the submission side.  As a result,
block layer plugging is less effective, we see more time spent inside
the block layer in profiling charts, and fio measures increased
submission latency.
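
For context, the capping logic introduced by af5d68f889 in
io_uring/sqpoll.c looks roughly like the sketch below (paraphrased, not
an exact copy of the source; helper names and the exact flow may differ
slightly in the tree):

  /* io_sq_thread() calls this once per loop iteration with
   * max_entries = IORING_TW_CAP_ENTRIES_VALUE.
   */
  #define IORING_TW_CAP_ENTRIES_VALUE  8  /* the cap this patch raises */

  static unsigned int io_sq_tw(struct llist_node **retry_list, int max_entries)
  {
          struct io_uring_task *tctx = current->io_uring;
          unsigned int count = 0;

          if (*retry_list) {
                  /* finish work parked on a previous iteration first */
                  *retry_list = io_handle_tw_list(*retry_list, &count,
                                                  max_entries);
                  if (count >= max_entries)
                          return count;
                  max_entries -= count;
          }
          /* anything beyond the cap is parked on retry_list for the
           * next sqpoll iteration, so completions trickle out at most
           * 8 at a time
           */
          *retry_list = tctx_task_work_run(tctx, max_entries, &count);
          return count;
  }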

There are other places that have increased overhead once sqpoll sleeps
more often, such as the sqpoll utilization calculation.  But, in this
microbenchmark, those were not representative enough in perf charts, and
their removal didn't yield measurable changes in throughput.  The major
overhead comes from the fact that we plug less, and less often, when
submitting to the block layer.

My benchmark is:

fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
    --invalidate=1 --time_based  --ramp_time=10 --group_reporting=1 \
    --filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
    --rw=randread --numjobs=1 --sqthread_poll

On one machine, tested on top of Linux 6.15-rc1, we have the following
baseline:
  READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec

With this patch:
  READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec

which is a 15% improvement in measured bandwidth.  The average
submission latency is noticeably lowered too.  As measured by
fio:

Baseline:
   lat (usec): min=20, max=241, avg=99.81, stdev=3.38
Patched:
   lat (usec): min=26, max=226, avg=86.48, stdev=4.82

Looking at blktrace, we can also see that the plugging behavior is
improved.  In the baseline, we end up limited to plugging 8 requests in
the block layer regardless of the device queue depth, while after
patching we can drive more IO and manage to utilize the full device
queue.

In the baseline, after a stabilization phase, an ordinary submission
looks like this (the trailing number on the U, i.e. unplug, event is
the number of requests dispatched per unplug):
  254,0    1    49942     0.016028795  5977  U   N [iou-sqp-5976] 7

After patching, I see consistently more requests per unplug:
  254,0    1     4996     0.001432872  3145  U   N [iou-sqp-3144] 32

Ideally, the cap would be at least deep enough to fill the device
queue, but we can't predict that behavior, or assume all IO goes to a
single device, and thus can't guess the ideal batch size.  We also
don't want to let the task_work run unbounded, though I'm not sure it
would really be a problem.  Instead, let's just give it a more sensible
value that allows for more efficient batching.  I tested with different
cap values and initially proposed increasing the cap to 1024.  Jens
argued that was too big of a bump, and I found that with 32 I can no
longer observe this bottleneck on any of my machines.
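
The change itself is a one-liner; a sketch of the intended shape,
assuming the cap is the IORING_TW_CAP_ENTRIES_VALUE define in
io_uring/sqpoll.c:

  /* io_uring/sqpoll.c: per-iteration task_work cap, previously 8 */
  #define IORING_TW_CAP_ENTRIES_VALUE  32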

Fixes: af5d68f889 ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20250508181203.3785544-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 07:56:53 -06:00
advise.c io_uring/advise: support 64-bit lengths 2024-06-16 14:54:55 -06:00
advise.h
alloc_cache.c io_uring: add alloc_cache.c 2025-01-28 15:10:40 -07:00
alloc_cache.h io_uring/net: convert to struct iou_vec 2025-03-07 13:41:08 -07:00
cancel.c io_uring/cancel: add generic cancel helper 2025-02-17 05:34:45 -07:00
cancel.h io_uring/cancel: add generic cancel helper 2025-02-17 05:34:45 -07:00
epoll.c io_uring/epoll: add support for IORING_OP_EPOLL_WAIT 2025-02-20 07:59:56 -07:00
epoll.h io_uring/epoll: add support for IORING_OP_EPOLL_WAIT 2025-02-20 07:59:56 -07:00
eventfd.c io_uring/eventfd: ensure io_eventfd_signal() defers another RCU period 2025-01-09 07:16:45 -07:00
eventfd.h io_uring/eventfd: move eventfd handling to separate file 2024-06-16 14:54:55 -06:00
fdinfo.c io_uring/fdinfo: annotate racy sq/cq head/tail reads 2025-04-30 07:17:17 -06:00
fdinfo.h
filetable.c io_uring: cache nodes and mapped buffers 2025-02-28 07:05:46 -07:00
filetable.h io_uring/rsrc: pass 'struct io_ring_ctx' reference to rsrc helpers 2024-11-07 15:24:33 -07:00
fs.c io_uring/fs: consider link->flags when getting path for LINKAT 2023-11-20 09:01:42 -07:00
fs.h
futex.c io_uring: introduce io_cache_free() helper 2025-03-05 07:38:55 -07:00
futex.h io_uring: move cancelations to be io_uring_task based 2024-11-06 13:55:38 -07:00
io_uring.c io_uring: ensure deferred completions are flushed for multishot 2025-05-07 07:55:15 -06:00
io_uring.h io_uring: don't pass ctx to tw add remote helper 2025-03-28 17:14:01 -06:00
io-wq.c Merge branch 'io_uring-6.14' into for-6.15/io_uring 2025-02-27 07:18:01 -07:00
io-wq.h io_uring/io-wq: cache work->flags in variable 2025-02-17 05:34:45 -07:00
kbuf.c io_uring/kbuf: reject zero sized provided buffers 2025-04-07 07:51:23 -06:00
kbuf.h io_uring/kbuf: uninline __io_put_kbufs 2025-02-17 05:34:45 -07:00
Kconfig io_uring: make zcrx depend on CONFIG_IO_URING 2025-03-31 07:07:44 -06:00
Makefile io_uring/epoll: remove CONFIG_EPOLL guards 2025-02-20 07:59:56 -07:00
memmap.c io_uring: add missing IORING_MAP_OFF_ZCRX_REGION in io_uring_mmap 2025-02-21 09:40:09 -07:00
memmap.h io_uring/zcrx: add interface queue and refill queue 2025-02-17 05:41:03 -07:00
msg_ring.c io_uring: don't pass ctx to tw add remote helper 2025-03-28 17:14:01 -06:00
msg_ring.h io_uring/msg_ring: Drop custom destructor 2024-12-27 10:08:21 -07:00
napi.c net: use napi_id_valid helper 2025-02-17 16:43:04 -08:00
napi.h io_uring/napi: add static napi tracking strategy 2024-11-06 13:55:38 -07:00
net.c io_uring/net: avoid import_ubuf for regvec send 2025-03-31 12:41:49 -06:00
net.h io_uring/net: convert to struct iou_vec 2025-03-07 13:41:08 -07:00
nop.c io_uring/nop: use io_find_buf_node() 2025-02-28 19:35:37 -07:00
nop.h
notif.c io_uring: introduce type alias for io_tw_state 2025-02-17 05:34:50 -07:00
notif.h io_uring/notif: implement notification stacking 2024-04-22 19:31:18 -06:00
opdef.c for-6.15/io_uring-epoll-wait-20250325 2025-03-28 14:55:32 -07:00
opdef.h io_uring: rearrange opdef flags by use pattern 2025-02-27 07:27:56 -07:00
openclose.c io_uring: enable audit and restrict cred override for IORING_OP_FIXED_FD_INSTALL 2024-01-23 15:25:14 -07:00
openclose.h io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL 2023-12-12 07:42:57 -07:00
poll.c io_uring: unify STOP_MULTISHOT with IOU_OK 2025-03-10 07:14:18 -06:00
poll.h io_uring: introduce type alias for io_tw_state 2025-02-17 05:34:50 -07:00
refs.h io_uring: always do atomic put from iowq 2025-04-03 08:31:57 -06:00
register.c io_uring/zcrx: add interface queue and refill queue 2025-02-17 05:41:03 -07:00
register.h io_uring: temporarily disable registered waits 2024-11-15 09:58:34 -07:00
rsrc.c io_uring/rsrc: ensure segments counts are correct on kbuf buffers 2025-04-17 11:59:12 -06:00
rsrc.h for-6.15/io_uring-rx-zc-20250325 2025-03-28 13:45:52 -07:00
rw.c for-6.15/io_uring-reg-vec-20250327 2025-03-28 15:07:04 -07:00
rw.h io_uring/rw: implement vectored registered rw 2025-03-07 09:07:29 -07:00
slist.h io_uring: silence variable ‘prev’ set but not used warning 2023-03-09 10:10:58 -07:00
splice.c io_uring/rsrc: avoid NULL check in io_put_rsrc_node() 2025-02-17 05:34:46 -07:00
splice.h io_uring/splice: open code 2nd direct file assignment 2024-10-29 13:43:28 -06:00
sqpoll.c io_uring/sqpoll: Increase task_work submission batch size 2025-05-09 07:56:53 -06:00
sqpoll.h io_uring/sqpoll: statistics of the true utilization of sq threads 2024-03-01 06:28:19 -07:00
statx.c io_statx_prep(): use getname_uflags() 2024-11-13 11:44:30 -05:00
statx.h
sync.c io_uring: for requests that require async, force it 2023-01-29 15:18:26 -07:00
sync.h
tctx.c io_uring/tctx: work around xa_store() allocation error issue 2024-11-29 07:20:28 -07:00
tctx.h io_uring: simplify __io_uring_add_tctx_node 2022-10-07 12:25:30 -06:00
timeout.c for-6.15/io_uring-20250322 2025-03-26 17:56:00 -07:00
timeout.h io_uring: move cancelations to be io_uring_task based 2024-11-06 13:55:38 -07:00
truncate.c io_uring: add support for ftruncate 2024-02-09 09:04:39 -07:00
truncate.h io_uring: add support for ftruncate 2024-02-09 09:04:39 -07:00
uring_cmd.c io_uring: cleanup {g,s]etsockopt sqe reading 2025-03-31 07:08:46 -06:00
uring_cmd.h io_uring: hide caches sqes from drivers 2025-03-31 07:08:34 -06:00
waitid.c io_uring/waitid: use io_is_compat() 2025-02-24 12:10:38 -07:00
waitid.h io_uring: move cancelations to be io_uring_task based 2024-11-06 13:55:38 -07:00
xattr.c replace do_getxattr() with saner helpers. 2024-11-06 12:59:39 -05:00
xattr.h
zcrx.c io_uring/zcrx: fix late dma unmap for a dead dev 2025-04-18 06:12:10 -06:00
zcrx.h io_uring/zcrx: fix late dma unmap for a dead dev 2025-04-18 06:12:10 -06:00