Commit Graph

623 Commits

Author SHA1 Message Date
Jens Axboe
079afb081c io_uring/futex: mark wait requests as inflight
Inflight marking is used so that do_exit() -> io_uring_files_cancel()
will find requests with files that reference an io_uring instance,
so they can get appropriately canceled before the files go away.
However, it's also called before the mm goes away.

Mark futex/futexv wait requests as being inflight, so that
io_uring_files_cancel() will prune them. This ensures that the mm stays
alive, which is important as an exiting mm will also free the futex
private hash buckets. An io_uring futex request with FUTEX2_PRIVATE
set relies on those being alive until the request has completed. A
recent commit added these futex private hashes, which get killed when
the mm goes away.

Fixes: 80367ad01d ("futex: Add basic infrastructure for local task local hash")
Link: https://lore.kernel.org/io-uring/38053.1749045482@localhost/
Reported-by: Robert Morris <rtm@csail.mit.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-04 10:50:14 -06:00
Linus Torvalds
49fffac983 for-6.16/io_uring-20250523
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgwnDgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgHZEADA1ym0ihHRjU2kTlXXOdkLLOl+o1RCHUjr
 KNf6sELGgyDC5FL/hAWdsjonInY4MLbJW0eNHEuuK8iFcn3wSHuHPXhRJXx/4cOs
 GGVLTd+Jm8ih4UL/GeLrBe3ehW9UUOtz1TCYzho0bdXHQWjruCFTqB5OzPQFMGQW
 R/lwXVNfjgGno5JhBnsrwz3ZnAfAnJhxqmc0GFHaa/nVF1OREYW/HS75EPFNiFgp
 Aevilw5QyrA2gDlZ+zCUwaGKAEl32yZCI6LZpI4kMtPK1reEbgFTrzIaCZ/OZCYM
 DVdBVEeuOmcBYIKbitD/+fcLNXHMrSJSWvUSXR4GuRNVkCTIAcEMKM2bX8VY7gmJ
 7ZQIo0EL2mSwmewHIYnvf9w/qrNYR0NyUt2v4U4rA2wj6e5w1EYMriP94wKdBGvD
 RNxja429N3fg3aBIkdQ6iYSVJRgE7DCo7dnKrEqglZPb32LOiNoOoou9shI5tb25
 8X7u0HzbpwKY/XByXZ2IaX7PYK2iFqkJjFYlGehtF97W85LGEvkDFU6fcBdjBO8r
 umgeE5O+lR+cf68JTJ6P34A7bBg71AXO3ytIuWunG56/0yu/FHDCjhBWE5ZjEhGR
 u2YhAGPRDQsJlSlxx8TXoKyYWP55NqdeyxYrmku/fZLn5WNVXOFeRlUDAZsF7mU7
 nuiOt9j4WA==
 =k8SF
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16/io_uring-20250523' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Avoid indirect function calls in io-wq for executing and freeing
   work.

   The design of io-wq is such that it can be a generic mechanism, but
   as it's only used by io_uring now, we may as well avoid these indirect
   calls

 - Clean up registered buffers for networking

 - Add support for IORING_OP_PIPE. Pretty straightforward, allows
   creating pipes with io_uring, particularly useful for having these be
   instantiated as direct descriptors

 - Clean up the coalescing support for registered buffers

 - Add support for multiple interface queues for zero-copy rx
   networking. When this feature was merged for 6.15, it supported just
   a single ifq per ring

 - Clean up the eventfd support

 - Add dma-buf support to zero-copy rx

 - Clean up and improve the request draining support

 - Clean up provided buffer support, most notably with an eye toward
   making the legacy support less intrusive

 - Minor fdinfo cleanups, dropping support for dumping what credentials
   are registered

 - Improve support for overflow CQE handling, getting rid of GFP_ATOMIC
   for allocating overflow entries where possible

 - Improve detection of cases where io-wq doesn't need to spawn a new
   worker

 - Various little cleanups

* tag 'for-6.16/io_uring-20250523' of git://git.kernel.dk/linux: (59 commits)
  io_uring/cmd: warn on reg buf imports by ineligible cmds
  io_uring/io-wq: only create a new worker if it can make progress
  io_uring/io-wq: ignore non-busy worker going to sleep
  io_uring/io-wq: move hash helpers to the top
  trace/io_uring: fix io_uring_local_work_run ctx documentation
  io_uring: finish IOU_OK -> IOU_COMPLETE transition
  io_uring: add new helpers for posting overflows
  io_uring: pass in struct io_big_cqe to io_alloc_ocqe()
  io_uring: make io_alloc_ocqe() take a struct io_cqe pointer
  io_uring: split alloc and add of overflow
  io_uring: open code io_req_cqe_overflow()
  io_uring/fdinfo: get rid of dumping credentials
  io_uring/fdinfo: only compile if CONFIG_PROC_FS is set
  io_uring/kbuf: unify legacy buf provision and removal
  io_uring/kbuf: refactor __io_remove_buffers
  io_uring/kbuf: don't compute size twice on prep
  io_uring/kbuf: drop extra vars in io_register_pbuf_ring
  io_uring/kbuf: use mem_is_zero()
  io_uring/kbuf: account ring io_buffer_list memory
  io_uring: drain based on allocated reqs
  ...
2025-05-26 12:13:22 -07:00
Linus Torvalds
6f59de9bc0 for-6.16/block-20250523
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgwnGYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq9aD/4iqOts77xhWWLrOJWkkhOcV5rREeyppq8X
 MKYul9S4cc4Uin9Xou9a+nab31QBQEk3nsN3kX9o3yAXvkh6yUm36HD8qYNW/46q
 IUkwRQQJ0COyTnexMZQNTbZPQDIYcenXmQxOcrEJ5jC1Jcz0sOKHsgekL+ab3kCy
 fLnuz2ozvjGDMala/NmE8fN5qSlj4qQABHgbamwlwfo4aWu07cwfqn5G/FCYJgDO
 xUvsnTVclom2g4G+7eSSvGQI1QyAxl5QpviPnj/TEgfFBFnhbCSoBTEY6ecqhlfW
 6u59MF/Uw8E+weiuGY4L87kDtBhjQs3UMSLxCuwH7MxXb25ff7qB4AIkcFD0kKFH
 3V5NtwqlU7aQT0xOjGxaHhfPwjLD+FVss4ARmuHS09/Kn8egOW9yROPyetnuH84R
 Oz0Ctnt1IPLFjvGeg3+rt9fjjS9jWOXLITb9Q6nX9gnCt7orCwIYke8YCpmnJyhn
 i+fV4CWYIQBBRKxIT0E/GhJxZOmL0JKpomnbpP2dH8npemnsTCuvtfdrK9gfhH2X
 chBVqCPY8MNU5zKfzdEiavPqcm9392lMzOoOXW2pSC1eAKqnAQ86ZT3r7rLntqE8
 75LxHcvaQIsnpyG+YuJVHvoiJ83TbqZNpyHwNaQTYhDmdYpp2d/wTtTQywX4DuXb
 Y6NDJw5+kQ==
 =1PNK
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - ublk updates:
      - Add support for updating the size of a ublk instance
      - Zero-copy improvements
      - Auto-registering of buffers for zero-copy
      - Series simplifying and improving GET_DATA and request lookup
      - Series adding quiesce support
      - Lots of selftests additions
      - Various cleanups

 - NVMe updates via Christoph:
      - add per-node DMA pools and use them for PRP/SGL allocations
        (Caleb Sander Mateos, Keith Busch)
      - nvme-fcloop refcounting fixes (Daniel Wagner)
      - support delayed removal of the multipath node and optionally
        support the multipath node for private namespaces (Nilay Shroff)
      - support shared CQs in the PCI endpoint target code (Wilfred
        Mallawa)
      - support admin-queue only authentication (Hannes Reinecke)
      - use the crc32c library instead of the crypto API (Eric Biggers)
      - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
        Reinecke, Leon Romanovsky, Gustavo A. R. Silva)

 - MD updates via Yu:
      - Fix normal IO being starved by sync IO, found by running mkfs on
        a newly created large raid5, with some cleanup patches for bdev
        inflight counters

 - Clean up brd, getting rid of atomic kmaps and bvec poking

 - Add loop driver specifically for zoned IO testing

 - Eliminate blk-rq-qos calls with a static key, if not enabled

 - Improve hctx locking for when a plug has IO for multiple queues
   pending

 - Remove block layer bouncing support, which in turn means we can
   remove the per-node bounce stat as well

 - Improve blk-throttle support

 - Improve delay support for blk-throttle

 - Improve brd discard support

 - Unify IO scheduler switching. This should also fix a bunch of lockdep
   warnings we've been seeing, after enabling lockdep support for queue
   freezing/unfreezing

 - Add support for block write streams via FDP (flexible data placement)
   on NVMe

 - Add a bunch of block helpers, facilitating the removal of a bunch of
   duplicated boilerplate code

 - Remove obsolete BLK_MQ pci and virtio Kconfig options

 - Add atomic/untorn write support to blktrace

 - Various little cleanups and fixes

* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
  selftests: ublk: add test for UBLK_F_QUIESCE
  ublk: add feature UBLK_F_QUIESCE
  selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
  traceevent/block: Add REQ_ATOMIC flag to block trace events
  ublk: run auto buf unregistering in same io_ring_ctx with registering
  io_uring: add helper io_uring_cmd_ctx_handle()
  ublk: remove io argument from ublk_auto_buf_reg_fallback()
  ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
  selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
  selftests: ublk: support UBLK_F_AUTO_BUF_REG
  ublk: support UBLK_AUTO_BUF_REG_FALLBACK
  ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
  ublk: prepare for supporting to register request buffer automatically
  ublk: convert to refcount_t
  selftests: ublk: make IO & device removal test more stressful
  nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  nvme: introduce multipath_always_on module param
  nvme-multipath: introduce delayed removal of the multipath head node
  nvme-pci: derive and better document max segments limits
  nvme-pci: use struct_size for allocation struct nvme_dev
  ...
2025-05-26 11:39:36 -07:00
Jens Axboe
8bb9d6ccd3 io_uring: finish IOU_OK -> IOU_COMPLETE transition
IOU_COMPLETE is more descriptive, in that it explicitly says that the
return value means "please post a completion for this request". This
patch completes the transition from IOU_OK to IOU_COMPLETE, replacing
existing IOU_OK users.

This is a purely mechanical change.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-21 08:41:16 -06:00
Pavel Begunkov
a7d755ed9c io_uring: fix overflow resched cqe reordering
Leaving the CQ critical section in the middle of an overflow flush
can cause CQE reordering, since the cached CQ pointers are reset and any
new CQE emitters that might get called in between are not going to be
forced into io_cqe_cache_refill().

Fixes: eac2ca2d68 ("io_uring: check if we need to reschedule during overflow flush")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/90ba817f1a458f091f355f407de1c911d2b93bbf.1747483784.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-21 07:01:54 -06:00
Jens Axboe
f660fd2ca1 io_uring: add new helpers for posting overflows
Add two helpers, one for posting overflows for lockless_cq rings, and
one for non-lockless_cq rings. The former can allocate sanely with
GFP_KERNEL, but needs to grab the completion lock for posting, while the
latter must do non-sleeping allocs as it already holds the completion
lock.

While at it, mark the overflow handling functions as __cold as well, as
they should not generally be called during normal operations of the
ring.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-17 18:47:19 -06:00
Jens Axboe
c80bdb1c55 io_uring: pass in struct io_big_cqe to io_alloc_ocqe()
Rather than pass extra1/extra2 separately, just pass in the (now) named
io_big_cqe struct instead. The callers that don't use/support CQE32 will
now just pass a single NULL, rather than two separate mystery zero
values.

Move the clearing of the big_cqe elements into io_alloc_ocqe() as well,
so it can get moved out of the generic code.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-17 18:47:18 -06:00
Jens Axboe
072d37b52c io_uring: make io_alloc_ocqe() take a struct io_cqe pointer
The number of arguments to io_alloc_ocqe() is a bit unwieldy. Make it
take a struct io_cqe pointer rather than three separate CQE args. One
path already has that readily available, add an io_init_cqe() helper for
the remaining two.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-17 18:46:53 -06:00
Jens Axboe
10f466abc4 io_uring: split alloc and add of overflow
Add a new helper, io_alloc_ocqe(), that simply allocates and fills an
overflow entry. Then it can get done outside of the locking section,
and hence use more appropriate gfp_t allocation flags rather than
always defaulting to GFP_ATOMIC.

Inspired by a previous series from Pavel:

https://lore.kernel.org/io-uring/cover.1747209332.git.asml.silence@gmail.com/
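
As an illustration of the pattern described above, a minimal sketch with
hypothetical names (not the actual io_uring code): the entry is allocated
outside the lock, so a sleeping allocation can be used, and the completion
lock is only taken for linking it in.

  #include <linux/list.h>
  #include <linux/slab.h>
  #include <linux/spinlock.h>
  #include <linux/types.h>

  struct my_ocqe {
      struct list_head list;
      u64 user_data;
      s32 res;
  };

  struct my_ring {
      spinlock_t completion_lock;
      struct list_head overflow_list;
  };

  static bool my_add_overflow(struct my_ring *ring, u64 user_data, s32 res)
  {
      struct my_ocqe *ocqe;

      /* No locks held yet, so a sleeping GFP_KERNEL allocation is fine. */
      ocqe = kmalloc(sizeof(*ocqe), GFP_KERNEL);
      if (!ocqe)
          return false;
      ocqe->user_data = user_data;
      ocqe->res = res;

      /* Only the list insertion needs the completion lock. */
      spin_lock(&ring->completion_lock);
      list_add_tail(&ocqe->list, &ring->overflow_list);
      spin_unlock(&ring->completion_lock);
      return true;
  }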

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-17 18:46:50 -06:00
Pavel Begunkov
5288b9e28f io_uring: open code io_req_cqe_overflow()
A preparation patch, just open code io_req_cqe_overflow().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-16 12:38:36 -06:00
Jens Axboe
3de7361f7c Merge branch 'io_uring-6.15' into for-6.16/io_uring
Merge in 6.15 io_uring fixes, mostly so that the fdinfo changes can
get easily extended without causing merge conflicts.

* io_uring-6.15:
  io_uring/fdinfo: grab ctx->uring_lock around io_uring_show_fdinfo()
  io_uring/memmap: don't use page_address() on a highmem page
  io_uring/uring_cmd: fix hybrid polling initialization issue
  io_uring/sqpoll: Increase task_work submission batch size
  io_uring: ensure deferred completions are flushed for multishot
  io_uring: always arm linked timeouts prior to issue
  io_uring/fdinfo: annotate racy sq/cq head/tail reads
  io_uring: fix 'sync' handling of io_fallback_tw()
  io_uring: don't duplicate flushing in io_req_post_cqe
2025-05-16 12:31:19 -06:00
Pavel Begunkov
8fb7aee055 io_uring: drain based on allocated reqs
Don't rely on CQ sequence numbers for draining, as it has become messy
and needs cq_extra adjustments. Instead, base it on the number of
allocated requests and only allow flushing when all requests are in the
drain list.

As a result, cq_extra is gone, there is no overhead for its accounting
in aux CQE posting, and there is less bloat, as it was inlined before.
It's also generally simpler than trying to track where cq_extra should
be bumped and where it should be put back, as in the overflow cases.
It'll likely also help with cleaning up and unifying some of the CQ
posting helpers.
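
To illustrate the gating this describes, here is a minimal sketch with
hypothetical names (not the actual implementation): the ring tracks how
many requests are currently allocated, and the drain list is only flushed
once every allocated request is parked in it.

  /* Hypothetical sketch of the drain gating, not the real io_uring code. */
  struct my_ctx {
      unsigned int nr_req_allocated;  /* allocated and not yet freed */
      unsigned int nr_drain_queued;   /* requests parked in the drain list */
  };

  static bool my_drain_flush_ready(const struct my_ctx *ctx)
  {
      /* Nothing else is in flight that could complete out of order. */
      return ctx->nr_drain_queued == ctx->nr_req_allocated;
  }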

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/46ece1e34320b046c06fee2498d6b4cd12a700f2.1746788718.git.asml.silence@gmail.com
Link: https://lore.kernel.org/r/24497b04b004bceada496033d3c9d09ff8e81ae9.1746944903.git.asml.silence@gmail.com
[axboe: fold in fix from link2]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-12 07:52:52 -06:00
Pavel Begunkov
63de899cb6 io_uring: count allocated requests
Keep track of the number of requests a ring currently has allocated
(and not freed); it'll be needed in the next patch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c8f8308294dc2a1cb8925d984d937d4fc14ab5d4.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:02 -06:00
Pavel Begunkov
b0c8a6401f io_uring: open code io_account_cq_overflow()
io_account_cq_overflow() doesn't help explain what's going on in
there, and it'll become even smaller with the following patches, so open
code it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e4333fa0d371f519e52a71148ebdffed4b8d3aa9.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:02 -06:00
Pavel Begunkov
19a94da447 io_uring: consolidate drain seq checking
We check sequences both when queuing drained requests and when flushing
them. Instead, always queue and immediately try to flush, so that all
seq handling can be kept contained in the flushing code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d4651f742e671af5b3216581e539ea5d31bc7125.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:02 -06:00
Pavel Begunkov
e91e4f692f io_uring: remove drain prealloc checks
Currently io_drain_req() has two steps. The first is fast path checking
sequence numbers. The second is allocations, rechecking and actual
queuing. Further simplify it by removing the first step.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d06e89ed07611993d7bf89182de2300858379bd.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:01 -06:00
Pavel Begunkov
05b334110f io_uring: simplify drain ret passing
"ret" in io_drain_req() is only used in one place, remove it and pass
-ENOMEM directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ece724b77e66e6caabcc215e0032ee7ff140f289.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:01 -06:00
Pavel Begunkov
fde04c7e27 io_uring: fix spurious drain flushing
io_queue_deferred() is not tolerant of spurious calls that don't
complete some requests. You can have an inflight drain-marked request and another
request that came after and got queued into the drain list. Now, if
io_queue_deferred() is called before the first request completes, it'll
check the 2nd req with req_need_defer(), find that there is no drain
flag set, and queue it for execution.

To make io_queue_deferred() work, it should at least check sequences for
the first request, and then we also need to check if there is another
drain request creating another bubble.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/972bde11b7d4ef25b3f5e3fd34f80e4d2aa345b8.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:01 -06:00
Pavel Begunkov
f979c20547 io_uring: account drain memory to cgroup
Account drain allocations against memcg. It's not a big problem as each
such allocation is paired with a request, which is accounted, but it's
nicer to follow the limits more closely.

Cc: stable@vger.kernel.org # 6.1
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f8dfdbd755c41fd9c75d12b858af07dfba5bbb68.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:01 -06:00
Pavel Begunkov
81a22c86ec io_uring: add lockdep asserts to io_add_aux_cqe
io_add_aux_cqe() can only be called for rings with uring_lock-protected
completion queues; add a couple of lockdep assertions to that effect.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c010eab7b94a187c00a9d46d8b67bf7fcad18af4.1746788592.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 07:58:55 -06:00
Jens Axboe
687b2bae0e io_uring: ensure deferred completions are flushed for multishot
Multishot normally uses io_req_post_cqe() to post completions, but when
stopping it, it may finish up with a deferred completion. This is fine,
except if another multishot event triggers before the deferred completions
get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
as new multishot completions get posted before the deferred ones are
flushed. This can cause confusion on the application side, if strict
ordering is required for the use case.

When posting multishot completions via io_req_post_cqe(), flush any
pending deferred completions first.

Cc: stable@vger.kernel.org # 6.1+
Reported-by: Norman Maurer <norman_maurer@apple.com>
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:55:15 -06:00
Pavel Begunkov
35adea1d01 io_uring: move io_req_put_rsrc_nodes()
It'd be nice to hide the details of how rsrc nodes are used by a request
from rsrc.c, specifically which request fields store them and which bits
signify whether a request has a node. That rather belongs to generic
request handling, so move the helper to io_uring.c. While doing so,
remove the clearing of ->buf_node, as it's controlled by REQ_F_BUF_NODE
and doesn't require zeroing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bb73fb42baf825edb39344365aff48cdfdd4c692.1746533789.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 10:11:23 -06:00
Pavel Begunkov
9c2ff3f9b5 io_uring: remove io_preinit_req()
Apart from setting ->ctx, io_preinit_req() zeroes a bunch of fields of a
request, of which only ->file_node is mandatory. Remove the function
and zero the entire request on first allocation. With that, we also need
to initialise ->ctx every time, which might be a good thing for
performance, as we're now likely overwriting the entire cache line, so
the stores can be write combined and avoid RMW.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ba5485dc913f1e275862ce88f5169d4ac4a33836.1746533807.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 10:11:23 -06:00
Keith Busch
02040353f4 io_uring: enable per-io write streams
Allow userspace to pass a per-I/O write stream in the SQE:

      __u8 write_stream;

The __u8 type matches the size the filesystems and block layer support.

Applications can query the supported values from the block device's
max_write_streams sysfs attribute. Unsupported values are ignored by
file operations that do not support write streams, or rejected with an
error by those that do.
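
As a hedged userspace sketch (not part of this commit): with liburing, the
new field can be filled in directly on the SQE after preparing a normal
write. The write_stream field name comes from the hunk quoted above; the
helper name below is just illustrative.

  #include <liburing.h>

  static int queue_stream_write(struct io_uring *ring, int fd,
                                const void *buf, unsigned int len,
                                __u64 offset, __u8 stream)
  {
      struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

      if (!sqe)
          return -1;
      io_uring_prep_write(sqe, fd, buf, len, offset);
      sqe->write_stream = stream;   /* the new per-IO field from this series */
      return io_uring_submit(ring);
  }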

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-7-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 07:46:43 -06:00
Jens Axboe
b53e523261 io_uring: always arm linked timeouts prior to issue
There are a few spots where linked timeouts are armed, and not all of
them adhere to the pre-arm, attempt issue, post-arm pattern. This can
be problematic if the linked request returns that it will trigger a
callback later, and does so before the linked timeout is fully armed.

Consolidate all the linked timeout handling into __io_issue_sqe(),
rather than have it spread throughout the various issue entry points.

Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/1390
Reported-by: Chase Hiltz <chase@path.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-04 09:15:58 -06:00
Pavel Begunkov
27d2fed790 io_uring: delete misleading comment in io_fill_cqe_aux()
io_fill_cqe_aux() doesn't overflow completions; however, it might fail
them and let the caller handle it. Remove the comment, which doesn't
make any sense.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/021aa8c1d8f20ef2b66da6aeabb6b511938fd2c5.1745843119.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-28 11:51:31 -06:00
Jens Axboe
edd43f4d6f io_uring: fix 'sync' handling of io_fallback_tw()
A previous commit added a 'sync' parameter to io_fallback_tw() which,
if true, means the caller wants to wait on the fallback thread handling
it. But the logic is somewhat messed up; ensure that ctxs are swapped
and flushed appropriately.

Cc: stable@vger.kernel.org
Fixes: dfbe5561ae ("io_uring: flush offloaded and delayed task_work on exit")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 10:32:43 -06:00
Pavel Begunkov
62f666df76 io_uring/eventfd: dedup signalling helpers
Consolidate io_eventfd_flush_signal() and io_eventfd_signal(). Not much
of a difference for now, but it prepares for the following changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5beecd4da65d8d2d83df499196f84b329387f6a2.1745493845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 08:33:54 -06:00
Pavel Begunkov
5e16f1a68d io_uring: don't duplicate flushing in io_req_post_cqe
io_req_post_cqe() sets submit_state.cq_flush so that
*flush_completions() can take care of batch committing CQEs. Don't commit
it twice by using __io_cq_unlock_post().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/41c416660c509cee676b6cad96081274bcb459f3.1745493861.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 06:28:43 -06:00
Pavel Begunkov
76f1cc98b2 io_uring/zcrx: add support for multiple ifqs
Allow the user to register multiple ifqs / zcrx contexts. With that we
can use multiple interfaces / interface queues in a single io_uring
instance.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/668b03bee03b5216564482edcfefbc2ee337dd30.1745141261.git.asml.silence@gmail.com
[axboe: fold in fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-23 07:13:14 -06:00
Jens Axboe
8a2dacd49f io_uring/rsrc: remove node assignment helpers
There are two helpers here, one assigns and increments the node ref
count, and the other is simply a wrapper around that for the buffer node
handling.

The buffer node assignment benefits from checking and setting
REQ_F_BUF_NODE together; otherwise, stalls have been observed when
setting that flag later in the process. Hence re-do it so that it's set
when checked, and cleared in case of (unlikely) failure. With that, the
buffer node helper can go, and then the generic
io_req_assign_rsrc_node() helper can be dropped as well, as there's only
a single user of it left at that point.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:06:58 -06:00
Caleb Sander Mateos
9fe99eed91 io_uring/wq: avoid indirect do_work/free_work calls
struct io_wq stores do_work and free_work function pointers which are
called on each work item. But these function pointers are always set to
io_wq_submit_work and io_wq_free_work, respectively. So remove these
function pointers and just call the functions directly.
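
A simplified before/after sketch of the change (types and names here are
illustrative stand-ins, not the actual io-wq code):

  struct work_item;

  void handle_work(struct work_item *work);    /* stands in for io_wq_submit_work */
  void release_work(struct work_item *work);   /* stands in for io_wq_free_work */

  /* Before: the wq stores pointers that always point at the same two
   * functions, so every work item pays for two indirect calls. */
  struct wq_ops {
      void (*do_work)(struct work_item *work);
      void (*free_work)(struct work_item *work);
  };

  static void run_indirect(const struct wq_ops *ops, struct work_item *work)
  {
      ops->do_work(work);
      ops->free_work(work);
  }

  /* After: io_uring is the only user, so call the functions directly and
   * drop the stored pointers along with their indirect-call overhead. */
  static void run_direct(struct work_item *work)
  {
      handle_work(work);
      release_work(work);
  }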

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250329161527.3281314-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:06:58 -06:00
Pavel Begunkov
390513642e io_uring: always do atomic put from iowq
io_uring always switches requests to atomic refcounting for iowq
execution before there is any parallelism, by setting REQ_F_REFCOUNT,
and the flag is not cleared until the request completes. That should be
fine as long as the compiler doesn't make up a non-existing value for
the flags; however, KCSAN still complains when the request owner changes
other flag bits:

BUG: KCSAN: data-race in io_req_task_cancel / io_wq_free_work
...
read to 0xffff888117207448 of 8 bytes by task 3871 on cpu 0:
 req_ref_put_and_test io_uring/refs.h:22 [inline]

Skip REQ_F_REFCOUNT checks for iowq, we know it's set.

Reported-by: syzbot+903a2ad71fb3f1e47cf5@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d880bc27fb8c3209b54641be4ff6ac02b0e5789a.1743679736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-03 08:31:57 -06:00
Pavel Begunkov
697b2876ac io_uring: add req flag invariant build assertion
We're caching some of the file-related request flags in a tricky way;
add a build check to make sure the flags don't get reshuffled.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9877577b83c25dd78224a8274f799187e7ec7639.1743407551.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-31 07:07:34 -06:00
Pavel Begunkov
ea9106786e io_uring: don't pass ctx to tw add remote helper
Unlike earlier versions, io_msg_remote_post() creates a valid request
with a proper context, so don't pass a context to
io_req_task_work_add_remote() explicitly but derive it from the request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/721f51cf34996d98b48f0bfd24ad40aa2730167e.1743190078.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28 17:14:01 -06:00
Linus Torvalds
eff5f16bfd for-6.15/io_uring-reg-vec-20250327
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmflYcAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmvJD/4tKQlr0yRhln/JzPiONS41mUAuNRI4MdqJ
 ykpQkMx3NcQANbNyOxI0PV5I7y1Jdlg/UP9gy11BrIaBk4Kqoluc6iAzgr5q9pBC
 8pXhPIe80R/q/LOKEz9n5gqOMPNyUtd7IaBayJPBJre/YZXQu+49IL2Uyy3hss8d
 neqAbWErd2FoUfTY14XB3ImLM6a76Z6CjE3pJYvVDM5uRBuH0IGqehJJuNpsViBf
 M9XAW/HZt8ISsVt1tJbCQVWx4b63L/omHI8u5K2M0isTPV+QPk1O2Vgkn7dBrDeT
 JvThWrM1uE++DYGcQ3DXHfb3gBIFEjTrNb2nddstyEU2ZaEXUkuOV2O0b7WPuphj
 zp0oFaLl/ivHT8NoJzzZzK24zt99Qz43GWUaFCQeR0R8oTix/M1q0unguER45Iv7
 Po/b3h6+RAi+87KOlM5WWo05ScswS8AwcSUsP5xMR5BjjD+GQYO5PmVVyo8w0rid
 8F9U9DpN2CTA5YVjI+ax1cxWMOfmAXPK5ONjzZpyJoWb0THgj97esEwc2un7SBi7
 TJJz7Gc9/xOqfRKaPDoH9t8+b6ruWHMqCYDw6exSAUKeDxQ+7z0zNMudHkuR5VrX
 x+Taaj95ONLVNZYz0mbFcvmJC0UBOqkE94omXk7TU2Cn7SBzAW//XDep6CPpX/sa
 LcmOK4UXdg==
 =vOm1
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/io_uring-reg-vec-20250327' of git://git.kernel.dk/linux

Pull more io_uring updates from Jens Axboe:
 "Final separate updates for io_uring.

  This started out as a series of cleanups and improvements for
  registered buffers, but as the last series of the io_uring changes
  for 6.15, it also collected a few fixes for the other branches on top:

   - Add support for vectored fixed/registered buffers.

     Previously only single segments have been supported for commands,
     now vectored variants are supported as well. This series includes
     networking and file read/write support.

   - Small series unifying return codes across multi and single shot.

   - Small series cleaning up registered buffer importing.

   - Adding support for vectored registered buffers for uring_cmd.

   - Fix for io-wq handling of command reissue.

   - Various little fixes and tweaks"

* tag 'for-6.15/io_uring-reg-vec-20250327' of git://git.kernel.dk/linux: (25 commits)
  io_uring/net: fix io_req_post_cqe abuse by send bundle
  io_uring/net: use REQ_F_IMPORT_BUFFER for send_zc
  io_uring: move min_events sanitisation
  io_uring: rename "min" arg in io_iopoll_check()
  io_uring: open code __io_post_aux_cqe()
  io_uring: defer iowq cqe overflow via task_work
  io_uring: fix retry handling off iowq
  io_uring/net: only import send_zc buffer once
  io_uring/cmd: introduce io_uring_cmd_import_fixed_vec
  io_uring/cmd: add iovec cache for commands
  io_uring/cmd: don't expose entire cmd async data
  io_uring: rename the data cmd cache
  io_uring: rely on io_prep_reg_vec for iovec placement
  io_uring: introduce io_prep_reg_iovec()
  io_uring: unify STOP_MULTISHOT with IOU_OK
  io_uring: return -EAGAIN to continue multishot
  io_uring: cap cached iovec/bvec size
  io_uring/net: implement vectored reg bufs for zctx
  io_uring/net: convert to struct iou_vec
  io_uring/net: pull vec alloc out of msghdr import
  ...
2025-03-28 15:07:04 -07:00
Linus Torvalds
ca0b04ba0b for-6.15/io_uring-rx-zc-20250325
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfjTP8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpm6oEACnpGL52FAKTVj14GDqFo6Pq0Jmnh07x8qj
 mpHFPwxfWAzRiuNyji2iS9ecS2cnlkixNyMWZipXRi4KJAUjJH6YDd7IofUI3Glf
 6v7b6srFSvsWJIJ8LdkJHLHAJuzYnJvFZ8apwgQczEDqgHq7BAunM1sVQ+mydjYk
 EXT4kN6DSBOPzwr9GAay52f8nXhbqdHfT+YTGHPHg+QToojL6gD7vvW57w/QqD/x
 91hJef1z01cSIsDZOxA0EUeD+9bBAHpoamr/e3IOOCVYCN6hy0dGa9g0QGbbpVyE
 AeU4FGZLV9J8OOfvHVraDt5Wn3IXxYaMu22dSn1S6tVinwjXhJR2LAA+t4fGHAkt
 i38LjOsIbopSQn/cNhzwC8UZcHLqnVsdDolHlHzsVFVfcpck2/4JFpUeP8QhWgrk
 f9tY12QUf/oEaWm0/sUCHZNFxpIGeFA5FFXf0Z92clnzBuiuWoesBNvxqY/2DeZn
 IDNXiv+Trxr6kFEjTpzPeuxbWrn4PJ7afQSAFcEmOCguk5riM+zJZNIKg0TxUHSS
 tt6sfxmTP1DhgDKad5kT3MLyzOcx47Kbjf4dj6KmRnD+3DGwwN2F7X7R1GJylPSp
 RLOzJ+Ouuy9UmBN6JMsT4BmR9+FJTVirADU926d/ZqCTtRV8Tnq/6HHmKmmr4CR0
 THJ0PJqQjg==
 =MOve
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/io_uring-rx-zc-20250325' of git://git.kernel.dk/linux

Pull io_uring zero-copy receive support from Jens Axboe:
 "This adds support for zero-copy receive with io_uring, enabling fast
  bulk receive of data directly into application memory, rather than
  needing to copy the data out of kernel memory.

  While this version only supports host memory as that was the initial
  target, other memory types are planned as well, with notably GPU
  memory coming next.

  This work depends on some networking components which were queued up
  on the networking side, but have now landed in your tree.

  This is the work of Pavel Begunkov and David Wei. From the v14 posting:

    'We configure a page pool that a driver uses to fill a hw rx queue
     to hand out user pages instead of kernel pages. Any data that ends
     up hitting this hw rx queue will thus be dma'd into userspace
     memory directly, without needing to be bounced through kernel
     memory. 'Reading' data out of a socket instead becomes a
     _notification_ mechanism, where the kernel tells userspace where
     the data is. The overall approach is similar to the devmem TCP
     proposal

     This relies on hw header/data split, flow steering and RSS to
     ensure packet headers remain in kernel memory and only desired
     flows hit a hw rx queue configured for zero copy. Configuring this
     is outside of the scope of this patchset.

     We share netdev core infra with devmem TCP. The main difference is
     that io_uring is used for the uAPI and the lifetime of all objects
     are bound to an io_uring instance. Data is 'read' using a new
     io_uring request type. When done, data is returned via a new shared
     refill queue. A zero copy page pool refills a hw rx queue from this
     refill queue directly. Of course, the lifetime of these data
     buffers are managed by io_uring rather than the networking stack,
     with different refcounting rules.

     This patchset is the first step adding basic zero copy support. We
     will extend this iteratively with new features e.g. dynamically
     allocated zero copy areas, THP support, dmabuf support, improved
     copy fallback, general optimisations and more'

  In a local setup, I was able to saturate a 200G link with a single CPU
  core, and at netdev conf 0x19 earlier this month, Jamal reported
  188Gbit of bandwidth using a single core (no HT, including soft-irq).

  Safe to say the efficiency is there, as bigger links would be needed
  to find the per-core limit, and it's considerably more efficient and
  faster than the existing devmem solution"

* tag 'for-6.15/io_uring-rx-zc-20250325' of git://git.kernel.dk/linux:
  io_uring/zcrx: add selftest case for recvzc with read limit
  io_uring/zcrx: add a read limit to recvzc requests
  io_uring: add missing IORING_MAP_OFF_ZCRX_REGION in io_uring_mmap
  io_uring: Rename KConfig to Kconfig
  io_uring/zcrx: fix leaks on failed registration
  io_uring/zcrx: recheck ifq on shutdown
  io_uring/zcrx: add selftest
  net: add documentation for io_uring zcrx
  io_uring/zcrx: add copy fallback
  io_uring/zcrx: throttle receive requests
  io_uring/zcrx: set pp memory provider for an rx queue
  io_uring/zcrx: add io_recvzc request
  io_uring/zcrx: dma-map area for the device
  io_uring/zcrx: implement zerocopy receive pp memory provider
  io_uring/zcrx: grab a net device
  io_uring/zcrx: add io_zcrx_area
  io_uring/zcrx: add interface queue and refill queue
2025-03-28 13:45:52 -07:00
Pavel Begunkov
6889ae1b4d io_uring/net: fix io_req_post_cqe abuse by send bundle
[  114.987980][ T5313] WARNING: CPU: 6 PID: 5313 at io_uring/io_uring.c:872 io_req_post_cqe+0x12e/0x4f0
[  114.991597][ T5313] RIP: 0010:io_req_post_cqe+0x12e/0x4f0
[  115.001880][ T5313] Call Trace:
[  115.002222][ T5313]  <TASK>
[  115.007813][ T5313]  io_send+0x4fe/0x10f0
[  115.009317][ T5313]  io_issue_sqe+0x1a6/0x1740
[  115.012094][ T5313]  io_wq_submit_work+0x38b/0xed0
[  115.013223][ T5313]  io_worker_handle_work+0x62a/0x1600
[  115.013876][ T5313]  io_wq_worker+0x34f/0xdf0

As the comment states, io_req_post_cqe() should only be used by
multishot requests, i.e. REQ_F_APOLL_MULTISHOT, which bundled sends are
not. Add a flag signifying whether a request wants to post multiple
CQEs. Eventually REQ_F_APOLL_MULTISHOT should imply the new flag, but
that's left out for simplicity.

Cc: stable@vger.kernel.org
Fixes: a05d1f625c ("io_uring/net: support bundles for send")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b611dbb54d1cd47a88681f5d38c84d0c02bc563.1743067183.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-27 05:48:32 -06:00
Linus Torvalds
91928e0d3c for-6.15/io_uring-20250322
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfe8DcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprzNEADPOOfi05F0f4KOMLYOiRlB+D9yGtnbcu8k
 foPATU++L5fFDUpjAGobi38AXqHtF/Bx0CjCkPCvy4GjygcszYDY81FQmRlSFdeH
 CnYV/bm5myyupc8xANcWGJYpxf/ZAw7ugOAgcR3Xxk/qXdD5hYJ1jG4477o8v4IJ
 jyzE4vQCnWBogIoqpWTQyY3FSP5fdjeQlYdFGIFCx68rFTO9zEpuHhFr2OMfJtSu
 y1dLkeqbu1BZ0fhgRog9k9ZWk0uRQJI7d7EUXf9iwXOfkUyzCaV+nHvNGRdhJ/XO
 zj/qePw4lFyCgc+kuIejjIaPf/g84paCzshz/PIxOZTsvo6X78zj72RPM0Makhga
 amgtEqegWHtKSya1zew57oDwqwaIe6b+56PtNSP7K8IrwcJLheszoPKJIwiiynIL
 ZqvWseqYQXs58S9qZzSZlyEpwLVFcvyNt7Z/q7b56vhvXHabxOccPkUHt+UscVoQ
 qpG0R9+zmIiDfzY6Vfu7JyH2Uspdq3WjGYiaoVUko1rJXY+HFPd3be9pIbckyaA8
 ZQHIxB+h8IlxhpQ8HJKQpZhtYiluxEu6wDdvtpwC0KiGp5g4Q5eg9SkVyvETIEN8
 yW2tHqMAJJtJjDAPwJOV+GIyWQMIexzLmMP7Fq8R5VIoAox+Xn3flgNNEnMZdl/r
 kTc/hRjHvQ==
 =VEQr
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:
 "This is the first of the io_uring pull requests for the 6.15 merge
  window, there will be others once the net tree has gone in. This
  contains:

   - Cleanup and unification of cancelation handling across various
     request types.

   - Improvement for bundles, supporting them both for incrementally
     consumed buffers, and for non-multishot requests.

   - Enable toggling whether iowait is used while waiting on io_uring
     events. Unfortunately this is still tied to CPU frequency boosting
     on short waits, as the scheduler side has not been very receptive
     to splitting the (useless) iowait stat from the cpufreq-implied
     boost.

   - Add support for kbuf nodes, enabling zero-copy support for the ublk
     block driver.

   - Various cleanups for resource node handling.

   - Series greatly cleaning up the legacy provided (non-ring based)
     buffers. For years, we've been pushing the ring provided buffers as
     the way to go, and that is what people have been using. Reduce the
     complexity and code associated with legacy provided buffers.

   - Series cleaning up the compat handling.

   - Series improving and cleaning up the recvmsg/sendmsg iovec and msg
     handling.

   - Series of cleanups for io-wq.

   - Start adding a bunch of selftests. The liburing repository
     generally carries feature and regression tests for everything, but
     at least for ublk initially, we'll try and go the route of having
     it in selftests as well. We'll see how this goes, might decide to
     migrate more tests this way in the future.

   - Various little cleanups and fixes"

* tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux: (108 commits)
  selftests: ublk: add stripe target
  selftests: ublk: simplify loop io completion
  selftests: ublk: enable zero copy for null target
  selftests: ublk: prepare for supporting stripe target
  selftests: ublk: move common code into common.c
  selftests: ublk: increase max buffer size to 1MB
  selftests: ublk: add single sqe allocator helper
  selftests: ublk: add generic_01 for verifying sequential IO order
  selftests: ublk: fix starting ublk device
  io_uring: enable toggle of iowait usage when waiting on CQEs
  selftests: ublk: fix write cache implementation
  selftests: ublk: add variable for user to not show test result
  selftests: ublk: don't show `modprobe` failure
  selftests: ublk: add one dependency header
  io_uring/kbuf: enable bundles for incrementally consumed buffers
  Revert "io_uring/rsrc: simplify the bvec iter count calculation"
  selftests: ublk: improve test usability
  selftests: ublk: add stress test for covering IO vs. killing ublk server
  selftests: ublk: add one stress test for covering IO vs. removing device
  selftests: ublk: load/unload ublk_drv when preparing & cleaning up tests
  ...
2025-03-26 17:56:00 -07:00
Linus Torvalds
054570267d lsm/stable-6.15 PR 20250323
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmfgWgMUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNW5RAAvCDq5gBtY0aTNlULe637EVLSh+t8
 PkSzHzu/NlzU6BfjtwSm2fuML8welTGxSwUPxUzMCI91gPdkGeFktefavT3xa+QI
 BHWROn7fEJ/KmRZvngPeIkgLr5xhF5nBJmc/Jw71qem20zRzNgJnpzMX16d10Phx
 dxd2xOO1qM3bv6Z9RcIssZRGaN+PHngpWWg+0B69XuaBUso87S6NDyKNn1XPmvoz
 as96k+Wk/xAZGVEeCbs/+H5rBx6DLg+FfTRa06Oh4BFsqedpkDPxLrTgCJGJkA0H
 dsK6O/993zvjx0Jn4ZPoJ9n35S82BmkCsz4bGq1xVl6FYUiMcm3/8yO41wllS+w4
 j+RlTU/RIdB7n8EKyMMl1hj1stTvt3Bi9F5Cbf7ZEv0snfR00K4KVpi17jnFjUHv
 kpOiEtXZb/NGQip7UAuUq0PisfqbiO4jJurYHRetDgv1WCy6+C8ufM5t6I+cnvmG
 VG+dlxcW+rDIn6bLRVuGi9TJRsQ6eox9ipa+qEKNNiOXgftELcgT7m74nAS5m0uv
 n5rDa221nPXecEB0X7d6YUFk711lly90dbelNeLrmv1w6jl8L1PpS1oBaW+UzGu9
 46eGBd6pzu9otvK9WVyDEdotDOCrgH0sd7pTetqDhLJZ7KrGwyyqO2gD/JroUKcC
 lnxBQwPnat86iI8=
 =oxfV
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull lsm updates from Paul Moore:

 - Various minor updates to the LSM Rust bindings

   Changes include marking trivial Rust bindings as inlines and comment
   tweaks to better reflect the LSM hooks.

 - Add LSM/SELinux access controls to io_uring_allowed()

   Similar to the io_uring_disabled sysctl, add an LSM hook to
   io_uring_allowed() to give LSMs a simple way to enforce security
   policy on the use of io_uring. This pull request includes SELinux
   support for this new control using the io_uring/allowed permission.

 - Remove an unused parameter from the security_perf_event_open() hook

   The perf_event_attr struct parameter was not used by any currently
   supported LSMs, remove it from the hook.

 - Add an explicit MAINTAINERS entry for the credentials code

   We've seen problems in the past where patches to the credentials code
   sent by non-maintainers would often languish on the lists for
   multiple months as there was no one explicitly tasked with the
   responsibility of reviewing and/or merging credentials related code.

   Considering that most of the code under security/ has a vested
   interest in ensuring that the credentials code is well maintained,
   I'm volunteering to look after the credentials code and Serge Hallyn
   has also volunteered to step up as an official reviewer. I posted the
   MAINTAINERS update as a RFC to LKML in hopes that someone else would
   jump up with an "I'll do it!", but beyond Serge it was all crickets.

 - Update Stephen Smalley's old email address to prevent confusion

   This includes a corresponding update to the mailmap file.

* tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  mailmap: map Stephen Smalley's old email addresses
  lsm: remove old email address for Stephen Smalley
  MAINTAINERS: add Serge Hallyn as a credentials reviewer
  MAINTAINERS: add an explicit credentials entry
  cred,rust: mark Credential methods inline
  lsm,rust: reword "destroy" -> "release" in SecurityCtx
  lsm,rust: mark SecurityCtx methods inline
  perf: Remove unnecessary parameter of security check
  lsm: fix a missing security_uring_allowed() prototype
  io_uring,lsm,selinux: add LSM hooks for io_uring_setup()
  io_uring: refactor io_uring_allowed()
2025-03-25 15:44:19 -07:00
Pavel Begunkov
816619782b io_uring: move min_events sanitisation
iopoll and normal waiting already duplicate min_completion truncation,
so move them inside the corresponding routines.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/254adb289cc04638f25d746a7499260fa89a179e.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:38:01 -06:00
Pavel Begunkov
d73acd7af3 io_uring: rename "min" arg in io_iopoll_check()
Don't name arguments "min"; it shadows the namesake function.
min_events is also more consistent.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f52ce9d88d3bca5732a218b0da14924aa6968909.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:37:57 -06:00
Pavel Begunkov
4c76de42cb io_uring: open code __io_post_aux_cqe()
There is no reason to keep __io_post_aux_cqe() separately from
io_post_aux_cqe().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2c4c1f68d694deea25a212fc09bbb11f330cd82e.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:37:52 -06:00
Pavel Begunkov
3afcb3b2e3 io_uring: defer iowq cqe overflow via task_work
Don't handle CQE overflows in io_req_complete_post() and defer it to
flush_completions. It cuts some duplication, and I also want to limit
the number of places directly overflowing completions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9046410ac27e18f2baa6f7cdb363ec921cbc3b79.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:37:47 -06:00
Pavel Begunkov
3f0cb8de56 io_uring: fix retry handling off iowq
io_req_complete_post() doesn't handle reissue, and if called with a
REQ_F_REISSUE request it might post extra unexpected completions. Fix it
by pushing into flush_completion via task work.

Fixes: d803d12394 ("io_uring/rw: handle -EAGAIN retry at IO completion time")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/badb3d7e462881e7edbfcc2be6301090b07dbe53.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:37:43 -06:00
Linus Torvalds
a50b4fe095 A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
   the callback pointer. This turned out to be suboptimal for the upcoming
   Rust integration and is obviously a silly implementation to begin with.
 
   This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
   with hrtimer_setup(T, cb);
 
   The conversion was done with Coccinelle and a few manual fixups.
 
   Once the conversion has completely landed in mainline, hrtimer_init()
   will be removed and the hrtimer::function becomes a private member.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
 rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
 ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
 d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
 Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
 iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
 Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
 gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
 Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
 iJbvLJeisXr3GA==
 =bYAX
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
Pavel Begunkov
3a4689ac10 io_uring/cmd: add iovec cache for commands
Add iou_vec to commands and wire up caching for it, but don't expose it
to users just yet. We need the vec cleared on initial alloc, but since
we can't place it at the beginning at the moment, zero the entire
async_data. It's cached, so this only affects the performance of the
initial allocation, and it might not be a bad idea anyway, since we're
exposing those bits to outside drivers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c0f2145b75791bc6106eb4e72add2cf6a2c72a7a.1742579999.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-21 12:52:15 -06:00
Jens Axboe
07754bfd9a io_uring: enable toggle of iowait usage when waiting on CQEs
By default, io_uring marks a waiting task as being in iowait, if it's
sleeping waiting on events and there are pending requests. This isn't
necessarily always useful, and may be confusing on non-storage setups
where iowait isn't expected. It can also cause extra power usage, by
preventing the CPU from entering lower sleep states.

This adds a new enter flag, IORING_ENTER_NO_IOWAIT. If set, then
io_uring will not account the sleeping task as being in iowait. If the
kernel supports this feature, then it will be marked by having the
IORING_FEAT_NO_IOWAIT feature flag set.

As the kernel currently does not support separating the iowait
accounting and CPU frequency boosting, the IORING_ENTER_NO_IOWAIT
controls both of these at the same time. In the future, if those do end
up being split, then it'd be possible to control them separately.
However, it seems more likely that the kernel will decouple iowait and
CPU frequency boosting anyway.
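
A hedged userspace sketch of how the flag could be used (the raw
io_uring_enter(2) syscall is used since no liburing helper is assumed;
the constants are the ones named above, and 'p' is the io_uring_params
filled in at ring setup time):

  #include <liburing.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Wait for CQEs without being accounted as iowait, if supported. */
  static int wait_cqes_no_iowait(struct io_uring *ring,
                                 const struct io_uring_params *p,
                                 unsigned int min_complete)
  {
      unsigned int flags = IORING_ENTER_GETEVENTS;

      if (p->features & IORING_FEAT_NO_IOWAIT)
          flags |= IORING_ENTER_NO_IOWAIT;

      return (int)syscall(__NR_io_uring_enter, ring->ring_fd, 0,
                          min_complete, flags, NULL, 0);
  }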

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-20 20:01:03 -06:00
Pavel Begunkov
5f14404bfa io_uring/cmd: don't expose entire cmd async data
io_uring needs private bits in cmd's ->async_data, and they should never
be exposed to drivers as it'd certainly be abused. Leave struct
io_uring_cmd_data for the drivers but wrap it into a structure. It's a
prep patch and doesn't do anything useful yet.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20250319061251.21452-3-sidong.yang@furiosa.ai
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-19 09:25:55 -06:00
Pavel Begunkov
575e7b0629 io_uring: rename the data cmd cache
Pick a more descriptive name for the cmd async data cache.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20250319061251.21452-2-sidong.yang@furiosa.ai
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-19 09:25:55 -06:00