2031 Commits

Author SHA1 Message Date
Linus Torvalds
c612261bed Merge tag 'io_uring-7.0-20260320' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:

 - A bit of a work-around for AF_UNIX recv multishot, as the in-kernel
   implementation doesn't properly signal EOF. We'll likely rework this
   one going forward, but the fix is sufficient for now

 - Two fixes for incrementally consumed buffers, for non-pollable files
   and for 0 byte reads

* tag 'io_uring-7.0-20260320' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/kbuf: propagate BUF_MORE through early buffer commit path
  io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOF
  io_uring/poll: fix multishot recv missing EOF on wakeup race
2026-03-20 09:58:56 -07:00
Jens Axboe
418eab7a6f io_uring/kbuf: propagate BUF_MORE through early buffer commit path
When io_should_commit() returns true (eg for non-pollable files), buffer
commit happens at buffer selection time and sel->buf_list is set to
NULL. When __io_put_kbufs() generates CQE flags at completion time, it
calls __io_put_kbuf_ring() which finds a NULL buffer_list and hence
cannot determine whether the buffer was consumed or not. This means that
IORING_CQE_F_BUF_MORE is never set for non-pollable input with
incrementally consumed buffers.

Likewise for io_buffers_select(), which always commits upfront and
discards the return value of io_kbuf_commit().

Add REQ_F_BUF_MORE to store the result of io_kbuf_commit() during early
commit. Then __io_put_kbuf_ring() can check this flag and set
IORING_F_BUF_MORE accordingy.

Reported-by: Martin Michaelis <code@mgjm.de>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1553
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-19 15:09:48 -06:00
Jens Axboe
3ecd3e0314 io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOF
For a zero length transfer, io_kbuf_inc_commit() is called with !len.
Since we never enter the while loop to consume the buffers,
io_kbuf_inc_commit() ends up returning true, consuming the buffer. But
if no data was consumed, by definition it cannot have consumed the
buffer. Return false for that case.

Reported-by: Martin Michaelis <code@mgjm.de>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1553
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-19 15:09:40 -06:00
Jens Axboe
a68ed2df72 io_uring/poll: fix multishot recv missing EOF on wakeup race
When a socket send and shutdown() happen back-to-back, both fire
wake-ups before the receiver's task_work has a chance to run. The first
wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
When io_poll_check_events() runs, it calls io_poll_issue() which does a
recv that reads the data and returns IOU_RETRY. The loop then drains all
accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
the first event was consumed. Since the shutdown is a persistent state
change, no further wakeups will happen, and the multishot recv can hang
forever.

Check specifically for HUP in the poll loop, and ensure that another
loop is done to check for status if more than a single poll activation
is pending. This ensures we don't lose the shutdown event.

Cc: stable@vger.kernel.org
Fixes: dbc2564cfe ("io_uring: let fast poll support multishot")
Reported-by: Francis Brosseau <francis@malagauche.com>
Link: https://github.com/axboe/liburing/issues/1549
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-17 13:54:08 -06:00
Linus Torvalds
e67bf352a0 Merge tag 'io_uring-7.0-20260312' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:

 - Fix an inverted true/false comment on task_no_new_privs, from the
   BPF filtering changes merged in this release

 - Use the migration disabling way of running the BPF filters, as the
   io_uring side doesn't do that already

 - Fix an issue with ->rings stability under resize, both for local
   task_work additions and for eventfd signaling

 - Fix an issue with SQE mixed mode, where a bounds check wasn't correct
   for having a 128b SQE

 - Fix an issue where a legacy provided buffer group is changed to to
   ring mapped one while legacy buffers from that group are in flight

* tag 'io_uring-7.0-20260312' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/kbuf: check if target buffer list is still legacy on recycle
  io_uring: fix physical SQE bounds check for SQE_MIXED 128-byte ops
  io_uring/eventfd: use ctx->rings_rcu for flags checking
  io_uring: ensure ctx->rings is stable for task work flags manipulation
  io_uring/bpf_filter: use bpf_prog_run_pin_on_cpu() to prevent migration
  io_uring/register: fix comment about task_no_new_privs
2026-03-13 10:09:35 -07:00
Jens Axboe
c2c185be5c io_uring/kbuf: check if target buffer list is still legacy on recycle
There's a gap between when the buffer was grabbed and when it
potentially gets recycled, where if the list is empty, someone could've
upgraded it to a ring provided type. This can happen if the request
is forced via io-wq. The legacy recycling is missing checking if the
buffer_list still exists, and if it's of the correct type. Add those
checks.

Cc: stable@vger.kernel.org
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-12 08:59:25 -06:00
Tom Ryan
6f02c6b196 io_uring: fix physical SQE bounds check for SQE_MIXED 128-byte ops
When IORING_SETUP_SQE_MIXED is used without IORING_SETUP_NO_SQARRAY,
the boundary check for 128-byte SQE operations in io_init_req()
validated the logical SQ head position rather than the physical SQE
index.

The existing check:

  !(ctx->cached_sq_head & (ctx->sq_entries - 1))

ensures the logical position isn't at the end of the ring, which is
correct for NO_SQARRAY rings where physical == logical. However, when
sq_array is present, an unprivileged user can remap any logical
position to an arbitrary physical index via sq_array. Setting
sq_array[N] = sq_entries - 1 places a 128-byte operation at the last
physical SQE slot, causing the 128-byte memcpy in
io_uring_cmd_sqe_copy() to read 64 bytes past the end of the SQE
array.

Replace the cached_sq_head alignment check with a direct validation
of the physical SQE index, which correctly handles both sq_array and
NO_SQARRAY cases.

Fixes: 1cba30bf9f ("io_uring: add support for IORING_SETUP_SQE_MIXED")
Signed-off-by: Tom Ryan <ryan36005@gmail.com>
Link: https://patch.msgid.link/20260310052003.72871-1-ryan36005@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-11 14:35:19 -06:00
Jens Axboe
177c694321 io_uring/eventfd: use ctx->rings_rcu for flags checking
Similarly to what commit e78f7b70e837 did for local task work additions,
use ->rings_rcu under RCU rather than dereference ->rings directly. See
that commit for more details.

Cc: stable@vger.kernel.org
Fixes: 79cfe9e59c ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-11 14:35:19 -06:00
Jens Axboe
9618908026 io_uring: ensure ctx->rings is stable for task work flags manipulation
If DEFER_TASKRUN | SETUP_TASKRUN is used and task work is added while
the ring is being resized, it's possible for the OR'ing of
IORING_SQ_TASKRUN to happen in the small window of swapping into the
new rings and the old rings being freed.

Prevent this by adding a 2nd ->rings pointer, ->rings_rcu, which is
protected by RCU. The task work flags manipulation is inside RCU
already, and if the resize ring freeing is done post an RCU synchronize,
then there's no need to add locking to the fast path of task work
additions.

Note: this is only done for DEFER_TASKRUN, as that's the only setup mode
that supports ring resizing. If this ever changes, then they too need to
use the io_ctx_mark_taskrun() helper.

Link: https://lore.kernel.org/io-uring/20260309062759.482210-1-naup96721@gmail.com/
Cc: stable@vger.kernel.org
Fixes: 79cfe9e59c ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Reported-by: Hao-Yu Yang <naup96721@gmail.com>
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-11 14:35:16 -06:00
Jens Axboe
785d4625d3 io_uring/bpf_filter: use bpf_prog_run_pin_on_cpu() to prevent migration
Since the caller, __io_uring_run_bpf_filters(), doesn't prevent
migration, it should use the migration disabling variant for running
the BPF program.

Fixes: d42eb05e60 ("io_uring: add support for BPF filtering for opcode restrictions")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-09 14:20:14 -06:00
Jann Horn
3306a589e5 io_uring/register: fix comment about task_no_new_privs
The actual code is right, but the comment is the wrong way around.

Fixes: ed82f35b92 ("io_uring: allow registration of per-task restrictions")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-09 08:40:18 -06:00
Linus Torvalds
3ad66a34cc Merge tag 'io_uring-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:

 - Fix a typo in the mock_file help text

 - Fix a comment regarding IORING_SETUP_TASKRUN_FLAG in the
   io_uring.h UAPI header

 - Use READ_ONCE() for reading refill queue entries

 - Reject SEND_VECTORIZED for fixed buffer sends, as it isn't
   implemented. Currently this flag is silently ignored

   This is in preparation for making these work, but first we
   need a fixup so that older kernels will correctly reject them

 - Ensure "0" means default for the rx page size

* tag 'io_uring-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/zcrx: use READ_ONCE with user shared RQEs
  io_uring/mock: Fix typo in help text
  io_uring/net: reject SEND_VECTORIZED when unsupported
  io_uring: correct comment for IORING_SETUP_TASKRUN_FLAG
  io_uring/zcrx: don't set rx_page_size when not requested
2026-03-06 08:31:36 -08:00
Pavel Begunkov
531bb98a03 io_uring/zcrx: use READ_ONCE with user shared RQEs
Refill queue entries are shared with the user space, use READ_ONCE when
reading them.

Fixes: 34a3e60821 ("io_uring/zcrx: implement zerocopy receive pp memory provider");
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-04 06:30:39 -07:00
Pavel Begunkov
c36e28becd io_uring/net: reject SEND_VECTORIZED when unsupported
IORING_SEND_VECTORIZED with registered buffers is not implemented but
could be. Don't silently ignore the flag in this case but reject it with
an error. It only affects sendzc as normal sends don't support
registered buffers.

Fixes: 6f02527729 ("io_uring/net: Allow to do vectorized send")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-02 09:17:04 -07:00
Jakub Kicinski
3d17d76d1f io_uring/zcrx: don't set rx_page_size when not requested
The rx_buf_len parameter was recently added to the Rx zero-copy
implementation. The expectation is that when not set system will
maintain previous behavior and use the default buffer size (PAGE_SIZE).

This works correctly at the iouring level, but we don't preserve
the same "zero means default" semantics when registering the memory
provider on the netdev. mp_param.rx_page_size is unconditionally
set to PAGE_SIZE. This causes __net_mp_open_rxq() to check for
QCFG_RX_PAGE_SIZE support in the driver, and return -EOPNOTSUPP
for drivers that don't advertise it -- even though the user never
asked for large buffers.

Only set mp_param.rx_page_size when rx_buf_len was explicitly provided,
so that the default page size path works on all zcrx-capable drivers.
mlx5 and fbnic only support 4kB pages in the current release.

Fixes: 795663b4d1 ("io_uring/zcrx: implement large rx buffer support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-27 12:32:49 -07:00
Linus Torvalds
530b0b61df Merge tag 'io_uring-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
 "Just two minor patches in here, ensuring the use of READ_ONCE() for
  sqe field reading is consistent across the codebase. There were two
  missing cases, now they are covered too"

* tag 'io_uring-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/timeout: READ_ONCE sqe->addr
  io_uring/cmd_net: use READ_ONCE() for ->addr3 read
2026-02-27 10:39:11 -08:00
Pavel Begunkov
85f6c439a6 io_uring/timeout: READ_ONCE sqe->addr
We should use READ_ONCE when reading from a SQE, make sure timeout gets
a stable timespec address.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-25 08:36:05 -07:00
Jens Axboe
a46435537a io_uring/cmd_net: use READ_ONCE() for ->addr3 read
Any SQE read should use READ_ONCE(), to ensure the result is read once
and only once. Doesn't really matter for this case, but it's better to
keep these 100% consistent and always use READ_ONCE() for the prep side
of SQE handling.

Fixes: 5d24321e4c ("io_uring: Introduce getsockname io_uring cmd")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-24 11:38:34 -07:00
Linus Torvalds
323bbfcf1e Convert 'alloc_flex' family to use the new default GFP_KERNEL argument
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.

As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Linus Torvalds
8934827db5 Merge tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull kmalloc_obj conversion from Kees Cook:
 "This does the tree-wide conversion to kmalloc_obj() and friends using
  coccinelle, with a subsequent small manual cleanup of whitespace
  alignment that coccinelle does not handle.

  This uncovered a clang bug in __builtin_counted_by_ref(), so the
  conversion is preceded by disabling that for current versions of
  clang.  The imminent clang 22.1 release has the fix.

  I've done allmodconfig build tests for x86_64, arm64, i386, and arm. I
  did defconfig builds for alpha, m68k, mips, parisc, powerpc, riscv,
  s390, sparc, sh, arc, csky, xtensa, hexagon, and openrisc"

* tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  kmalloc_obj: Clean up after treewide replacements
  treewide: Replace kmalloc with kmalloc_obj for non-scalar types
  compiler_types: Disable __builtin_counted_by_ref for Clang
2026-02-21 11:02:58 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Caleb Sander Mateos
42a6bd57ee io_uring: add IORING_OP_URING_CMD128 to opcode checks
io_should_commit(), io_uring_classic_poll(), and io_do_iopoll() compare
struct io_kiocb's opcode against IORING_OP_URING_CMD to implement
special treatment for uring_cmds. The recently added opcode
IORING_OP_URING_CMD128 is meant to be equivalent to IORING_OP_URING_CMD,
so treat it the same way in these functions.

Fixes: 1cba30bf9f ("io_uring: add support for IORING_SETUP_SQE_MIXED")
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-19 07:25:39 -07:00
Kai Aizen
003049b1c4 io_uring/zcrx: fix user_ref race between scrub and refill paths
The io_zcrx_put_niov_uref() function uses a non-atomic
check-then-decrement pattern (atomic_read followed by separate
atomic_dec) to manipulate user_refs. This is serialized against other
callers by rq_lock, but io_zcrx_scrub() modifies the same counter with
atomic_xchg() WITHOUT holding rq_lock.

On SMP systems, the following race exists:

  CPU0 (refill, holds rq_lock)          CPU1 (scrub, no rq_lock)
  put_niov_uref:
    atomic_read(uref) - 1
    // window opens
                                        atomic_xchg(uref, 0) - 1
                                        return_niov_freelist(niov) [PUSH #1]
    // window closes
    atomic_dec(uref) - wraps to -1
    returns true
    return_niov(niov)
    return_niov_freelist(niov)           [PUSH #2: DOUBLE-FREE]

The same niov is pushed to the freelist twice, causing free_count to
exceed nr_iovs. Subsequent freelist pushes then perform an out-of-bounds
write (a u32 value) past the kvmalloc'd freelist array into the adjacent
slab object.

Fix this by replacing the non-atomic read-then-dec in
io_zcrx_put_niov_uref() with an atomic_try_cmpxchg loop that atomically
tests and decrements user_refs. This makes the operation safe against
concurrent atomic_xchg from scrub without requiring scrub to acquire
rq_lock.

Fixes: 34a3e60821 ("io_uring/zcrx: implement zerocopy receive pp memory provider")
Cc: stable@vger.kernel.org
Signed-off-by: Kai Aizen <kai@snailsploit.com>
[pavel: removed a warning and a comment]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-18 10:39:48 -07:00
Linus Torvalds
7b751b01ad Merge tag 'io_uring-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull more io_uring updates from Jens Axboe:
 "This is a mix of cleanups and fixes. No major fixes in here, just a
  bunch of little fixes. Some of them marked for stable as it fixes
  behavioral issues

   - Fix an issue with SOCKET_URING_OP_SETSOCKOPT for netlink sockets,
     due to a too restrictive check on it having an ioctl handler

   - Remove a redundant SQPOLL check in ring creation

   - Kill dead accounting for zero-copy send, which doesn't use ->buf
     or ->len post the initial setup

   - Fix missing clamp of the allocation hint, which could cause
     allocations to fall outside of the range the application asked
     for. Still within the allowed limits.

   - Fix for IORING_OP_PIPE's handling of direct descriptors

   - Tweak to the API for the newly added BPF filters, making them
     more future proof in terms of how applications deal with them

   - A few fixes for zcrx, fixing a few error handling conditions

   - Fix for zcrx request flag checking

   - Add support for querying the zcrx page size

   - Improve the NO_SQARRAY static branch inc/dec, avoiding busy
     conditions causing too much traffic

   - Various little cleanups"

* tag 'io_uring-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/bpf_filter: pass in expected filter payload size
  io_uring/bpf_filter: move filter size and populate helper into struct
  io_uring/cancel: de-unionize file and user_data in struct io_cancel_data
  io_uring/rsrc: improve regbuf iov validation
  io_uring: remove unneeded io_send_zc accounting
  io_uring/cmd_net: fix too strict requirement on ioctl
  io_uring: delay sqarray static branch disablement
  io_uring/query: add query.h copyright notice
  io_uring/query: return support for custom rx page size
  io_uring/zcrx: check unsupported flags on import
  io_uring/zcrx: fix post open error handling
  io_uring/zcrx: fix sgtable leak on mapping failures
  io_uring: use the right type for creds iteration
  io_uring/openclose: fix io_pipe_fixed() slot tracking for specific slots
  io_uring/filetable: clamp alloc_hint to the configured alloc range
  io_uring/rsrc: replace reg buffer bit field with flags
  io_uring/zcrx: improve types for size calculation
  io_uring/tctx: avoid modifying loop variable in io_ring_add_registered_file
  io_uring: simplify IORING_SETUP_DEFER_TASKRUN && !SQPOLL check
2026-02-17 08:33:49 -08:00
Jens Axboe
be3573124e io_uring/bpf_filter: pass in expected filter payload size
It's quite possible that opcodes that have payloads attached to them,
like IORING_OP_OPENAT/OPENAT2 or IORING_OP_SOCKET, that these paylods
can change over time. For example, on the openat/openat2 side, the
struct open_how argument is extensible, and could be extended in the
future to allow further arguments to be passed in.

Allow registration of a cBPF filter to give the size of the filter as
seen by userspace. If that filter is for an opcode that takes extra
payload data, allow it if the application payload expectation is the
same size than the kernels. If that is the case, the kernel supports
filtering on the payload that the application expects. If the size
differs, the behavior depends on the IO_URING_BPF_FILTER_SZ_STRICT flag:

1) If IO_URING_BPF_FILTER_SZ_STRICT is set and the size expectation
   differs, fail the attempt to load the filter.

2) If IO_URING_BPF_FILTER_SZ_STRICT isn't set, allow the filter if
   the userspace pdu size is smaller than what the kernel offers.

3) Regardless if IO_URING_BPF_FILTER_SZ_STRICT, fail loading the filter
   if the userspace pdu size is bigger than what the kernel supports.

An attempt to load a filter due to sizing will error with -EMSGSIZE.
For that error, the registration struct will have filter->pdu_size
populated with the pdu size that the kernel uses.

Reported-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-16 15:56:31 -07:00
Jens Axboe
d21c362182 io_uring/bpf_filter: move filter size and populate helper into struct
Rather than open-code this logic in io_uring_populate_bpf_ctx() with
a switch, move it to the issue side definitions. Outside of making this
easier to extend in the future, it's also a prep patch for using the
pdu size for a given opcode filter elsewhere.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-16 15:56:25 -07:00
Jens Axboe
22dbb0987b io_uring/cancel: de-unionize file and user_data in struct io_cancel_data
By having them share the same space in struct io_cancel_data, it ends up
disallowing IORING_ASYNC_CANCEL_FD|IORING_ASYNC_CANCEL_USERDATA from
working. Eg you cannot match on both a file and user_data for
cancelation purposes. This obviously isn't a common use case as nobody
has reported this, but it does result in -ENOENT potentially being
returned when trying to match on both, rather than actually doing what
the API says it would.

Fixes: 4bf94615b8 ("io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' key")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-16 14:16:27 -07:00
Pavel Begunkov
2e02f9efdb io_uring/rsrc: improve regbuf iov validation
Deduplicate io_buffer_validate() calls by moving the checks into
io_sqe_buffer_register(). Now we also don't need special handling in
io_buffer_validate() passing through buffer removal requests. I also
was using it as a cleanup before some other changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-16 08:15:38 -07:00
Dylan Yudaken
046fcc83ac io_uring: remove unneeded io_send_zc accounting
zc->len and zc->buf are not actually used once you get to the retry
stage. The buffer remains in kmsg->msg.msg_iter, which is setup in
io_send_setup.
Note: it still seems needed in io_send due to io_send_select_buffer
needing it (for the len parameter).

Signed-off-by: Dylan Yudaken <dyudaken@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-16 08:10:46 -07:00
Asbjørn Sloth Tønnesen
600b665b90 io_uring/cmd_net: fix too strict requirement on ioctl
Attempting SOCKET_URING_OP_SETSOCKOPT on an AF_NETLINK socket resulted
in an -EOPNOTSUPP, as AF_NETLINK doesn't have an ioctl in its struct
proto, but only in struct proto_ops.

Prior to the blamed commit, io_uring_cmd_sock() only had two cmd_op
operations, both requiring ioctl, thus the check was warranted.

Since then, 4 new cmd_op operations have been added, none of which
depend on ioctl. This patch moves the ioctl check, so it only applies
to the original operations.

AFAICT, the ioctl requirement was unintentional, and it wasn't
visible in the blamed patch within 3 lines of context.

Cc: stable@vger.kernel.org
Fixes: a5d2f99aff ("io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT")
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-16 08:08:01 -07:00
Pavel Begunkov
56112578c7 io_uring: delay sqarray static branch disablement
io_key_has_sqarray static branch can be easily switched on/off by the
user every time patching the kernel. That can be very disruptive as it
might require heavy synchronisation across all CPUs. Use deferred static
keys, which can rate-limit it by deferring, batching and potentially
effectively eliminating dec+inc pairs.

Fixes: 9b296c625a ("io_uring: static_key for !IORING_SETUP_NO_SQARRAY")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-15 15:12:54 -07:00
Pavel Begunkov
c29214677a io_uring/query: return support for custom rx page size
Add an ability to query if the zcrx rx page size setting is available.

Note, even when the API is supported by io_uring, the registration can
still get rejected for various reasons, e.g. when the NIC or the driver
doesn't support it, when the particular specified size is unsupported,
when the memory area doesn't satisfy all requirements, etc.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-15 14:55:37 -07:00
Pavel Begunkov
7496e658a7 io_uring/zcrx: check unsupported flags on import
The imoorted zcrx registration path checks for ZCRX_REG_IMPORT, as it
should, but doesn't reject any unsupported flags. Fix that.

Cc: stable@vger.kernel.org
Fixes: 00d9148127 ("io_uring/zcrx: share an ifq between rings")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-15 14:55:29 -07:00
Pavel Begunkov
5d540e4508 io_uring/zcrx: fix post open error handling
Closing a queue doesn't guarantee that all associated page pools are
terminated right away, let the refcounting do the work instead of
releasing the zcrx ctx directly.

Cc: stable@vger.kernel.org
Fixes: e0793de24a ("io_uring/zcrx: set pp memory provider for an rx queue")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-14 18:05:08 -07:00
Pavel Begunkov
a983aae397 io_uring/zcrx: fix sgtable leak on mapping failures
In an unlikely case when io_populate_area_dma() fails, which could only
happen on a PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA machine,
io_zcrx_map_area() will have an initialised and not freed table. It was
supposed to be cleaned up in the error path, but !is_mapped prevents
that.

Fixes: 439a98b972 ("io_uring/zcrx: deduplicate area mapping")
Cc: stable@vger.kernel.org
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-14 18:05:00 -07:00
Linus Torvalds
041c16acba Merge tag 'for-7.0/io_uring-zcrx-large-buffers-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring large rx buffer support from Jens Axboe:
 "Now that the networking updates are upstream, here's the support for
  large buffers for zcrx.

  Using larger (bigger than 4K) rx buffers can increase the effiency of
  zcrx. For example, it's been shown that using 32K buffers can decrease
  CPU usage by ~30% compared to 4K buffers"

* tag 'for-7.0/io_uring-zcrx-large-buffers-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/zcrx: implement large rx buffer support
2026-02-12 15:07:50 -08:00
Jens Axboe
d7d95207ca io_uring: use the right type for creds iteration
In io_ring_ctx_wait_and_kill(), struct creds *creds is used to
iterate and prune credentials. But the correct type is struct cred.
This doesn't matter as the variable isn't used at all, only the index
is used. But it's confusing using a type that isn't valid, so fix it
up.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-11 20:31:58 -07:00
Jens Axboe
f4d0668b38 io_uring/openclose: fix io_pipe_fixed() slot tracking for specific slots
__io_fixed_fd_install() returns 0 on success for non-alloc mode
(specific slot), not the slot index. io_pipe_fixed() used this return
value directly as the slot index in fds[], which can cause the reported
values returned via copy_to_user() to be incorrect, or the error path
operating on the incorrect direct descriptor.

Fix by computing the actual 0-based slot index (slot - 1) for specific
slot mode, while preserving the existing behavior for auto-alloc mode
where __io_fixed_fd_install() already returns the allocated index.

Cc: stable@vger.kernel.org
Fixes: 53db8a71ec ("io_uring: add support for IORING_OP_PIPE")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-11 20:31:21 -07:00
Jens Axboe
a6bded921e io_uring/filetable: clamp alloc_hint to the configured alloc range
Explicit fixed file install/remove operations on slots outside the
configured alloc range can corrupt alloc_hint via io_file_bitmap_set()
and io_file_bitmap_clear(), which unconditionally update alloc_hint to
the bit position. This causes subsequent auto-allocations to fall
outside the configured range.

For example, if the alloc range is [10, 20) and a file is removed at
slot 2, alloc_hint gets set to 2. The next auto-alloc then starts
searching from slot 2, potentially returning a slot below the range.

Fix this by clamping alloc_hint to [file_alloc_start, file_alloc_end)
at the top of io_file_bitmap_get() before starting the search.

Cc: stable@vger.kernel.org
Fixes: 6e73dffbb9 ("io_uring: let to set a range for file slot allocation")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-11 15:20:44 -07:00
Pavel Begunkov
0efc331d78 io_uring/rsrc: replace reg buffer bit field with flags
I'll need a flag in the registered buffer struct for dmabuf work, and
it'll be more convenient to have a flags field rather than bit fields,
especially for io_mapped_ubuf initialisation.

We might want to add more flags in the future as well. For example, it
might be useful for debugging and potentially optimisations to split out
a flag indicating the shape of the buffer to gate iov_iter_advance()
walks vs bit/mask arithmetics. It can also be combined with the
direction mask field.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-10 05:26:15 -07:00
Pavel Begunkov
417d029dc4 io_uring/zcrx: improve types for size calculation
Make sure io_import_umem() promotes the type to long before calculating
the area size. While the area size is capped at 1GB by
io_validate_user_buf_range() and fits into an "int", it's still too
error prone.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-10 05:26:12 -07:00
Yang Xiuwei
daa0b901f8 io_uring/tctx: avoid modifying loop variable in io_ring_add_registered_file
Use a separate 'idx' variable to store the result of array_index_nospec()
instead of modifying the loop variable 'offset' directly. This improves
code clarity by separating the logical index from the sanitized index
used for array access.

No functional change intended.

Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-09 20:12:46 -07:00
Caleb Sander Mateos
7cb3a68376 io_uring: simplify IORING_SETUP_DEFER_TASKRUN && !SQPOLL check
io_uring_sanitise_params() already rejects flags that include both
IORING_SETUP_SQPOLL and IORING_SETUP_DEFER_TASKRUN. So it's unnecessary
to check IORING_SETUP_SQPOLL in io_uring_create() when
IORING_SETUP_DEFER_TASKRUN has already been checked. Drop the
!(ctx->flags & IORING_SETUP_SQPOLL) check for the task_complete case.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-09 20:12:36 -07:00
Linus Torvalds
0c00ed308d Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:

 - Support for batch request processing for ublk, improving the
   efficiency of the kernel/ublk server communication. This can yield
   nice 7-12% performance improvements

 - Support for integrity data for ublk

 - Various other ublk improvements and additions, including a ton of
   selftests additions and updated

 - Move the handling of blk-crypto software fallback from below the
   block layer to above it. This reduces the complexity of dealing with
   bio splitting

 - Series fixing a number of potential deadlocks in blk-mq related to
   the queue usage counter and writeback throttling and rq-qos debugfs
   handling

 - Add an async_depth queue attribute, to resolve a performance
   regression that's been around for a qhilw related to the scheduler
   depth handling

 - Only use task_work for IOPOLL completions on NVMe, if it is necessary
   to do so. An earlier fix for an issue resulted in all these
   completions being punted to task_work, to guarantee that completions
   were only run for a given io_uring ring when it was local to that
   ring. With the new changes, we can detect if it's necessary to use
   task_work or not, and avoid it if possible.

 - rnbd fixes:
      - Fix refcount underflow in device unmap path
      - Handle PREFLUSH and NOUNMAP flags properly in protocol
      - Fix server-side bi_size for special IOs
      - Zero response buffer before use
      - Fix trace format for flags
      - Add .release to rnbd_dev_ktype

 - MD pull requests via Yu Kuai
      - Fix raid5_run() to return error when log_init() fails
      - Fix IO hang with degraded array with llbitmap
      - Fix percpu_ref not resurrected on suspend timeout in llbitmap
      - Fix GPF in write_page caused by resize race
      - Fix NULL pointer dereference in process_metadata_update
      - Fix hang when stopping arrays with metadata through dm-raid
      - Fix any_working flag handling in raid10_sync_request
      - Refactor sync/recovery code path, improve error handling for
        badblocks, and remove unused recovery_disabled field
      - Consolidate mddev boolean fields into mddev_flags
      - Use mempool to allocate stripe_request_ctx and make sure
        max_sectors is not less than io_opt in raid5
      - Fix return value of mddev_trylock
      - Fix memory leak in raid1_run()
      - Add Li Nan as mdraid reviewer

 - Move phys_vec definitions to the kernel types, mostly in preparation
   for some VFIO and RDMA changes

 - Improve the speed for secure erase for some devices

 - Various little rust updates

 - Various other minor fixes, improvements, and cleanups

* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
  blk-mq: ABI/sysfs-block: fix docs build warnings
  selftests: ublk: organize test directories by test ID
  block: decouple secure erase size limit from discard size limit
  block: remove redundant kill_bdev() call in set_blocksize()
  blk-mq: add documentation for new queue attribute async_dpeth
  block, bfq: convert to use request_queue->async_depth
  mq-deadline: covert to use request_queue->async_depth
  kyber: covert to use request_queue->async_depth
  blk-mq: add a new queue sysfs attribute async_depth
  blk-mq: factor out a helper blk_mq_limit_depth()
  blk-mq-sched: unify elevators checking for async requests
  block: convert nr_requests to unsigned int
  block: don't use strcpy to copy blockdev name
  blk-mq-debugfs: warn about possible deadlock
  blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
  blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
  blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
  blk-rq-qos: fix possible debugfs_mutex deadlock
  blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
  blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
  ...
2026-02-09 17:57:21 -08:00
Linus Torvalds
591beb0e3a Merge tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring bpf filters from Jens Axboe:
 "This adds support for both cBPF filters for io_uring, as well as task
  inherited restrictions and filters.

  seccomp and io_uring don't play along nicely, as most of the
  interesting data to filter on resides somewhat out-of-band, in the
  submission queue ring.

  As a result, things like containers and systemd that apply seccomp
  filters, can't filter io_uring operations.

  That leaves them with just one choice if filtering is critical -
  filter the actual io_uring_setup(2) system call to simply disallow
  io_uring. That's rather unfortunate, and has limited us because of it.

  io_uring already has some filtering support. It requires the ring to
  be setup in a disabled state, and then a filter set can be applied.
  This filter set is completely bi-modal - an opcode is either enabled
  or it's not. Once a filter set is registered, the ring can be enabled.
  This is very restrictive, and it's not useful at all to systemd or
  containers which really want both broader and more specific control.

  This first adds support for cBPF filters for opcodes, which enables
  tighter control over what exactly a specific opcode may do. As
  examples, specific support is added for IORING_OP_OPENAT/OPENAT2,
  allowing filtering on resolve flags. And another example is added for
  IORING_OP_SOCKET, allowing filtering on domain/type/protocol. These
  are both common use cases. cBPF was chosen rather than eBPF, because
  the latter is often restricted in containers as well.

  These filters are run post the init phase of the request, which allows
  filters to even dip into data that is being passed in struct in user
  memory, as the init side of requests make that data stable by bringing
  it into the kernel. This allows filtering without needing to copy this
  data twice, or have filters etc know about the exact layout of the
  user data. The filters get the already copied and sanitized data
  passed.

  On top of that support is added for per-task filters, meaning that any
  ring created with a task that has a per-task filter will get those
  filters applied when it's created. These filters are inherited across
  fork as well. Once a filter has been registered, any further added
  filters may only further restrict what operations are permitted.

  Filters cannot change the return value of an operation, they can only
  permit or deny it based on the contents"

* tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring: allow registration of per-task restrictions
  io_uring: add task fork hook
  io_uring/bpf_filter: add ref counts to struct io_bpf_filter
  io_uring/bpf_filter: cache lookup table in ctx->bpf_filters
  io_uring/bpf_filter: allow filtering on contents of struct open_how
  io_uring/net: allow filtering on IORING_OP_SOCKET data
  io_uring: add support for BPF filtering for opcode restrictions
2026-02-09 17:31:17 -08:00
Linus Torvalds
f5d4feed17 Merge tag 'for-7.0/io_uring-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:

 - Clean up the IORING_SETUP_R_DISABLED and submitter task checking,
   mostly just in preparation for relaxing the locking for SINGLE_ISSUER
   in the future.

 - Improve IOPOLL by using a doubly linked list to manage completions.

   Previously it was singly listed, which meant that to complete request
   N in the chain 0..N-1 had to have completed first. With a doubly
   linked list we can complete whatever request completes in that order,
   rather than need to wait for a consecutive range to be available.
   This reduces latencies.

 - Improve the restriction setup and checking. Mostly in preparation for
   adding further features on top of that. Coming in a separate pull
   request.

 - Split out task_work and wait handling into separate files. These are
   mostly nicely abstracted already, but still remained in the
   io_uring.c file which is on the larger side.

 - Use GFP_KERNEL_ACCOUNT in a few more spots, where appropriate.

 - Ensure even the idle io-wq worker exits if a task no longer has any
   rings open.

 - Add support for a non-circular submission queue.

   By default, the SQ ring keeps moving around, even if only a few
   entries are used for each submission. This can be wasteful in terms
   of cachelines.

   If IORING_SETUP_SQ_REWIND is set for the ring when created, each
   submission will start at offset 0 instead of where we last left off
   doing submissions.

 - Various little cleanups

* tag 'for-7.0/io_uring-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (30 commits)
  io_uring/kbuf: fix memory leak if io_buffer_add_list fails
  io_uring: Add SPDX id lines to remaining source files
  io_uring: allow io-wq workers to exit when unused
  io_uring/io-wq: add exit-on-idle state
  io_uring/net: don't continue send bundle if poll was required for retry
  io_uring/rsrc: use GFP_KERNEL_ACCOUNT consistently
  io_uring/futex: use GFP_KERNEL_ACCOUNT for futex data allocation
  io_uring/io-wq: handle !sysctl_hung_task_timeout_secs
  io_uring: fix bad indentation for setup flags if statement
  io_uring/rsrc: take unsigned index in io_rsrc_node_lookup()
  io_uring: introduce non-circular SQ
  io_uring: split out CQ waiting code into wait.c
  io_uring: split out task work code into tw.c
  io_uring/io-wq: don't trigger hung task for syzbot craziness
  io_uring: add IO_URING_EXIT_WAIT_MAX definition
  io_uring/sync: validate passed in offset
  io_uring/eventfd: remove unused ctx->evfd_last_cq_tail member
  io_uring/timeout: annotate data race in io_flush_timeouts()
  io_uring/uring_cmd: explicitly disallow cancelations for IOPOLL
  io_uring: fix IOPOLL with passthrough I/O
  ...
2026-02-09 17:22:00 -08:00
Linus Torvalds
26c9342bb7 Merge tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs 'struct filename' updates from Al Viro:
 "[Mostly] sanitize struct filename handling"

* tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (68 commits)
  sysfs(2): fs_index() argument is _not_ a pathname
  alpha: switch osf_mount() to strndup_user()
  ksmbd: use CLASS(filename_kernel)
  mqueue: switch to CLASS(filename)
  user_statfs(): switch to CLASS(filename)
  statx: switch to CLASS(filename_maybe_null)
  quotactl_block(): switch to CLASS(filename)
  chroot(2): switch to CLASS(filename)
  move_mount(2): switch to CLASS(filename_maybe_null)
  namei.c: switch user pathname imports to CLASS(filename{,_flags})
  namei.c: convert getname_kernel() callers to CLASS(filename_kernel)
  do_f{chmod,chown,access}at(): use CLASS(filename_uflags)
  do_readlinkat(): switch to CLASS(filename_flags)
  do_sys_truncate(): switch to CLASS(filename)
  do_utimes_path(): switch to CLASS(filename_uflags)
  chdir(2): unspaghettify a bit...
  do_fchownat(): unspaghettify a bit...
  fspick(2): use CLASS(filename_flags)
  name_to_handle_at(): use CLASS(filename_uflags)
  vfs_open_tree(): use CLASS(filename_uflags)
  ...
2026-02-09 16:58:28 -08:00
Jens Axboe
ed82f35b92 io_uring: allow registration of per-task restrictions
Currently io_uring supports restricting operations on a per-ring basis.
To use those, the ring must be setup in a disabled state by setting
IORING_SETUP_R_DISABLED. Then restrictions can be set for the ring, and
the ring can then be enabled.

This commit adds support for IORING_REGISTER_RESTRICTIONS with ring_fd
== -1, like the other "blind" register opcodes which work on the task
rather than a specific ring. This allows registration of the same kind
of restrictions as can been done on a specific ring, but with the task
itself. Once done, any ring created will inherit these restrictions.

If a restriction filter is registered with a task, then it's inherited
on fork for its children. Children may only further restrict operations,
not extend them.

Inheriting restrictions include both the classic
IORING_REGISTER_RESTRICTIONS based restrictions, as well as the BPF
filters that have been registered with the task via
IORING_REGISTER_BPF_FILTER.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-06 07:29:19 -07:00
Jens Axboe
9fd99788f3 io_uring: add task fork hook
Called when copy_process() is called to copy state to a new child.
Right now this is just a stub, but will be used shortly to properly
handle fork'ing of task based io_uring restrictions.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-06 07:29:14 -07:00