linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-03-22 07:27:12 +08:00

Author	SHA1	Message	Date
Linus Torvalds	92f778a0b1	Merge tag 'io_uring-6.19-20260205' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Two small fixes for zcrx - Two small fixes for fdinfo - one is just killing a superflous newline * tag 'io_uring-6.19-20260205' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/fdinfo: be a bit nicer when looping a lot of SQEs/CQEs io_uring/fdinfo: kill unnecessary newline feed in CQE32 printing io_uring/zcrx: fix rq flush locking io_uring/zcrx: fix page array leak	2026-02-05 14:40:06 -08:00
Jens Axboe	442ae40660	io_uring/kbuf: fix memory leak if io_buffer_add_list fails io_register_pbuf_ring() ignores the return value of io_buffer_add_list(), which can fail if xa_store() returns an error (e.g., -ENOMEM). When this happens, the function returns 0 (success) to the caller, but the io_buffer_list structure is neither added to the xarray nor freed. In practice this requires failure injection to hit, hence not a real issue. But it should get fixed up none the less. Fixes: `c7fb19428d` ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-05 11:13:16 -07:00
Tim Bird	ccd18ce290	io_uring: Add SPDX id lines to remaining source files Some io_uring files are missing SPDX-License-Identifier lines. Add lines with GPL-2.0 license IDs to these files. Signed-off-by: Tim Bird <tim.bird@sony.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-04 07:23:45 -07:00
Jens Axboe	38cfdd9dd2	io_uring/fdinfo: be a bit nicer when looping a lot of SQEs/CQEs Add cond_resched() in those dump loops, just in case a lot of entries are being dumped. And detect invalid CQ ring head/tail entries, to avoid iterating more than what is necessary. Generally not an issue, but can be if things like KASAN or other debugging metrics are enabled. Reported-by: 是参差 <shicenci@gmail.com> Link: https://lore.kernel.org/all/PS1PPF7E1D7501FE5631002D242DD89403FAB9BA@PS1PPF7E1D7501F.apcprd02.prod.outlook.com/ Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-03 10:58:32 -07:00
Jens Axboe	b1dfe4e0fc	io_uring/fdinfo: kill unnecessary newline feed in CQE32 printing There's an unconditional newline feed anyway after dumping both normal and big CQE contents, remove the \n from the CQE32 extra1/extra2 printing. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-03 10:00:39 -07:00
Pavel Begunkov	af07330e28	io_uring/zcrx: fix rq flush locking zcrx needs to keep the rq lock for uref manipulations, for now move all zcrx_return_buffers() under the lock. Fixes: `475eb39b00` ("io_uring/zcrx: add sync refill queue flushing") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-02 08:19:43 -07:00
Pavel Begunkov	0ae91d8ab7	io_uring/zcrx: fix page array leak `d9f595b9a6` ("io_uring/zcrx: fix leaking pages on sg init fail") fixed a page leakage but didn't free the page array, release it as well. Fixes: `b84621d96e` ("io_uring/zcrx: allocate sgtable for umem areas") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-02 08:19:35 -07:00
Li Chen	9121466148	io_uring: allow io-wq workers to exit when unused io_uring keeps a per-task io-wq around, even when the task no longer has any io_uring instances. If the task previously used io_uring for file I/O, this can leave an unrelated iou-wrk-* worker thread behind after the last io_uring instance is gone. When the last io_uring ctx is removed from the task context, mark the io-wq exit-on-idle so workers can go away. Clear the flag on subsequent io_uring usage. Signed-off-by: Li Chen <me@linux.beauty> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-02 08:11:42 -07:00
Li Chen	38aa434ab9	io_uring/io-wq: add exit-on-idle state io-wq uses an idle timeout to shrink the pool, but keeps the last worker around indefinitely to avoid churn. For tasks that used io_uring for file I/O and then stop using io_uring, this can leave an iou-wrk-* thread behind even after all io_uring instances are gone. This is unnecessary overhead and also gets in the way of process checkpoint/restore. Add an exit-on-idle state that makes all io-wq workers exit as soon as they become idle, and provide io_wq_set_exit_on_idle() to toggle it. Signed-off-by: Li Chen <me@linux.beauty> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-02 08:10:23 -07:00
Jens Axboe	806ae939c4	io_uring/net: don't continue send bundle if poll was required for retry If a send bundle has picked a bunch of buffers, then it needs to send all of those to be complete. This may require poll arming, if the send buffer ends up being full. Once a send bundle has been poll armed, no further bundles should be attempted. This allows a current bundle to complete even though it needs to go through polling to do so, but it will not allow another bundle to be started once that has happened. Ideally we would abort a bundle if it was only partially sent, but as some parts of it already went out on the wire, this obviously isn't feasible. Not continuing more bundle attempts post encountering a full socket buffer is the second best thing. Cc: stable@vger.kernel.org Fixes: `a05d1f625c` ("io_uring/net: support bundles for send") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-27 21:06:28 -07:00
Jens Axboe	e7f67c2be7	io_uring/bpf_filter: add ref counts to struct io_bpf_filter In preparation for allowing inheritance of BPF filters and filter tables, add a reference count to the filter. This allows multiple tables to safely include the same filter. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-27 11:10:46 -07:00
Jens Axboe	e7c30675a7	io_uring/bpf_filter: cache lookup table in ctx->bpf_filters Currently a few pointer dereferences need to be made to both check if BPF filters are installed, and then also to retrieve the actual filter for the opcode. Cache the table in ctx->bpf_filters to avoid that. Add a bit of debug info on ring exit to show if we ever got this wrong. Small risk of that given that the table is currently only updated in one spot, but once task forking is enabled, that will add one more spot. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-27 11:10:46 -07:00
Jens Axboe	8768770cf5	io_uring/bpf_filter: allow filtering on contents of struct open_how This adds custom filtering for IORING_OP_OPENAT and IORING_OP_OPENAT2, where the open_how flags, mode, and resolve can be checked by filters. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-27 11:10:46 -07:00
Jens Axboe	cff1c26b42	io_uring/net: allow filtering on IORING_OP_SOCKET data Example population method for the BPF based opcode filtering. This exposes the socket family, type, and protocol to a registered BPF filter. This in turn enables the filter to make decisions based on what was passed in to the IORING_OP_SOCKET request type. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-27 11:10:46 -07:00
Jens Axboe	d42eb05e60	io_uring: add support for BPF filtering for opcode restrictions Add support for loading classic BPF programs with io_uring to provide fine-grained filtering of SQE operations. Unlike IORING_REGISTER_RESTRICTIONS which only allows bitmap-based allow/deny of opcodes, BPF filters can inspect request attributes and make dynamic decisions. The filter is registered via IORING_REGISTER_BPF_FILTER with a struct io_uring_bpf: struct io_uring_bpf_filter { __u32 opcode; /* io_uring opcode to filter / __u32 flags; __u32 filter_len; / number of BPF instructions / __u32 resv; __u64 filter_ptr; / pointer to BPF filter / __u64 resv2[5]; }; enum { IO_URING_BPF_CMD_FILTER = 1, }; struct io_uring_bpf { __u16 cmd_type; / IO_URING_BPF_* values / __u16 cmd_flags; / none so far */ __u32 resv; union { struct io_uring_bpf_filter filter; }; }; and the filters get supplied a struct io_uring_bpf_ctx: struct io_uring_bpf_ctx { __u64 user_data; __u8 opcode; __u8 sqe_flags; __u8 pdu_size; __u8 pad[5]; }; where it's possible to filter on opcode and sqe_flags, with pdu_size indicating how much extra data is being passed in beyond the pad field. This will used for specific finer grained filtering inside an opcode. An example of that for sockets is in one of the following patches. Anything the opcode supports can end up in this struct, populated by the opcode itself, and hence can be filtered for. Filters have the following semantics: - Return 1 to allow the request - Return 0 to deny the request with -EACCES - Multiple filters can be stacked per opcode. All filters must return 1 for the opcode to be allowed. - Filters are evaluated in registration order (most recent first) The implementation uses classic BPF (cBPF) rather than eBPF for as that's required for containers, and since they can be used by any user in the system. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-27 11:09:57 -07:00
Jens Axboe	e26f51f6f6	io_uring/rsrc: use GFP_KERNEL_ACCOUNT consistently For potential long term allocations, ensure that we play nicer with memcg and use the accounting variant of the GFP_KERNEL allocation. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-25 10:07:35 -07:00
Jens Axboe	6e0d71c288	io_uring/futex: use GFP_KERNEL_ACCOUNT for futex data allocation Be a bit nicer and ensure it plays nice with memcg accounting. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-25 10:07:09 -07:00
Pavel Begunkov	795663b4d1	io_uring/zcrx: implement large rx buffer support There are network cards that support receive buffers larger than 4K, and that can be vastly beneficial for performance, and benchmarks for this patch showed up to 30% CPU util improvement for 32K vs 4K buffers. Allows zcrx users to specify the size in struct io_uring_zcrx_ifq_reg::rx_buf_len. If set to zero, zcrx will use a default value. zcrx will check and fail if the memory backing the area can't be split into physically contiguous chunks of the required size. It's more restrictive as it only needs dma addresses to be contig, but that's beyond this series. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: kill duplicate netdev_queues.h include] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-24 08:33:03 -07:00
Jens Axboe	816095894c	io_uring/io-wq: handle !sysctl_hung_task_timeout_secs If the hung_task_timeout sysctl is set to 0, then we'll end up busy looping inside io_wq_exit_workers() after an earlier commit switched to using wait_for_completion_timeout(). Use the maximum schedule timeout value for that case. Fixes: `1f293098a3` ("io_uring/io-wq: don't trigger hung task for syzbot craziness") Reported-by: Chris Mason <clm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-23 13:58:03 -07:00
Linus Torvalds	7907f673d0	Merge tag 'io_uring-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Fix for a potential leak of an iovec, if a specific cleanup path is used and the rw_cache is full at the time of the call - Fix for a regression added in this cycle, where waitid should be using prober release/acquire semantics for updating the wait queue head - Check for the cancelation bit being set for every work item processed by io-wq, not just at the start of the loop. Has no real practical implications other than to shut up syzbot doing crazy things that grossly overload a system, hence slowing down ring exit - A few selftest additions, updating the mini_liburing that selftests use * tag 'io_uring-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: selftests/io_uring: support NO_SQARRAY in miniliburing selftests/io_uring: add io_uring_queue_init_params io_uring/io-wq: check IO_WQ_BIT_EXIT inside work run loop io_uring/waitid: fix KCSAN warning on io_waitid->head io_uring/rw: free potentially allocated iovec on cache put failure	2026-01-23 12:51:00 -08:00
Jens Axboe	1edf0891d0	io_uring: fix bad indentation for setup flags if statement smatch complains about this: smatch warnings: io_uring/io_uring.c:2741 io_uring_sanitise_params() warn: if statement not indented hence fix it up. Link: https://lore.kernel.org/all/202601231651.HeTmPS8C-lkp@intel.com/ Fixes: `5247c034a6` ("io_uring: introduce non-circular SQ") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202601231651.HeTmPS8C-lkp@intel.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-23 05:09:08 -07:00
Caleb Sander Mateos	82dadc8a49	io_uring/rsrc: take unsigned index in io_rsrc_node_lookup() io_rsrc_node_lookup() takes a signed int index as input and compares it to an unsigned length. Since the signed int is implicitly cast to an unsigned int for the comparison and the length is bounded by IORING_MAX_FIXED_FILES/IORING_MAX_REG_BUFFERS, negative indices are already rejected on architectures where int is at least 32 bits. Make this a bit clearer and avoid compiler warnings for comparisons of signed and unsigned values by taking an unsigned int index instead. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 15:58:17 -07:00
Pavel Begunkov	5247c034a6	io_uring: introduce non-circular SQ Outside of SQPOLL, normally SQ entries are consumed by the time the submission syscall returns. For those cases we don't need a circular buffer and the head/tail tracking, instead the kernel can assume that entries always start from the beginning of the SQ at index 0. This patch introduces a setup flag doing exactly that. It's a simpler and helps to keeps SQEs hot in cache. The feature is optional and enabled by setting IORING_SETUP_SQ_REWIND. The flag is rejected if passed together with SQPOLL as it'd require waiting for SQ before each submission. It also requires IORING_SETUP_NO_SQARRAY, which can be supported but it's unlikely there will be users, so leave more space for future optimisations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 15:47:23 -07:00
Jens Axboe	0105b0562a	io_uring: split out CQ waiting code into wait.c Move the completion queue waiting and scheduling code out of io_uring.c into a dedicated wait.c file. This further removes code out of the main io_uring C and header file, and into a topical new file. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 09:21:16 -07:00
Jens Axboe	7642e66860	io_uring: split out task work code into tw.c Move the task work handling code out of io_uring.c into a new tw.c file. This includes the local work, normal work, and fallback work handling infrastructure. The associated tw.h header contains io_should_terminate_tw() as a static inline helper, along with the necessary function declarations. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 09:20:17 -07:00
Jens Axboe	1f293098a3	io_uring/io-wq: don't trigger hung task for syzbot craziness Use the same trick that blk_io_schedule() does to avoid triggering the hung task warning (and potential reboot/panic, depending on system settings), and only wait for half the hung task timeout at the time. If we exceed the default IO_URING_EXIT_WAIT_MAX period where we expect things to certainly have finished unless there's a bug, then throw a WARN_ON_ONCE() for that case. Reported-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com Tested-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 07:25:35 -07:00
Jens Axboe	dd120bddc4	io_uring: add IO_URING_EXIT_WAIT_MAX definition Add the timeout we normally wait before complaining about things being stuck waiting for cancelations to complete as a define, and use it in io_ring_exit_work(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 07:25:30 -07:00
Jens Axboe	649dd18f55	io_uring/sync: validate passed in offset Check if the passed in offset is negative once cast to sync->off. This ensures that -EINVAL is returned for that case, like it would be for sync_file_range(2). Fixes: `c992fe2925` ("io_uring: add fsync support") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-21 11:50:59 -07:00
Ming Lei	f7bc22ca0d	nvme/io_uring: optimize IOPOLL completions for local ring context When multiple io_uring rings poll on the same NVMe queue, one ring can find completions belonging to another ring. The current code always uses task_work to handle this, but this adds overhead for the common single-ring case. This patch passes the polling io_ring_ctx through io_comp_batch's new poll_ctx field. In io_do_iopoll(), the polling ring's context is stored in iob.poll_ctx before calling the iopoll callbacks. In nvme_uring_cmd_end_io(), we now compare iob->poll_ctx with the request's owning io_ring_ctx (via io_uring_cmd_ctx_handle()). If they match (local context), we complete inline with io_uring_cmd_done32(). If they differ (remote context) or iob is NULL (non-iopoll path), we use task_work as before. This optimization eliminates task_work scheduling overhead for the common case where a ring polls and finds its own completions. ~10% IOPS improvement is observed in the following benchmark: fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -O0 -P1 -u1 -n1 /dev/ng0n1 Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-20 10:18:01 -07:00
Jens Axboe	42b12cb5fd	io_uring/timeout: annotate data race in io_flush_timeouts() syzbot correctly reports this as a KCSAN race, as ctx->cached_cq_tail should be read under ->uring_lock. This isn't immediately feasible in io_flush_timeouts(), but as long as we read a stable value, that should be good enough. If two io-wq threads compete on this value, then they will both end up calling io_flush_timeouts() and at least one of them will see the correct value. Reported-by: syzbot+6c48db7d94402407301e@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-20 09:54:17 -07:00
Jens Axboe	10dc959398	io_uring/io-wq: check IO_WQ_BIT_EXIT inside work run loop Currently this is checked before running the pending work. Normally this is quite fine, as work items either end up blocking (which will create a new worker for other items), or they complete fairly quickly. But syzbot reports an issue where io-wq takes seemingly forever to exit, and with a bit of debugging, this turns out to be because it queues a bunch of big (2GB - 4096b) reads with a /dev/msr* file. Since this file type doesn't support ->read_iter(), loop_rw_iter() ends up handling them. Each read returns 16MB of data read, which takes 20 (!!) seconds. With a bunch of these pending, processing the whole chain can take a long time. Easily longer than the syzbot uninterruptible sleep timeout of 140 seconds. This then triggers a complaint off the io-wq exit path: INFO: task syz.4.135:6326 blocked for more than 143 seconds. Not tainted syzkaller #0 Blocked by coredump. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz.4.135 state:D stack:26824 pid:6326 tgid:6324 ppid:5957 task_flags:0x400548 flags:0x00080000 Call Trace: <TASK> context_switch kernel/sched/core.c:5256 [inline] __schedule+0x1139/0x6150 kernel/sched/core.c:6863 __schedule_loop kernel/sched/core.c:6945 [inline] schedule+0xe7/0x3a0 kernel/sched/core.c:6960 schedule_timeout+0x257/0x290 kernel/time/sleep_timeout.c:75 do_wait_for_common kernel/sched/completion.c:100 [inline] __wait_for_common+0x2fc/0x4e0 kernel/sched/completion.c:121 io_wq_exit_workers io_uring/io-wq.c:1328 [inline] io_wq_put_and_exit+0x271/0x8a0 io_uring/io-wq.c:1356 io_uring_clean_tctx+0x10d/0x190 io_uring/tctx.c:203 io_uring_cancel_generic+0x69c/0x9a0 io_uring/cancel.c:651 io_uring_files_cancel include/linux/io_uring.h:19 [inline] do_exit+0x2ce/0x2bd0 kernel/exit.c:911 do_group_exit+0xd3/0x2a0 kernel/exit.c:1112 get_signal+0x2671/0x26d0 kernel/signal.c:3034 arch_do_signal_or_restart+0x8f/0x7e0 arch/x86/kernel/signal.c:337 __exit_to_user_mode_loop kernel/entry/common.c:41 [inline] exit_to_user_mode_loop+0x8c/0x540 kernel/entry/common.c:75 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline] syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline] syscall_exit_to_user_mode_work include/linux/entry-common.h:159 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:194 [inline] do_syscall_64+0x4ee/0xf80 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fa02738f749 RSP: 002b:00007fa0281ae0e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca RAX: fffffffffffffe00 RBX: 00007fa0275e6098 RCX: 00007fa02738f749 RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007fa0275e6098 RBP: 00007fa0275e6090 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fa0275e6128 R14: 00007fff14e4fcb0 R15: 00007fff14e4fd98 There's really nothing wrong here, outside of processing these reads will take a LONG time. However, we can speed up the exit by checking the IO_WQ_BIT_EXIT inside the io_worker_handle_work() loop, as syzbot will exit the ring after queueing up all of these reads. Then once the first item is processed, io-wq will simply cancel the rest. That should avoid syzbot running into this complaint again. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/68a2decc.050a0220.e29e5.0099.GAE@google.com/ Reported-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-20 07:55:59 -07:00
Jens Axboe	b994ace83a	io_uring/waitid: fix KCSAN warning on io_waitid->head Storing of the iw->head entry inside the wait_queue callback, or when removing a waitid item, really should use proper load/store acquire/release semantics, and KCSAN correctly warns of that. Ensure that they do so. Reported-by: syzbot+eb441775f4f948a0902f@syzkaller.appspotmail.com Fixes: `a48c0cbf28` ("io_uring/waitid: have io_waitid_complete() remove wait queue entry") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-19 19:55:30 -07:00
Jens Axboe	4b97480554	io_uring/rw: free potentially allocated iovec on cache put failure If a read/write request goes through io_req_rw_cleanup() and has an allocated iovec attached and fails to put to the rw_cache, then it may end up with an unaccounted iovec pointer. Have io_rw_recycle() return whether it recycled the request or not, and use that to gauge whether to free a potential iovec or not. Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-19 06:59:06 -07:00
Linus Torvalds	216c7a0326	Merge tag 'io_uring-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fix from Jens Axboe: "Just a single fix moving local task_work inside the cancelation loop, rather than only before cancelations. If any cancelations generate task_work, we do need to re-run it" * tag 'io_uring-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring: move local task_work in exit cancel loop	2026-01-16 20:56:56 -08:00
Al Viro	5b9d406ff7	filename_...xattr(): don't consume filename reference Callers switched to CLASS(filename_maybe_null) (in fs/xattr.c) and CLASS(filename_complete_delayed) (in io_uring/xattr.c). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:52:03 -05:00
Al Viro	e50aae1d39	non-consuming variants of do_{unlinkat,rmdir}() similar to previous commit; replacements are filename_{unlinkat,rmdir}() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:51:50 -05:00
Al Viro	dc912db15a	non-consuming variant of do_mkdirat() similar to previous commit; replacement is filename_mkdirat() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:48:49 -05:00
Al Viro	da72b76aae	non-consuming variant of do_symlinkat() similar to previous commit; replacement is filename_symlinkat() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:48:16 -05:00
Al Viro	037193b0ae	non-consuming variant of do_linkat() similar to previous commit; replacement is filename_linkat() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:47:42 -05:00
Al Viro	e6d50234cc	non-consuming variant of do_renameat2() filename_renameat2() replaces do_renameat2(); unlike the latter, it does not drop filename references - these days it can be just as easily arranged in the caller. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-16 12:46:57 -05:00
Jens Axboe	8661d0b142	io_uring/uring_cmd: explicitly disallow cancelations for IOPOLL This currently isn't supported, and due to a recent commit, it also cannot easily be supported by io_uring due to hash_node and IOPOLL completion data overlapping. This can be revisited if we ever do support cancelations of requests that have gone to the block stack. Suggested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-14 22:04:11 -07:00
Jens Axboe	697a5284ad	io_uring: fix IOPOLL with passthrough I/O A previous commit improving IOPOLL made an incorrect assumption that task_work isn't used with IOPOLL. This can cause crashes when doing passthrough I/O on nvme, where queueing the completion task_work will trample on the same memory that holds the completed list of requests. Fix it up by shuffling the members around, so we're not sharing any parts that end up getting used in this path. Fixes: `3c7d76d612` ("io_uring: IOPOLL polling improvements") Reported-by: Yi Zhang <yi.zhang@redhat.com> Link: https://lore.kernel.org/linux-block/CAHj4cs_SLPj9v9w5MgfzHKy+983enPx3ZQY2kMuMJ1202DBefw@mail.gmail.com/ Tested-by: Yi Zhang <yi.zhang@redhat.com> Cc: Ming Lei <ming.lei@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-14 22:03:49 -07:00
Ming Lei	da579f05ef	io_uring: move local task_work in exit cancel loop With IORING_SETUP_DEFER_TASKRUN, task work is queued to ctx->work_llist (local work) rather than the fallback list. During io_ring_exit_work(), io_move_task_work_from_local() was called once before the cancel loop, moving work from work_llist to fallback_llist. However, task work can be added to work_llist during the cancel loop itself. There are two cases: 1) io_kill_timeouts() is called from io_uring_try_cancel_requests() to cancel pending timeouts, and it adds task work via io_req_queue_tw_complete() for each cancelled timeout: 2) URING_CMD requests like ublk can be completed via io_uring_cmd_complete_in_task() from ublk_queue_rq() during canceling, given ublk request queue is only quiesced when canceling the 1st uring_cmd. Since io_allowed_defer_tw_run() returns false in io_ring_exit_work() (kworker != submitter_task), io_run_local_work() is never invoked, and the work_llist entries are never processed. This causes io_uring_try_cancel_requests() to loop indefinitely, resulting in 100% CPU usage in kworker threads. Fix this by moving io_move_task_work_from_local() inside the cancel loop, ensuring any work on work_llist is moved to fallback before each cancel attempt. Cc: stable@vger.kernel.org Fixes: `c0e0d6ba25` ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-14 10:18:19 -07:00
Al Viro	541003b576	rename do_filp_open() to do_file_open() "filp" thing never made sense; seeing that there are exactly 4 callers in the entire tree (and it's neither exported nor even declared in linux//.h), there's no point keeping that ugliness. FWIW, the 'filp' thing did originate in OSD&I; for some reason Tanenbaum decided to call the object representing an opened file 'struct filp', the last letter standing for 'position'. In all Unices, Linux included, the corresponding object had always been 'struct file'... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-13 15:18:07 -05:00
Al Viro	9fa3ec8458	allow incomplete imports of filenames There are two filename-related problems in io_uring and its interplay with audit. Filenames are imported when request is submitted and used when it is processed. Unfortunately, the latter may very well happen in a different thread. In that case the reference to filename is put into the wrong audit_context - that of submitting thread, not the processing one. Audit logics is called by the latter, and it really wants to be able to find the names in audit_context current (== processing) thread. Another related problem is the headache with refcounts - normally all references to given struct filename are visible only to one thread (the one that uses that struct filename). io_uring violates that - an extra reference is stashed in audit_context of submitter. It gets dropped when submitter returns to userland, which can happen simultaneously with processing thread deciding to drop the reference it got. We paper over that by making refcount atomic, but that means pointless headache for everyone. Solution: the notion of partially imported filenames. Namely, already copied from userland, but not exposed to audit yet. io_uring can create that in submitter thread, and complete the import (obtaining the usual reference to struct filename) in processing thread. Object: struct delayed_filename. Primitives for working with it: delayed_getname(&delayed_filename, user_string) - copies the name from userland, returning 0 and stashing the address of (still incomplete) struct filename in delayed_filename on success and returning -E... on error. delayed_getname_uflags(&delayed_filename, user_string, atflags) - similar, in the same relation to delayed_getname() as getname_uflags() is to getname() complete_getname(&delayed_filename) - completes the import of filename stashed in delayed_filename and returns struct filename to caller, emptying delayed_filename. CLASS(filename_complete_delayed, name)(&delayed_filename) - variant of CLASS(filename) with complete_getname() for constructor. dismiss_delayed_filename(&delayed_filename) - destructor; drops whatever might be stashed in delayed_filename, emptying it. putname_to_delayed(&delayed_filename, name) - if name is shared, stashes its copy into delayed_filename and drops the reference to name, otherwise stashes the name itself in there. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-13 15:18:07 -05:00
Jens Axboe	d6406c45f1	io_uring: track restrictions separately for IORING_OP and IORING_REGISTER It's quite likely that only register opcode restrictions exists, in which case we'd never need to check the normal opcodes. Split ctx->restricted into two separate fields, one for I/O opcodes, and one for register opcodes. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-13 10:31:48 -07:00
Jens Axboe	991fb85a1d	io_uring: move ctx->restricted check into io_check_restriction() Just a cleanup, makes the code easier to read without too many dependent nested checks. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-13 10:31:48 -07:00
Jens Axboe	09bd84421d	io_uring/register: set ctx->restricted when restrictions are parsed Rather than defer this until the rings are enabled, just set it upfront when the restrictions are parsed and enabled anyway. There's no reason to defer this setting until the rings are enabled. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-13 10:31:48 -07:00
Jens Axboe	e6ed0f051d	io_uring/register: have io_parse_restrictions() set restrictions enabled Rather than leave this to the caller, have io_parse_restrictions() set ->registered = true if restrictions have been enabled. This is in preparation for having finer grained restrictions. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-13 10:30:41 -07:00
Jens Axboe	51fff55a66	io_uring/register: have io_parse_restrictions() return number of ops Rather than return 0 on success, return >= 0 for success, where the return value is that number of parsed entries. As before, any < 0 return is an error. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-13 10:30:41 -07:00

1 2 3 4 5 ...

2031 Commits