Commit Graph

50905 Commits

Author SHA1 Message Date
Kees Cook
189f164e57 Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses
Conversion performed via this Coccinelle script:

  // SPDX-License-Identifier: GPL-2.0-only
  // Options: --include-headers-for-types --all-includes --include-headers --keep-comments
  virtual patch

  @gfp depends on patch && !(file in "tools") && !(file in "samples")@
  identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
 		    kzalloc_obj,kzalloc_objs,kzalloc_flex,
		    kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
		    kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
  @@

  	ALLOC(...
  -		, GFP_KERNEL
  	)

  $ make coccicheck MODE=patch COCCI=gfp.cocci

Build and boot tested x86_64 with Fedora 42's GCC and Clang:

Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01

Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22 08:26:33 -08:00
Linus Torvalds
32a92f8c89 Convert more 'alloc_obj' cases to default GFP_KERNEL arguments
This converts some of the visually simpler cases that have been split
over multiple lines.  I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.

Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script.  I probably had made it a bit _too_ trivial.

So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.

The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 20:03:00 -08:00
Linus Torvalds
323bbfcf1e Convert 'alloc_flex' family to use the new default GFP_KERNEL argument
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.

As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Linus Torvalds
68010e7b3d Merge tag 'trace-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:

 - Fix possible dereference of uninitialized pointer

   When validating the persistent ring buffer on boot up, if the first
   validation fails, a reference to "head_page" is performed in the
   error path, but it skips over the initialization of that variable.
   Move the initialization before the first validation check.

 - Fix use of event length in validation of persistent ring buffer

   On boot up, the persistent ring buffer is checked to see if it is
   valid by several methods. One being to walk all the events in the
   memory location to make sure they are all valid. The length of the
   event is used to move to the next event. This length is determined by
   the data in the buffer. If that length is corrupted, it could
   possibly make the next event to check located at a bad memory
   location.

   Validate the length field of the event when doing the event walk.

 - Fix function graph on archs that do not support use of ftrace_ops

   When an architecture defines HAVE_DYNAMIC_FTRACE_WITH_ARGS, it means
   that its function graph tracer uses the ftrace_ops of the function
   tracer to call its callbacks. This allows a single registered
   callback to be called directly instead of checking the callback's
   meta data's hash entries against the function being traced.

   For architectures that do not support this feature, it must always
   call the loop function that tests each registered callback (even if
   there's only one). The loop function tests each callback's meta data
   against its hash of functions and will call its callback if the
   function being traced is in its hash map.

   The issue was that there was no check against this and the direct
   function was being called even if the architecture didn't support it.
   This meant that if function tracing was enabled at the same time as a
   callback was registered with the function graph tracer, its callback
   would be called for every function that the function tracer also
   traced, even if the callback's meta data only wanted to be called
   back for a small subset of functions.

   Prevent the direct calling for those architectures that do not
   support it.

 - Fix references to trace_event_file for hist files

   The hist files used event_file_data() to get a reference to the
   associated trace_event_file the histogram was attached to. This would
   return a pointer even if the trace_event_file is about to be freed
   (via RCU). Instead it should use the event_file_file() helper that
   returns NULL if the trace_event_file is marked to be freed so that no
   new references are added to it.

 - Wake up hist poll readers when an event is being freed

   When polling on a hist file, the task is only awoken when a hist
   trigger is triggered. This means that if an event is being freed
   while there's a task waiting on its hist file, it will need to wait
   until the hist trigger occurs to wake it up and allow the freeing to
   happen. Note, the event will not be completely freed until all
   references are removed, and a hist poller keeps a reference. But it
   should still be woken when the event is being freed.

* tag 'trace-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Wake up poll waiters for hist files when removing an event
  tracing: Fix checking of freed trace_event_file for hist files
  fgraph: Do not call handlers direct when not using ftrace_ops
  tracing: ring-buffer: Fix to check event length before using
  ring-buffer: Fix possible dereference of uninitialized pointer
2026-02-20 15:05:26 -08:00
Petr Pavlu
9678e53179 tracing: Wake up poll waiters for hist files when removing an event
The event_hist_poll() function attempts to verify whether an event file is
being removed, but this check may not occur or could be unnecessarily
delayed. This happens because hist_poll_wakeup() is currently invoked only
from event_hist_trigger() when a hist command is triggered. If the event
file is being removed, no associated hist command will be triggered and a
waiter will be woken up only after an unrelated hist command is triggered.

Fix the issue by adding a call to hist_poll_wakeup() in
remove_event_file_dir() after setting the EVENT_FILE_FL_FREED flag. This
ensures that a task polling on a hist file is woken up and receives
EPOLLERR.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-3-petr.pavlu@suse.com
Fixes: 1bd13edbbe ("tracing/hist: Add poll(POLLIN) support on hist file")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19 15:25:11 -05:00
Petr Pavlu
f0a0da1f90 tracing: Fix checking of freed trace_event_file for hist files
The event_hist_open() and event_hist_poll() functions currently retrieve
a trace_event_file pointer from a file struct by invoking
event_file_data(), which simply returns file->f_inode->i_private. The
functions then check if the pointer is NULL to determine whether the event
is still valid. This approach is flawed because i_private is assigned when
an eventfs inode is allocated and remains set throughout its lifetime.
Instead, the code should call event_file_file(), which checks for
EVENT_FILE_FL_FREED. Using the incorrect access function may result in the
code potentially opening a hist file for an event that is being removed or
becoming stuck while polling on this file.

Correct the access method to event_file_file() in both functions.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-2-petr.pavlu@suse.com
Fixes: 1bd13edbbe ("tracing/hist: Add poll(POLLIN) support on hist file")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19 15:23:49 -05:00
Steven Rostedt
f4ff9f646a fgraph: Do not call handlers direct when not using ftrace_ops
The function graph tracer was modified to us the ftrace_ops of the
function tracer. This simplified the code as well as allowed more features
of the function graph tracer.

Not all architectures were converted over as it required the
implementation of HAVE_DYNAMIC_FTRACE_WITH_ARGS to implement. For those
architectures, it still did it the old way where the function graph tracer
handle was called by the function tracer trampoline. The handler then had
to check the hash to see if the registered handlers wanted to be called by
that function or not.

In order to speed up the function graph tracer that used ftrace_ops, if
only one callback was registered with function graph, it would call its
function directly via a static call.

Now, if the architecture does not support the use of using ftrace_ops and
still has the ftrace function trampoline calling the function graph
handler, then by doing a direct call it removes the check against the
handler's hash (list of functions it wants callbacks to), and it may call
that handler for functions that the handler did not request calls for.

On 32bit x86, which does not support the ftrace_ops use with function
graph tracer, it shows the issue:

 ~# trace-cmd start -p function -l schedule
 ~# trace-cmd show
 # tracer: function_graph
 #
 # CPU  DURATION                  FUNCTION CALLS
 # |     |   |                     |   |   |   |
  2) * 11898.94 us |  schedule();
  3) # 1783.041 us |  schedule();
  1)               |  schedule() {
  ------------------------------------------
  1)   bash-8369    =>  kworker-7669
  ------------------------------------------
  1)               |        schedule() {
  ------------------------------------------
  1)  kworker-7669  =>   bash-8369
  ------------------------------------------
  1) + 97.004 us   |  }
  1)               |  schedule() {
 [..]

Now by starting the function tracer is another instance:

 ~# trace-cmd start -B foo -p function

This causes the function graph tracer to trace all functions (because the
function trace calls the function graph tracer for each on, and the
function graph trace is doing a direct call):

 ~# trace-cmd show
 # tracer: function_graph
 #
 # CPU  DURATION                  FUNCTION CALLS
 # |     |   |                     |   |   |   |
  1)   1.669 us    |          } /* preempt_count_sub */
  1) + 10.443 us   |        } /* _raw_spin_unlock_irqrestore */
  1)               |        tick_program_event() {
  1)               |          clockevents_program_event() {
  1)   1.044 us    |            ktime_get();
  1)   6.481 us    |            lapic_next_event();
  1) + 10.114 us   |          }
  1) + 11.790 us   |        }
  1) ! 181.223 us  |      } /* hrtimer_interrupt */
  1) ! 184.624 us  |    } /* __sysvec_apic_timer_interrupt */
  1)               |    irq_exit_rcu() {
  1)   0.678 us    |      preempt_count_sub();

When it should still only be tracing the schedule() function.

To fix this, add a macro FGRAPH_NO_DIRECT to be set to 0 when the
architecture does not support function graph use of ftrace_ops, and set to
1 otherwise. Then use this macro to know to allow function graph tracer to
call the handlers directly or not.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://patch.msgid.link/20260218104244.5f14dade@gandalf.local.home
Fixes: cc60ee813b ("function_graph: Use static_call and branch to optimize entry function")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19 15:21:22 -05:00
Masami Hiramatsu (Google)
912b0ee248 tracing: ring-buffer: Fix to check event length before using
Check the event length before adding it for accessing next index in
rb_read_data_buffer(). Since this function is used for validating
possibly broken ring buffers, the length of the event could be broken.
In that case, the new event (e + len) can point a wrong address.
To avoid invalid memory access at boot, check whether the length of
each event is in the possible range before using it.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Link: https://patch.msgid.link/177123421541.142205.9414352170164678966.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19 15:21:12 -05:00
Daniil Dulov
f154777940 ring-buffer: Fix possible dereference of uninitialized pointer
There is a pointer head_page in rb_meta_validate_events() which is not
initialized at the beginning of a function. This pointer can be dereferenced
if there is a failure during reader page validation. In this case the control
is passed to "invalid" label where the pointer is dereferenced in a loop.

To fix the issue initialize orig_head and head_page before calling
rb_validate_buffer.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Cc: stable@vger.kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260213100130.2013839-1-d.dulov@aladdin.ru
Closes: https://lore.kernel.org/r/202406130130.JtTGRf7W-lkp@intel.com/
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19 15:20:41 -05:00
Linus Torvalds
4f13d0dabc Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:

 - Fix invalid write loop logic in libbpf's bpf_linker__add_buf() (Amery
   Hung)

 - Fix a potential use-after-free of BTF object (Anton Protopopov)

 - Add feature detection to libbpf and avoid moving arena global
   variables on older kernels (Emil Tsalapatis)

 - Remove extern declaration of bpf_stream_vprintk() from libbpf headers
   (Ihor Solodrai)

 - Fix truncated netlink dumps in bpftool (Jakub Kicinski)

 - Fix map_kptr grace period wait in bpf selftests (Kumar Kartikeya
   Dwivedi)

 - Remove hexdump dependency while building bpf selftests (Matthieu
   Baerts)

 - Complete fsession support in BPF trampolines on riscv (Menglong Dong)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Remove hexdump dependency
  libbpf: Remove extern declaration of bpf_stream_vprintk()
  selftests/bpf: Use vmlinux.h in test_xdp_meta
  bpftool: Fix truncated netlink dumps
  libbpf: Delay feature gate check until object prepare time
  libbpf: Do not use PROG_TYPE_TRACEPOINT program for feature gating
  bpf: Add a map/btf from a fd array more consistently
  selftests/bpf: Fix map_kptr grace period wait
  selftests/bpf: enable fsession_test on riscv64
  selftests/bpf: Adjust selftest due to function rename
  bpf, riscv: add fsession support for trampolines
  bpf: Fix a potential use-after-free of BTF object
  bpf, riscv: introduce emit_store_stack_imm64() for trampoline
  libbpf: Fix invalid write loop logic in bpf_linker__add_buf()
  libbpf: Add gating for arena globals relocation feature
2026-02-19 10:36:54 -08:00
Linus Torvalds
2b7a25df82 Merge tag 'mm-nonmm-stable-2026-02-18-19-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more non-MM updates from Andrew Morton:

 - "two fixes in kho_populate()" fixes a couple of not-major issues in
   the kexec handover code (Ran Xiaokai)

 - misc singletons

* tag 'mm-nonmm-stable-2026-02-18-19-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  lib/group_cpus: handle const qualifier from clusters allocation type
  kho: remove unnecessary WARN_ON(err) in kho_populate()
  kho: fix missing early_memunmap() call in kho_populate()
  scripts/gdb: implement x86_page_ops in mm.py
  objpool: fix the overestimation of object pooling metadata size
  selftests/memfd: use IPC semaphore instead of SIGSTOP/SIGCONT
  delayacct: fix build regression on accounting tool
2026-02-18 21:40:16 -08:00
Linus Torvalds
eeccf287a2 Merge tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM  updates from Andrew Morton:

 - "mm/vmscan: fix demotion targets checks in reclaim/demotion" fixes a
   couple of issues in the demotion code - pages were failed demotion
   and were finding themselves demoted into disallowed nodes (Bing Jiao)

 - "Remove XA_ZERO from error recovery of dup_mmap()" fixes a rare
   mapledtree race and performs a number of cleanups (Liam Howlett)

 - "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use
   them" implements a lot of cleanups following on from the conversion
   of the VMA flags into a bitmap (Lorenzo Stoakes)

 - "support batch checking of references and unmapping for large folios"
   implements batching to greatly improve the performance of reclaiming
   clean file-backed large folios (Baolin Wang)

 - "selftests/mm: add memory failure selftests" does as claimed (Miaohe
   Lin)

* tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (36 commits)
  mm/page_alloc: clear page->private in free_pages_prepare()
  selftests/mm: add memory failure dirty pagecache test
  selftests/mm: add memory failure clean pagecache test
  selftests/mm: add memory failure anonymous page test
  mm: rmap: support batched unmapping for file large folios
  arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  arm64: mm: support batch clearing of the young flag for large folios
  arm64: mm: factor out the address and ptep alignment into a new helper
  mm: rmap: support batched checks of the references for large folios
  tools/testing/vma: add VMA userland tests for VMA flag functions
  tools/testing/vma: separate out vma_internal.h into logical headers
  tools/testing/vma: separate VMA userland tests into separate files
  mm: make vm_area_desc utilise vma_flags_t only
  mm: update all remaining mmap_prepare users to use vma_flags_t
  mm: update shmem_[kernel]_file_*() functions to use vma_flags_t
  mm: update secretmem to use VMA flags on mmap_prepare
  mm: update hugetlbfs to use VMA flags on mmap_prepare
  mm: add basic VMA flag operation helper functions
  tools: bitmap: add missing bitmap_[subset(), andnot()]
  mm: add mk_vma_flags() bitmap flag macro helper
  ...
2026-02-18 20:50:32 -08:00
Linus Torvalds
23b0f90ba8 Merge tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:

 - Remove macros from proc handler converters

   Replace the proc converter macros with "regular" functions. Though it
   is more verbose than the macro version, it helps when debugging and
   better aligns with coding-style.rst.

 - General cleanup

   Remove superfluous ctl_table forward declarations. Const qualify the
   memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays.
   Add missing kernel doc to proc_dointvec_conv.

 - Testing

   This series was run through sysctl selftests/kunit test suite in
   x86_64. And went into linux-next after rc4, giving it a good 3 weeks
   of testing

* tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions
  sysctl: Replace unidirectional INT converter macros with functions
  sysctl: Add kernel doc to proc_douintvec_conv
  sysctl: Replace UINT converter macros with functions
  sysctl: Add CONFIG_PROC_SYSCTL guards for converter macros
  sysctl: clarify proc_douintvec_minmax doc
  sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n
  sysctl: Remove unused ctl_table forward declarations
  loadpin: Implement custom proc_handler for enforce
  alloc_tag: move memory_allocation_profiling_sysctls into .rodata
  sysctl: Add missing kernel-doc for proc_dointvec_conv
2026-02-18 10:45:36 -08:00
Linus Torvalds
d295082ea6 Merge tag 'spdx-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx
Pull SPDX updates from Greg KH:
 "Here are two small changes that add some missing SPDX license lines to
  some core kernel files. These are:

   - adding SPDX license lines to kdb files

   - adding SPDX license lines to the remaining kernel/ files

  Both of these have been in linux-next for a while with no reported
  issues"

* tag 'spdx-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx:
  kernel: debug: Add SPDX license ids to kdb files
  kernel: add SPDX-License-Identifier lines
2026-02-17 09:46:03 -08:00
Linus Torvalds
99dfe2d4da Merge tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull more block updates from Jens Axboe:

 - Fix partial IOVA mapping cleanup in error handling

 - Minor prep series ignoring discard return value, as
   the inline value is always known

 - Ensure BLK_FEAT_STABLE_WRITES is set for drbd

 - Fix leak of folio in bio_iov_iter_bounce_read()

 - Allow IOC_PR_READ_* for read-only open

 - Another debugfs deadlock fix

 - A few doc updates

* tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  blk-mq: use NOIO context to prevent deadlock during debugfs creation
  blk-stat: convert struct blk_stat_callback to kernel-doc
  block: fix enum descriptions kernel-doc
  block: update docs for bio and bvec_iter
  block: change return type to void
  nvmet: ignore discard return value
  md: ignore discard return value
  block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova
  block: fix folio leak in bio_iov_iter_bounce_read()
  block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ
  drbd: always set BLK_FEAT_STABLE_WRITES
2026-02-17 08:48:45 -08:00
Linus Torvalds
543b9b6339 Merge tag 'kernel-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:

 - pid: introduce task_ppid_vnr() helper

 - pidfs: convert rb-tree to rhashtable

   Mateusz reported performance penalties during task creation because
   pidfs uses pidmap_lock to add elements into the rbtree. Switch to an
   rhashtable to have separate fine-grained locking and to decouple from
   pidmap_lock moving all heavy manipulations outside of it

   Also move inode allocation outside of pidmap_lock. With this there's
   nothing happening for pidfs under pidmap_lock

 - pid: reorder fields in pid_namespace to reduce false sharing

 - Revert "pid: make __task_pid_nr_ns(ns => NULL) safe for zombie
   callers"

 - ipc: Add SPDX license id to mqueue.c

* tag 'kernel-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  pid: introduce task_ppid_vnr() helper
  pidfs: implement ino allocation without the pidmap lock
  Revert "pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers"
  pid: reorder fields in pid_namespace to reduce false sharing
  pidfs: convert rb-tree to rhashtable
  ipc: Add SPDX license id to mqueue.c
2026-02-16 12:37:13 -08:00
Yu Kuai
dfe48ea179 blk-mq: use NOIO context to prevent deadlock during debugfs creation
Creating debugfs entries can trigger fs reclaim, which can enter back
into the block layer request_queue. This can cause deadlock if the
queue is frozen.

Previously, a WARN_ON_ONCE check was used in debugfs_create_files()
to detect this condition, but it was racy since the queue can be frozen
from another context at any time.

Introduce blk_debugfs_lock()/blk_debugfs_unlock() helpers that combine
the debugfs_mutex with memalloc_noio_save()/restore() to prevent fs
reclaim from triggering block I/O. Also add blk_debugfs_lock_nomemsave()
and blk_debugfs_unlock_nomemrestore() variants for callers that don't
need NOIO protection (e.g., debugfs removal or read-only operations).

Replace all raw debugfs_mutex lock/unlock pairs with these helpers,
using the _nomemsave/_nomemrestore variants where appropriate.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs9gNKEYAPagD9JADfO5UH+OiCr4P7OO2wjpfOYeM-RV=A@mail.gmail.com/
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/all/aYWQR7CtYdk3K39g@shinmob/
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-16 10:47:25 -07:00
Linus Torvalds
2d10a48871 Merge tag 'probes-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull kprobes updates from Masami Hiramatsu:

 - Use a dedicated kernel thread to optimize the kprobes instead of
   using workqueue thread. Since the kprobe optimizer waits a long time
   for synchronize_rcu_task(), it can block other workers in the same
   queue if it uses a workqueue.

 - kprobe-events: return immediately if no new probe events are
   specified on the kernel command line at boot time. This shortens
   the kernel boot time.

 - When a kprobe is fully removed from the kernel code, retry optimizing
   another kprobe which is blocked by that kprobe.

* tag 'probes-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  kprobes: Use dedicated kthread for kprobe optimizer
  tracing: kprobe-event: Return directly when trace kprobes is empty
  kprobes: retry blocked optprobe in do_free_cleaned_kprobes
2026-02-16 07:04:01 -08:00
Sebastian Andrzej Siewior
8e9bf8b9e8 printk, vt, fbcon: Remove console_conditional_schedule()
do_con_write(), fbcon_redraw.*() invoke console_conditional_schedule()
which is a conditional scheduling point based on printk's internal
variables console_may_schedule. It may only be used if the console lock
is acquired for instance via console_lock() or console_trylock().

Prinkt sets the internal variable to 1 (and allows to schedule)
if the console lock has been acquired via console_lock(). The trylock
does not allow it.

The console_conditional_schedule() invocation in do_con_write() is
invoked shortly before console_unlock().
The console_conditional_schedule() invocation in fbcon_redraw.*()
original from fbcon_scroll() / vt's con_scroll() which originate from a
line feed.

In console_unlock() the variable is set to 0 (forbids to schedule) and
it tries to schedule while making progress printing. This is brand new
compared to when console_conditional_schedule() was added in v2.4.9.11.

In v2.6.38-rc3, console_unlock() (started its existence) iterated over
all consoles and flushed them with disabled interrupts. A scheduling
attempt here was not possible, it relied that a long print scheduled
before console_unlock().

Since commit 8d91f8b153 ("printk: do cond_resched() between lines
while outputting to consoles"), which appeared in v4.5-rc1,
console_unlock() attempts to schedule if it was allowed to schedule
while during console_lock(). Each record is idealy one line so after
every line feed.

This console_conditional_schedule() is also only relevant on
PREEMPT_NONE and PREEMPT_VOLUNTARY builds. In other configurations
cond_resched() becomes a nop and has no impact.

I'm bringing this all up just proof that it is not required anymore. It
becomes a problem on a PREEMPT_RT build with debug code enabled because
that might_sleep() in cond_resched() remains and triggers a warnings.
This is due to

 legacy_kthread_func-> console_flush_one_record ->  vt_console_print-> lf
   -> con_scroll -> fbcon_scroll

and vt_console_print() acquires a spinlock_t which does not allow a
voluntary schedule. There is no need to fb_scroll() to schedule since
console_flush_one_record() attempts to schedule after each line.
!PREEMPT_RT is not affected because the legacy printing thread is only
enabled on PREEMPT_RT builds.

Therefore I suggest to remove console_conditional_schedule().

Cc: Simona Vetter <simona@ffwll.ch>
Cc: Helge Deller <deller@gmx.de>
Cc: linux-fbdev@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Fixes: 5f53ca3ff8 ("printk: Implement legacy printer kthread for PREEMPT_RT")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Petr Mladek <pmladek@suse.com> # from printk() POV
Signed-off-by: Helge Deller <deller@gmx.de>
2026-02-14 11:09:47 +01:00
Linus Torvalds
3c6e577d5a Merge tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
 "User visible changes:

   - Add an entry into MAINTAINERS file for RUST versions of code

     There's now RUST code for tracing and static branches. To
     differentiate that code from the C code, add entries in for the
     RUST version (with "[RUST]" around it) so that the right
     maintainers get notified on changes.

   - New bitmask-list option added to tracefs

     When this is set, bitmasks in trace event are not displayed as hex
     numbers, but instead as lists: e.g. 0-5,7,9 instead of 0000015f

   - New show_event_filters file in tracefs

     Instead of having to search all events/*/*/filter for any active
     filters enabled in the trace instance, the file show_event_filters
     will list them so that there's only one file that needs to be
     examined to see if any filters are active.

   - New show_event_triggers file in tracefs

     Instead of having to search all events/*/*/trigger for any active
     triggers enabled in the trace instance, the file
     show_event_triggers will list them so that there's only one file
     that needs to be examined to see if any triggers are active.

   - Have traceoff_on_warning disable trace pintk buffer too

     Recently recording of trace_printk() could go to other trace
     instances instead of the top level instance. But if
     traceoff_on_warning triggers, it doesn't stop the buffer with
     trace_printk() and that data can easily be lost by being
     overwritten. Have traceoff_on_warning also disable the instance
     that has trace_printk() being written to it.

   - Update the hist_debug file to show what function the field uses

     When CONFIG_HIST_TRIGGERS_DEBUG is enabled, a hist_debug file
     exists for every event. This displays the internal data of any
     histogram enabled for that event. But it is lacking the function
     that is called to process one of its fields. This is very useful
     information that was missing when debugging histograms.

   - Up the histogram stack size from 16 to 31

     Stack traces can be used as keys for event histograms. Currently
     the size of the stack that is stored is limited to just 16 entries.
     But the storage space in the histogram is 256 bytes, meaning that
     it can store up to 31 entries (plus one for the count of entries).
     Instead of letting that space go to waste, up the limit from 16 to
     31. This makes the keys much more useful.

   - Fix permissions of per CPU file buffer_size_kb

     The per CPU file of buffer_size_kb was incorrectly set to read only
     in a previous cleanup. It should be writable.

   - Reset "last_boot_info" if the persistent buffer is cleared

     The last_boot_info shows address information of a persistent ring
     buffer if it contains data from a previous boot. It is cleared when
     recording starts again, but it is not cleared when the buffer is
     reset. The data is useless after a reset so clear it on reset too.

  Internal changes:

   - A change was made to allow tracepoint callbacks to have preemption
     enabled, and instead be protected by SRCU. This required some
     updates to the callbacks for perf and BPF.

     perf needed to disable preemption directly in its callback because
     it expects preemption disabled in the later code.

     BPF needed to disable migration, as its code expects to run
     completely on the same CPU.

   - Have irq_work wake up other CPU if current CPU is "isolated"

     When there's a waiter waiting on ring buffer data and a new event
     happens, an irq work is triggered to wake up that waiter. This is
     noisy on isolated CPUs (running NO_HZ_FULL). Trigger an IPI to a
     house keeping CPU instead.

   - Use proper free of trigger_data instead of open coding it in.

   - Remove redundant call of event_trigger_reset_filter()

     It was called immediately in a function that was called right after
     it.

   - Workqueue cleanups

   - Report errors if tracing_update_buffers() were to fail.

   - Make the enum update workqueue generic for other parts of tracing

     On boot up, a work queue is created to convert enum names into
     their numbers in the trace event format files. This work queue can
     also be used for other aspects of tracing that takes some time and
     shouldn't be called by the init call code.

     The blk_trace initialization takes a bit of time. Have the
     initialization code moved to the new tracing generic work queue
     function.

   - Skip kprobe boot event creation call if there's no kprobes defined
     on cmdline

     The kprobe initialization to set up kprobes if they are defined on
     the cmdline requires taking the event_mutex lock. This can be held
     by other tracing code doing initialization for a long time. Since
     kprobes added to the kernel command line need to be setup
     immediately, as they may be tracing early initialization code, they
     cannot be postponed in a work queue and must be setup in the
     initcall code.

     If there's no kprobe on the kernel cmdline, there's no reason to
     take the mutex and slow down the boot up code waiting to get the
     lock only to find out there's nothing to do. Simply exit out early
     if there's no kprobes on the kernel cmdline.

     If there are kprobes on the cmdline, then someone cares more about
     tracing over the speed of boot up.

   - Clean up the trigger code a bit

   - Move code out of trace.c and into their own files

     trace.c is now over 11,000 lines of code and has become more
     difficult to maintain. Start splitting it up so that related code
     is in their own files.

     Move all the trace_printk() related code into trace_printk.c.

     Move the __always_inline stack functions into trace.h.

     Move the pid filtering code into a new trace_pid.c file.

   - Better define the max latency and snapshot code

     The latency tracers have a "max latency" buffer that is a copy of
     the main buffer and gets swapped with it when a new high latency is
     detected. This keeps the trace up to the highest latency around
     where this max_latency buffer is never written to. It is only used
     to save the last max latency trace.

     A while ago a snapshot feature was added to tracefs to allow user
     space to perform the same logic. It could also enable events to
     trigger a "snapshot" if one of their fields hit a new high. This
     was built on top of the latency max_latency buffer logic.

     Because snapshots came later, they were dependent on the latency
     tracers to be enabled. In reality, the latency tracers depend on
     the snapshot code and not the other way around. It was just that
     they came first.

     Restructure the code and the kconfigs to have the latency tracers
     depend on snapshot code instead. This actually simplifies the logic
     a bit and allows to disable more when the latency tracers are not
     defined and the snapshot code is.

   - Fix a "false sharing" in the hwlat tracer code

     The loop to search for latency in hardware was using a variable
     that could be changed by user space for each sample. If the user
     change this variable, it could cause a bus contention, and reading
     that variable can show up as a large latency in the trace causing a
     false positive. Read this variable at the start of the sample with
     a READ_ONCE() into a local variable and keep the code from sharing
     cache lines with readers.

   - Fix function graph tracer static branch optimization code

     When only one tracer is defined for function graph tracing, it uses
     a static branch to call that tracer directly. When another tracer
     is added, it goes into loop logic to call all the registered
     callbacks.

     The code was incorrect when going back to one tracer and never
     re-enabled the static branch again to do the optimization code.

   - And other small fixes and cleanups"

* tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (46 commits)
  function_graph: Restore direct mode when callbacks drop to one
  tracing: Fix indentation of return statement in print_trace_fmt()
  tracing: Reset last_boot_info if ring buffer is reset
  tracing: Fix to set write permission to per-cpu buffer_size_kb
  tracing: Fix false sharing in hwlat get_sample()
  tracing: Move d_max_latency out of CONFIG_FSNOTIFY protection
  tracing: Better separate SNAPSHOT and MAX_TRACE options
  tracing: Add tracer_uses_snapshot() helper to remove #ifdefs
  tracing: Rename trace_array field max_buffer to snapshot_buffer
  tracing: Move pid filtering into trace_pid.c
  tracing: Move trace_printk functions out of trace.c and into trace_printk.c
  tracing: Use system_state in trace_printk_init_buffers()
  tracing: Have trace_printk functions use flags instead of using global_trace
  tracing: Make tracing_update_buffers() take NULL for global_trace
  tracing: Make printk_trace global for tracing system
  tracing: Move ftrace_trace_stack() out of trace.c and into trace.h
  tracing: Move __trace_buffer_{un}lock_*() functions to trace.h
  tracing: Make tracing_selftest_running global to the tracing subsystem
  tracing: Make tracing_disabled global for tracing system
  tracing: Clean up use of trace_create_maxlat_file()
  ...
2026-02-13 19:25:16 -08:00
Linus Torvalds
72f05009d8 Merge tag 'dma-mapping-7.0-2026-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping update from Marek Szyprowski:
 "A small code cleanup for the DMA-mapping subsystem: removal of unused
  hooks (Robin Murphy)"

* tag 'dma-mapping-7.0-2026-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-mapping: Remove dma_mark_clean (again)
2026-02-13 14:51:39 -08:00
Anton Protopopov
b0b1a8583d bpf: Add a map/btf from a fd array more consistently
The add_fd_from_fd_array() function takes a file descriptor as a
parameter and tries to add either map or btf to the corresponding
list of used objects. As was reported by Dan Carpenter, since the
commit c81e4322acf0 ("bpf: Fix a potential use-after-free of BTF
object"), the fdget() is called twice on the file descriptor, and
thus userspace, potentially, can replace the file pointed to by the
file descriptor in between the two calls. On practice, this shouldn't
break anything on the kernel side, but for consistency fix the code
such that only one fdget() is executed.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/aY689z7gHNv8rgVO@stanley.mountain/
Fixes: ccd2d799ed ("bpf: Fix a potential use-after-free of BTF object")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260213212949.759321-1-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-13 14:37:02 -08:00
Anton Protopopov
ccd2d799ed bpf: Fix a potential use-after-free of BTF object
Refcounting in the check_pseudo_btf_id() function is incorrect:
the __check_pseudo_btf_id() function might get called with a zero
refcounted btf. Fix this, and patch related code accordingly.

v3: rephrase a comment (AI)
v2: fix a refcount leak introduced in v1 (AI)

Reported-by: syzbot+5a0f1995634f7c1dadbf@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5a0f1995634f7c1dadbf
Fixes: 76145f7255 ("bpf: Refactor check_pseudo_btf_id")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260209132904.63908-1-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-13 14:14:27 -08:00
Linus Torvalds
a353e7260b Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:

 - in-order support in virtio core

 - multiple address space support in vduse

 - fixes, cleanups all over the place, notably dma alignment fixes for
   non-cache-coherent systems

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (59 commits)
  vduse: avoid adding implicit padding
  vhost: fix caching attributes of MMIO regions by setting them explicitly
  vdpa/mlx5: update MAC address handling in mlx5_vdpa_set_attr()
  vdpa/mlx5: reuse common function for MAC address updates
  vdpa/mlx5: update mlx_features with driver state check
  crypto: virtio: Replace package id with numa node id
  crypto: virtio: Remove duplicated virtqueue_kick in virtio_crypto_skcipher_crypt_req
  crypto: virtio: Add spinlock protection with virtqueue notification
  Documentation: Add documentation for VDUSE Address Space IDs
  vduse: bump version number
  vduse: add vq group asid support
  vduse: merge tree search logic of IOTLB_GET_FD and IOTLB_GET_INFO ioctls
  vduse: take out allocations from vduse_dev_alloc_coherent
  vduse: remove unused vaddr parameter of vduse_domain_free_coherent
  vduse: refactor vdpa_dev_add for goto err handling
  vhost: forbid change vq groups ASID if DRIVER_OK is set
  vdpa: document set_group_asid thread safety
  vduse: return internal vq group struct as map token
  vduse: add vq group support
  vduse: add v1 API definition
  ...
2026-02-13 12:02:18 -08:00
Shengming Hu
53b2fae90f function_graph: Restore direct mode when callbacks drop to one
When registering a second fgraph callback, direct path is disabled and
array loop is used instead.  When ftrace_graph_active falls back to one,
we try to re-enable direct mode via ftrace_graph_enable_direct(true, ...).
But ftrace_graph_enable_direct() incorrectly disables the static key
rather than enabling it.  This leaves fgraph_do_direct permanently off
after first multi-callback transition, so direct fast mode is never
restored.

Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260213142932519cuWSpEXeS4-UnCvNXnK2P@zte.com.cn
Fixes: cc60ee813b ("function_graph: Use static_call and branch to optimize entry function")
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-13 09:33:14 -05:00
Linus Torvalds
cee73b1e84 Merge tag 'riscv-for-linus-7.0-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Paul Walmsley:

 - Add support for control flow integrity for userspace processes.

   This is based on the standard RISC-V ISA extensions Zicfiss and
   Zicfilp

 - Improve ptrace behavior regarding vector registers, and add some
   selftests

 - Optimize our strlen() assembly

 - Enable the ISO-8859-1 code page as built-in, similar to ARM64, for
   EFI volume mounting

 - Clean up some code slightly, including defining copy_user_page() as
   copy_page() rather than memcpy(), aligning us with other
   architectures; and using max3() to slightly simplify an expression
   in riscv_iommu_init_check()

* tag 'riscv-for-linus-7.0-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (42 commits)
  riscv: lib: optimize strlen loop efficiency
  selftests: riscv: vstate_exec_nolibc: Use the regular prctl() function
  selftests: riscv: verify ptrace accepts valid vector csr values
  selftests: riscv: verify ptrace rejects invalid vector csr inputs
  selftests: riscv: verify syscalls discard vector context
  selftests: riscv: verify initial vector state with ptrace
  selftests: riscv: test ptrace vector interface
  riscv: ptrace: validate input vector csr registers
  riscv: csr: define vtype register elements
  riscv: vector: init vector context with proper vlenb
  riscv: ptrace: return ENODATA for inactive vector extension
  kselftest/riscv: add kselftest for user mode CFI
  riscv: add documentation for shadow stack
  riscv: add documentation for landing pad / indirect branch tracking
  riscv: create a Kconfig fragment for shadow stack and landing pad support
  arch/riscv: add dual vdso creation logic and select vdso based on hw
  arch/riscv: compile vdso with landing pad and shadow stack note
  riscv: enable kernel access to shadow stack memory via the FWFT SBI call
  riscv: add kernel command line option to opt out of user CFI
  riscv/hwprobe: add zicfilp / zicfiss enumeration in hwprobe
  ...
2026-02-12 19:17:44 -08:00
Linus Torvalds
e812928be2 Merge tag 'cxl-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull CXL updates from Dave Jiang:

 - Introduce cxl_memdev_attach and pave way for soft reserved handling,
   type2 accelerator enabling, and LSA 2.0 enabling. All these series
   require the endpoint driver to settle before continuing the memdev
   driver probe.

 - Address CXL port error protocol handling and reporting.

   The large patch series was split into three parts. The first two
   parts are included here with the final part coming later.

   The first part consists of a series of code refactoring to PCI AER
   sub-system that addresses CXL and also CXL RAS code to prepare for
   port error handling.

   The second part refactors the CXL code to move management of
   component registers to cxl_port objects to allow all CXL AER errors
   to be handled through the cxl_port hierarchy.

 - Provide AMD Zen5 platform address translation for CXL using ACPI
   PRMT. This includes a conventions document to explain why this is
   needed and how it's implemented.

 - Misc CXL patches of fixes, cleanups, and updates. Including CXL
   address translation for unaligned MOD3 regions.

[ TLA service: CXL is "Compute Express Link" ]

* tag 'cxl-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (59 commits)
  cxl: Disable HPA/SPA translation handlers for Normalized Addressing
  cxl/region: Factor out code into cxl_region_setup_poison()
  cxl/atl: Lock decoders that need address translation
  cxl: Enable AMD Zen5 address translation using ACPI PRMT
  cxl/acpi: Prepare use of EFI runtime services
  cxl: Introduce callback for HPA address ranges translation
  cxl/region: Use region data to get the root decoder
  cxl/region: Add @hpa_range argument to function cxl_calc_interleave_pos()
  cxl/region: Separate region parameter setup and region construction
  cxl: Simplify cxl_root_ops allocation and handling
  cxl/region: Store HPA range in struct cxl_region
  cxl/region: Store root decoder in struct cxl_region
  cxl/region: Rename misleading variable name @hpa to @hpa_range
  Documentation/driver-api/cxl: ACPI PRM Address Translation Support and AMD Zen5 enablement
  cxl, doc: Moving conventions in separate files
  cxl, doc: Remove isonum.txt inclusion
  cxl/port: Unify endpoint and switch port lookup
  cxl/port: Move endpoint component register management to cxl_port
  cxl/port: Map Port RAS registers
  cxl/port: Move dport RAS setup to dport add time
  ...
2026-02-12 16:33:05 -08:00
Ran Xiaokai
f7a553b813 kho: remove unnecessary WARN_ON(err) in kho_populate()
The following pr_warn() provides detailed error and location information,
WARN_ON(err) adds no additional debugging value, so remove the redundant
WARN_ON() call.

Link: https://lkml.kernel.org/r/20260212111146.210086-3-ranxiaokai627@163.com
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:45:58 -08:00
Ran Xiaokai
34df6c4734 kho: fix missing early_memunmap() call in kho_populate()
Patch series "two fixes in kho_populate()", v3.


This patch (of 2):

kho_populate() returns without calling early_memunmap() on success path,
this will cause early ioremap virtual address space leak.

Link: https://lkml.kernel.org/r/20260212111146.210086-1-ranxiaokai627@163.com
Link: https://lkml.kernel.org/r/20260212111146.210086-2-ranxiaokai627@163.com
Fixes: b50634c5e8 ("kho: cleanup error handling in kho_populate()")
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:45:57 -08:00
Lorenzo Stoakes
5bd2c0650a mm: update all remaining mmap_prepare users to use vma_flags_t
We will be shortly removing the vm_flags_t field from vm_area_desc so we
need to update all mmap_prepare users to only use the dessc->vma_flags
field.

This patch achieves that and makes all ancillary changes required to make
this possible.

This lays the groundwork for future work to eliminate the use of
vm_flags_t in vm_area_desc altogether and more broadly throughout the
kernel.

While we're here, we take the opportunity to replace VM_REMAP_FLAGS with
VMA_REMAP_FLAGS, the vma_flags_t equivalent.

No functional changes intended.

Link: https://lkml.kernel.org/r/fb1f55323799f09fe6a36865b31550c9ec67c225.1769097829.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Damien Le Moal <dlemoal@kernel.org>	[zonefs]
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: Pedro Falcato <pfalcato@suse.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Yury Norov <ynorov@nvidia.com>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:42:58 -08:00
Bing Jiao
1aceed565f mm/vmscan: fix demotion targets checks in reclaim/demotion
Patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion",
v9.

This patch series addresses two issues in demote_folio_list(),
can_demote(), and next_demotion_node() in reclaim/demotion.

1. demote_folio_list() and can_demote() do not correctly check
   demotion target against cpuset.mems_effective, which will cause (a)
   pages to be demoted to not-allowed nodes and (b) pages fail demotion
   even if the system still has allowed demotion nodes.

   Patch 1 fixes this bug by updating cpuset_node_allowed() and
   mem_cgroup_node_allowed() to return effective_mems, allowing directly
   logic-and operation against demotion targets.

2. next_demotion_node() returns a preferred demotion target, but it
   does not check the node against allowed nodes.

   Patch 2 ensures that next_demotion_node() filters against the allowed
   node mask and selects the closest demotion target to the source node.


This patch (of 2):

Fix two bugs in demote_folio_list() and can_demote() due to incorrect
demotion target checks against cpuset.mems_effective in reclaim/demotion.

Commit 7d709f49ba ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to can_demote().
However:

  1. It does not apply this check in demote_folio_list(), which leads
     to situations where pages are demoted to nodes that are
     explicitly excluded from the task's cpuset.mems.

  2. It checks only the nodes in the immediate next demotion hierarchy
     and does not check all allowed demotion targets in can_demote().
     This can cause pages to never be demoted if the nodes in the next
     demotion hierarchy are not set in mems_effective.

These bugs break resource isolation provided by cpuset.mems.  This is
visible from userspace because pages can either fail to be demoted
entirely or are demoted to nodes that are not allowed in multi-tier memory
systems.

To address these bugs, update cpuset_node_allowed() and
mem_cgroup_node_allowed() to return effective_mems, allowing directly
logic-and operation against demotion targets.  Also update can_demote()
and demote_folio_list() accordingly.

Bug 1 reproduction:
  Assume a system with 4 nodes, where nodes 0-1 are top-tier and
  nodes 2-3 are far-tier memory. All nodes have equal capacity.

  Test script:
    echo 1 > /sys/kernel/mm/numa/demotion_enabled
    mkdir /sys/fs/cgroup/test
    echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
    echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
    echo $$ > /sys/fs/cgroup/test/cgroup.procs
    swapoff -a
    # Expectation: Should respect node 0-2 limit.
    # Observation: Node 3 shows significant allocation (MemFree drops)
    stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1

Bug 2 reproduction:
  Assume a system with 6 nodes, where nodes 0-2 are top-tier,
  node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
  All nodes have equal capacity.

  Test script:
    echo 1 > /sys/kernel/mm/numa/demotion_enabled
    mkdir /sys/fs/cgroup/test
    echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
    echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
    echo $$ > /sys/fs/cgroup/test/cgroup.procs
    swapoff -a
    # Expectation: Pages are demoted to Nodes 4-5
    # Observation: No pages are demoted before oom.
    stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2

Link: https://lkml.kernel.org/r/20260114205305.2869796-1-bingjiao@google.com
Link: https://lkml.kernel.org/r/20260114205305.2869796-2-bingjiao@google.com
Fixes: 7d709f49ba ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao <bingjiao@google.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:42:52 -08:00
Linus Torvalds
f75c03a761 Merge tag 'trace-rv-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull runtime verifier updates from Steven Rostedt:

 - Refactor da_monitor to minimize macros

   Complete refactor of da_monitor.h to reduce reliance on macros
   generating functions. Use generic static functions and uses the
   preprocessor only when strictly necessary (e.g. for tracepoint
   handlers).

   The change essentially relies on functions with generic names (e.g.
   da_handle) instead of monitor-specific as well adding the need to
   define constant (e.g. MONITOR_NAME, MONITOR_TYPE) before including
   the header rather than calling macros that would define functions.
   Also adapt monitors and documentation accordingly.

 - Cleanup DA code generation scripts

   Clean up functions in dot2c removing reimplementations of trivial
   library functions (__buff_to_string) and removing some other unused
   intermediate steps.

 - Annotate functions with types in the rvgen python scripts

 - Remove superfluous assignments and cleanup generated code

   The rvgen scripts generate a superfluous assignment to 0 for enum
   variables and don't add commas to the last elements, which is against
   the kernel coding standards. Change the generation process for a
   better compliance and slightly simpler logic.

 - Remove superfluous declarations from generated code

   The monitor container source files contained a declaration and a
   definition for the rv_monitor variable. The former is superfluous and
   was removed.

 - Fix reference to outdated documentation

   s/da_monitor_synthesis.rst/monitor_synthesis.rst in comment in
   da_monitor.h

* tag 'trace-rv-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rv: Fix documentation reference in da_monitor.h
  verification/rvgen: Remove unused variable declaration from containers
  verification/dot2c: Remove superfluous enum assignment and add last comma
  verification/dot2c: Remove __buff_to_string() and cleanup
  verification/rvgen: Annotate DA functions with types
  verification/rvgen: Adapt dot2k and templates after refactoring da_monitor.h
  Documentation/rv: Adapt documentation after da_monitor refactoring
  rv: Cleanup da_monitor after refactor
  rv: Refactor da_monitor to minimise macros
2026-02-12 14:08:49 -08:00
Linus Torvalds
136114e0ab Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:

 - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves
   disk space by teaching ocfs2 to reclaim suballocator block group
   space (Heming Zhao)

 - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the
   ARRAY_END() macro and uses it in various places (Alejandro Colomar)

 - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes
   the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the
   page size (Pnina Feder)

 - "kallsyms: Prevent invalid access when showing module buildid" cleans
   up kallsyms code related to module buildid and fixes an invalid
   access crash when printing backtraces (Petr Mladek)

 - "Address page fault in ima_restore_measurement_list()" fixes a
   kexec-related crash that can occur when booting the second-stage
   kernel on x86 (Harshit Mogalapalli)

 - "kho: ABI headers and Documentation updates" updates the kexec
   handover ABI documentation (Mike Rapoport)

 - "Align atomic storage" adds the __aligned attribute to atomic_t and
   atomic64_t definitions to get natural alignment of both types on
   csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain)

 - "kho: clean up page initialization logic" simplifies the page
   initialization logic in kho_restore_page() (Pratyush Yadav)

 - "Unload linux/kernel.h" moves several things out of kernel.h and into
   more appropriate places (Yury Norov)

 - "don't abuse task_struct.group_leader" removes the usage of
   ->group_leader when it is "obviously unnecessary" (Oleg Nesterov)

 - "list private v2 & luo flb" adds some infrastructure improvements to
   the live update orchestrator (Pasha Tatashin)

* tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits)
  watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency
  procfs: fix missing RCU protection when reading real_parent in do_task_stat()
  watchdog/softlockup: fix sample ring index wrap in need_counting_irqs()
  kcsan, compiler_types: avoid duplicate type issues in BPF Type Format
  kho: fix doc for kho_restore_pages()
  tests/liveupdate: add in-kernel liveupdate test
  liveupdate: luo_flb: introduce File-Lifecycle-Bound global state
  liveupdate: luo_file: Use private list
  list: add kunit test for private list primitives
  list: add primitives for private list manipulations
  delayacct: fix uapi timespec64 definition
  panic: add panic_force_cpu= parameter to redirect panic to a specific CPU
  netclassid: use thread_group_leader(p) in update_classid_task()
  RDMA/umem: don't abuse current->group_leader
  drm/pan*: don't abuse current->group_leader
  drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
  drm/amdgpu: don't abuse current->group_leader
  android/binder: use same_thread_group(proc->tsk, current) in binder_mmap()
  android/binder: don't abuse current->group_leader
  kho: skip memoryless NUMA nodes when reserving scratch areas
  ...
2026-02-12 12:13:01 -08:00
Linus Torvalds
4cff5c05e0 Merge tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - "powerpc/64s: do not re-activate batched TLB flush" makes
   arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev)

   It adds a generic enter/leave layer and switches architectures to use
   it. Various hacks were removed in the process.

 - "zram: introduce compressed data writeback" implements data
   compression for zram writeback (Richard Chang and Sergey Senozhatsky)

 - "mm: folio_zero_user: clear page ranges" adds clearing of contiguous
   page ranges for hugepages. Large improvements during demand faulting
   are demonstrated (David Hildenbrand)

 - "memcg cleanups" tidies up some memcg code (Chen Ridong)

 - "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos
   stats" improves DAMOS stat's provided information, deterministic
   control, and readability (SeongJae Park)

 - "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few
   issues in the hugetlb cgroup charging selftests (Li Wang)

 - "Fix va_high_addr_switch.sh test failure - again" addresses several
   issues in the va_high_addr_switch test (Chunyu Hu)

 - "mm/damon/tests/core-kunit: extend existing test scenarios" improves
   the KUnit test coverage for DAMON (Shu Anzai)

 - "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a
   glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to
   transiently return -EAGAIN (Shivank Garg)

 - "arch, mm: consolidate hugetlb early reservation" reworks and
   consolidates a pile of straggly code related to reservation of
   hugetlb memory from bootmem and creation of CMA areas for hugetlb
   (Mike Rapoport)

 - "mm: clean up anon_vma implementation" cleans up the anon_vma
   implementation in various ways (Lorenzo Stoakes)

 - "tweaks for __alloc_pages_slowpath()" does a little streamlining of
   the page allocator's slowpath code (Vlastimil Babka)

 - "memcg: separate private and public ID namespaces" cleans up the
   memcg ID code and prevents the internal-only private IDs from being
   exposed to userspace (Shakeel Butt)

 - "mm: hugetlb: allocate frozen gigantic folio" cleans up the
   allocation of frozen folios and avoids some atomic refcount
   operations (Kefeng Wang)

 - "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement
   of memory betewwn the active and inactive LRUs and adds auto-tuning
   of the ratio-based quotas and of monitoring intervals (SeongJae Park)

 - "Support page table check on PowerPC" makes
   CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan)

 - "nodemask: align nodes_and{,not} with underlying bitmap ops" makes
   nodes_and() and nodes_andnot() propagate the return values from the
   underlying bit operations, enabling some cleanup in calling code
   (Yury Norov)

 - "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up
   some DAMON internal interfaces (SeongJae Park)

 - "mm/khugepaged: cleanups and scan limit fix" does some cleanup work
   in khupaged and fixes a scan limit accounting issue (Shivank Garg)

 - "mm: balloon infrastructure cleanups" goes to town on the balloon
   infrastructure and its page migration function. Mainly cleanups, also
   some locking simplification (David Hildenbrand)

 - "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds
   additional tracepoints to the page reclaim code (Jiayuan Chen)

 - "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is
   part of Marco's kernel-wide migration from the legacy workqueue APIs
   over to the preferred unbound workqueues (Marco Crivellari)

 - "Various mm kselftests improvements/fixes" provides various unrelated
   improvements/fixes for the mm kselftests (Kevin Brodsky)

 - "mm: accelerate gigantic folio allocation" greatly speeds up gigantic
   folio allocation, mainly by avoiding unnecessary work in
   pfn_range_valid_contig() (Kefeng Wang)

 - "selftests/damon: improve leak detection and wss estimation
   reliability" improves the reliability of two of the DAMON selftests
   (SeongJae Park)

 - "mm/damon: cleanup kdamond, damon_call(), damos filter and
   DAMON_MIN_REGION" does some cleanup work in the core DAMON code
   (SeongJae Park)

 - "Docs/mm/damon: update intro, modules, maintainer profile, and misc"
   performs maintenance work on the DAMON documentation (SeongJae Park)

 - "mm: add and use vma_assert_stabilised() helper" refactors and cleans
   up the core VMA code. The main aim here is to be able to use the mmap
   write lock's lockdep state to perform various assertions regarding
   the locking which the VMA code requires (Lorenzo Stoakes)

 - "mm, swap: swap table phase II: unify swapin use" removes some old
   swap code (swap cache bypassing and swap synchronization) which
   wasn't working very well. Various other cleanups and simplifications
   were made. The end result is a 20% speedup in one benchmark (Kairui
   Song)

 - "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM
   available on 64-bit alpha, loongarch, mips, parisc, and um. Various
   cleanups were performed along the way (Qi Zheng)

* tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits)
  mm/memory: handle non-split locks correctly in zap_empty_pte_table()
  mm: move pte table reclaim code to memory.c
  mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE
  mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config
  um: mm: enable MMU_GATHER_RCU_TABLE_FREE
  parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
  mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
  LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE
  alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
  mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h
  mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles
  zsmalloc: make common caches global
  mm: add SPDX id lines to some mm source files
  mm/zswap: use %pe to print error pointers
  mm/vmscan: use %pe to print error pointers
  mm/readahead: fix typo in comment
  mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file()
  mm: refactor vma_map_pages to use vm_insert_pages
  mm/damon: unify address range representation with damon_addr_range
  mm/cma: replace snprintf with strscpy in cma_new_area
  ...
2026-02-12 11:32:37 -08:00
Linus Torvalds
37a93dd5c4 Merge tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
 "Core & protocols:

   - A significant effort all around the stack to guide the compiler to
     make the right choice when inlining code, to avoid unneeded calls
     for small helper and stack canary overhead in the fast-path.

     This generates better and faster code with very small or no text
     size increases, as in many cases the call generated more code than
     the actual inlined helper.

   - Extend AccECN implementation so that is now functionally complete,
     also allow the user-space enabling it on a per network namespace
     basis.

   - Add support for memory providers with large (above 4K) rx buffer.
     Paired with hw-gro, larger rx buffer sizes reduce the number of
     buffers traversing the stack, dincreasing single stream CPU usage
     by up to ~30%.

   - Do not add HBH header to Big TCP GSO packets. This simplifies the
     RX path, the TX path and the NIC drivers, and is possible because
     user-space taps can now interpret correctly such packets without
     the HBH hint.

   - Allow IPv6 routes to be configured with a gateway address that is
     resolved out of a different interface than the one specified,
     aligning IPv6 to IPv4 behavior.

   - Multi-queue aware sch_cake. This makes it possible to scale the
     rate shaper of sch_cake across multiple CPUs, while still enforcing
     a single global rate on the interface.

   - Add support for the nbcon (new buffer console) infrastructure to
     netconsole, enabling lock-free, priority-based console operations
     that are safer in crash scenarios.

   - Improve the TCP ipv6 output path to cache the flow information,
     saving cpu cycles, reducing cache line misses and stack use.

   - Improve netfilter packet tracker to resolve clashes for most
     protocols, avoiding unneeded drops on rare occasions.

   - Add IP6IP6 tunneling acceleration to the flowtable infrastructure.

   - Reduce tcp socket size by one cache line.

   - Notify neighbour changes atomically, avoiding inconsistencies
     between the notification sequence and the actual states sequence.

   - Add vsock namespace support, allowing complete isolation of vsocks
     across different network namespaces.

   - Improve xsk generic performances with cache-alignment-oriented
     optimizations.

   - Support netconsole automatic target recovery, allowing netconsole
     to reestablish targets when underlying low-level interface comes
     back online.

  Driver API:

   - Support for switching the working mode (automatic vs manual) of a
     DPLL device via netlink.

   - Introduce PHY ports representation to expose multiple front-facing
     media ports over a single MAC.

   - Introduce "rx-polarity" and "tx-polarity" device tree properties,
     to generalize polarity inversion requirements for differential
     signaling.

   - Add helper to create, prepare and enable managed clocks.

  Device drivers:

   - Add Huawei hinic3 PF etherner driver.

   - Add DWMAC glue driver for Motorcomm YT6801 PCIe ethernet
     controller.

   - Add ethernet driver for MaxLinear MxL862xx switches

   - Remove parallel-port Ethernet driver.

   - Convert existing driver timestamp configuration reporting to
     hwtstamp_get and remove legacy ioctl().

   - Convert existing drivers to .get_rx_ring_count(), simplifing the RX
     ring count retrieval. Also remove the legacy fallback path.

   - Ethernet high-speed NICs:
      - Broadcom (bnxt, bng):
         - bnxt: add FW interface update to support FEC stats histogram
           and NVRAM defragmentation
         - bng: add TSO and H/W GRO support
      - nVidia/Mellanox (mlx5):
         - improve latency of channel restart operations, reducing the
           used H/W resources
         - add TSO support for UDP over GRE over VLAN
         - add flow counters support for hardware steering (HWS) rules
         - use a static memory area to store headers for H/W GRO,
           leading to 12% RX tput improvement
      - Intel (100G, ice, idpf):
         - ice: reorganizes layout of Tx and Rx rings for cacheline
           locality and utilizes __cacheline_group* macros on the new
           layouts
         - ice: introduces Synchronous Ethernet (SyncE) support
      - Meta (fbnic):
         - adds debugfs for firmware mailbox and tx/rx rings vectors

   - Ethernet virtual:
      - geneve: introduce GRO/GSO support for double UDP encapsulation

   - Ethernet NICs consumer, and embedded:
      - Synopsys (stmmac):
         - some code refactoring and cleanups
      - RealTek (r8169):
         - add support for RTL8127ATF (10G Fiber SFP)
         - add dash and LTR support
      - Airoha:
         - AN8811HB 2.5 Gbps phy support
      - Freescale (fec):
         - add XDP zero-copy support
      - Thunderbolt:
         - add get link setting support to allow bonding
      - Renesas:
         - add support for RZ/G3L GBETH SoC

   - Ethernet switches:
      - Maxlinear:
         - support R(G)MII slow rate configuration
         - add support for Intel GSW150
      - Motorcomm (yt921x):
         - add DCB/QoS support
      - TI:
         - icssm-prueth: support bridging (STP/RSTP) via the switchdev
           framework

   - Ethernet PHYs:
      - Realtek:
         - enable SGMII and 2500Base-X in-band auto-negotiation
         - simplify and reunify C22/C45 drivers
      - Micrel: convert bindings to DT schema

   - CAN:
      - move skb headroom content into skb extensions, making CAN
        metadata access more robust

   - CAN drivers:
      - rcar_canfd:
         - add support for FD-only mode
         - add support for the RZ/T2H SoC
      - sja1000: cleanup the CAN state handling

   - WiFi:
      - implement EPPKE/802.1X over auth frames support
      - split up drop reasons better, removing generic RX_DROP
      - additional FTM capabilities: 6 GHz support, supported number of
        spatial streams and supported number of LTF repetitions
      - better mac80211 iterators to enumerate resources
      - initial UHR (Wi-Fi 8) support for cfg80211/mac80211

   - WiFi drivers:
      - Qualcomm/Atheros:
         - ath11k: support for Channel Frequency Response measurement
         - ath12k: a significant driver refactor to support multi-wiphy
           devices and and pave the way for future device support in the
           same driver (rather than splitting to ath13k)
         - ath12k: support for the QCC2072 chipset
      - Intel:
         - iwlwifi: partial Neighbor Awareness Networking (NAN) support
         - iwlwifi: initial support for U-NII-9 and IEEE 802.11bn
      - RealTek (rtw89):
         - preparations for RTL8922DE support

   - Bluetooth:
      - implement setsockopt(BT_PHY) to set the connection packet type/PHY
      - set link_policy on incoming ACL connections

   - Bluetooth drivers:
      - btusb: add support for MediaTek7920, Realtek RTL8761BU and 8851BE
      - btqca: add WCN6855 firmware priority selection feature"

* tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1254 commits)
  bnge/bng_re: Add a new HSI
  net: macb: Fix tx/rx malfunction after phy link down and up
  af_unix: Fix memleak of newsk in unix_stream_connect().
  net: ti: icssg-prueth: Add optional dependency on HSR
  net: dsa: add basic initial driver for MxL862xx switches
  net: mdio: add unlocked mdiodev C45 bus accessors
  net: dsa: add tag format for MxL862xx switches
  dt-bindings: net: dsa: add MaxLinear MxL862xx
  selftests: drivers: net: hw: Modify toeplitz.c to poll for packets
  octeontx2-pf: Unregister devlink on probe failure
  net: renesas: rswitch: fix forwarding offload statemachine
  ionic: Rate limit unknown xcvr type messages
  tcp: inet6_csk_xmit() optimization
  tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock()
  tcp: populate inet->cork.fl.u.ip6 in tcp_v6_connect()
  ipv6: inet6_csk_xmit() and inet6_csk_update_pmtu() use inet->cork.fl.u.ip6
  ipv6: use inet->cork.fl.u.ip6 and np->final in ip6_datagram_dst_update()
  ipv6: use np->final in inet6_sk_rebuild_header()
  ipv6: add daddr/final storage in struct ipv6_pinfo
  net: stmmac: qcom-ethqos: fix qcom_ethqos_serdes_powerup()
  ...
2026-02-11 19:31:52 -08:00
Haoyang LIU
fa4820b893 tracing: Fix indentation of return statement in print_trace_fmt()
The return statement inside the nested if block in print_trace_fmt()
is not properly indented, making the code structure unclear. This was
flagged by smatch as a warning.

Add proper indentation to the return statement to match the kernel
coding style and improve readability.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260210153903.8041-1-tttturtleruss@gmail.com
Signed-off-by: Haoyang LIU <tttturtleruss@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-11 21:58:21 -05:00
Linus Torvalds
1c2b4a4c2b Merge tag 'pci-v7.0-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull PCI updates from Bjorn Helgaas:
 "Enumeration:

   - Don't try to enable Extended Tags on VFs since that bit is Reserved
     and causes misleading log messages (Håkon Bugge)

   - Initialize Endpoint Read Completion Boundary to match Root Port,
     regardless of ACPI _HPX (Håkon Bugge)

   - Apply _HPX PCIe Setting Record only to AER configuration, and only
     when OS owns PCIe hotplug but not AER, to avoid clobbering Extended
     Tag and Relaxed Ordering settings (Håkon Bugge)

  Resource management:

   - Move CardBus code to setup-cardbus.c and only build it when
     CONFIG_CARDBUS is set (Ilpo Järvinen)

   - Fix bridge window alignment with optional resources, where
     additional alignment requirement was previously lost (Ilpo
     Järvinen)

   - Stop over-estimating bridge window size since they are now assigned
     without any gaps between them (Ilpo Järvinen)

   - Increase resource MAX_IORES_LEVEL to avoid /proc/iomem flattening
     for nested bridges and endpoints (Ilpo Järvinen)

   - Add pbus_mem_size_optional() to handle sizes of optional resources
     (SR-IOV VF BARs, expansion ROMs, bridge windows) (Ilpo Järvinen)

   - Don't claim disabled bridge windows to avoid spurious claim
     failures (Ilpo Järvinen)

  Driver binding:

   - Fix device reference leak in pcie_port_remove_service() (Uwe
     Kleine-König)

   - Move pcie_port_bus_match() and pcie_port_bus_type to PCIe-specific
     portdrv.c (Uwe Kleine-König)

   - Convert portdrv to use pcie_port_bus_type.probe() and .remove()
     callbacks so .probe() and .remove() can eventually be removed from
     struct device_driver (Uwe Kleine-König)

  Error handling:

   - Clear stale errors on reporting agents upon probe so they don't
     look like recent errors (Lukas Wunner)

   - Add generic RAS tracepoint for hotplug events (Shuai Xue)

   - Add RAS tracepoint for link speed changes (Shuai Xue)

  Power management:

   - Avoid redundant delay on transition from D3hot to D3cold if the
     device was already in D3hot (Brian Norris)

   - Prevent runtime suspend until devices are fully initialized to
     avoid saving incompletely configured device state (Brian Norris)

  Power control:

   - Add power_on/off callbacks with generic signature to pwrseq,
     tc9563, and slot drivers so they can be used by pwrctrl core
     (Manivannan Sadhasivam)

   - Add PCIe M.2 connector support to the slot pwrctrl driver
     (Manivannan Sadhasivam)

   - Switch to pwrctrl interfaces to create, destroy, and power on/off
     devices, calling them from host controller drivers instead of the
     PCI core (Manivannan Sadhasivam)

   - Drop qcom .assert_perst() callbacks since this is now done by the
     controller driver instead of the pwrctrl driver (Manivannan
     Sadhasivam)

  Virtualization:

   - Remove an incorrect unlock in pci_slot_trylock() error handling
     (Jinhui Guo)

   - Lock the bridge device for slot reset (Keith Busch)

   - Enable ACS after IOMMU configuration on OF platforms so ACS is
     enabled an all devices; previously the first device enumerated
     (typically a Root Port) didn't have ACS enabled (Manivannan
     Sadhasivam)

   - Disable ACS Source Validation for IDT 0x80b5 and 0x8090 switches to
     work around hardware erratum; previously ACS SV was only
     temporarily disabled, which worked for enumeration but not after
     reset (Manivannan Sadhasivam)

  Peer-to-peer DMA:

   - Release per-CPU pgmap ref when vm_insert_page() fails to avoid hang
     when removing the PCI device (Hou Tao)

   - Remove incorrect p2pmem_alloc_mmap() warning about page refcount
     (Hou Tao)

  Endpoint framework:

   - Add configfs sub-groups synchronously to avoid NULL pointer
     dereference when racing with removal (Liu Song)

   - Fix swapped parameters in pci_{primary/secondary}_epc_epf_unlink()
     functions (Manikanta Maddireddy)

  ASPEED PCIe controller driver:

   - Add ASPEED Root Complex DT binding and driver (Jacky Chou)

  Freescale i.MX6 PCIe controller driver:

   - Add DT binding and driver support for an optional external refclock
     in addition to the refclock from the internal PLL (Richard Zhu)

   - Fix CLKREQ# control so host asserts it during enumeration and
     Endpoints can use it afterwards to exit the L1.2 link state
     (Richard Zhu)

  NVIDIA Tegra PCIe controller driver:

   - Export irq_domain_free_irqs() to allow PCI/MSI drivers that tear
     down MSI domains to be built as modules (Aaron Kling)

   - Allow pci-tegra to be built as a module (Aaron Kling)

  NVIDIA Tegra194 PCIe controller driver:

   - Relax Kconfig so tegra194 can be built for platforms beyond
     Tegra194 (Vidya Sagar)

  Qualcomm PCIe controller driver:

   - Merge SC8180x DT binding into SM8150 (Krzysztof Kozlowski)

   - Move SDX55, SDM845, QCS404, IPQ5018, IPQ6018, IPQ8074 Gen3,
     IPQ8074, IPQ4019, IPQ9574, APQ8064, MSM8996, APQ8084 to dedicated
     schema (Krzysztof Kozlowski)

   - Add DT binding and driver support for SA8255p Endpoint being
     configured by firmware (Mrinmay Sarkar)

   - Parse PERST# from all PCIe bridge nodes for future platforms that
     will have PERST# in Switch Downstream Ports as well as in Root
     Ports (Manivannan Sadhasivam)

  Renesas RZ/G3S PCIe controller driver:

   - Use pci_generic_config_write() since the writability provided by
     the custom wrapper is unnecessary (Claudiu Beznea)

  SOPHGO PCIe controller driver:

   - Disable ASPM L0s and L1 on Sophgo 2044 PCIe Root Ports (Inochi
     Amaoto)

  Synopsys DesignWare PCIe controller driver:

   - Extend PCI_FIND_NEXT_CAP() and PCI_FIND_NEXT_EXT_CAP() to return a
     pointer to the preceding Capability, to allow removal of
     Capabilities that are advertised but not fully implemented (Qiang
     Yu)

   - Remove MSI and MSI-X Capabilities in platforms that can't support
     them, so the PCI core automatically falls back to INTx (Qiang Yu)

   - Add ASPM L1.1 and L1.2 Substates context to debugfs ltssm_status
     for drivers that support this (Shawn Lin)

   - Skip PME_Turn_Off broadcast and L2/L3 transition during suspend if
     link is not up to avoid an unnecessary timeout (Manivannan
     Sadhasivam)

   - Revert dw-rockchip, qcom, and DWC core changes that used link-up
     IRQs to trigger enumeration instead of waiting for link to be up
     because the PCI core doesn't allocate bus number space for
     hierarchies that might be attached (Niklas Cassel)

   - Make endpoint iATU entry for MSI permanent instead of programming
     it dynamically, which is slow and racy with respect to other
     concurrent traffic, e.g., eDMA (Koichiro Den)

   - Use iMSI-RX MSI target address when possible to fix endpoints using
     32-bit MSI (Shawn Lin)

   - Allow DWC host controller driver probe to continue if device is not
     found or found but inactive; only fail when there's an error with
     the link (Manivannan Sadhasivam)

   - For controllers like NXP i.MX6QP and i.MX7D, where LTSSM registers
     are not accessible after PME_Turn_Off, simply wait 10ms instead of
     polling for L2/L3 Ready (Richard Zhu)

   - Use multiple iATU entries to map large bridge windows and DMA
     ranges when necessary instead of failing (Samuel Holland)

   - Add EPC dynamic_inbound_mapping feature bit for Endpoint
     Controllers that can update BAR inbound address translation without
     requiring EPF driver to clear/reset the BAR first, and advertise it
     for DWC-based Endpoints (Koichiro Den)

   - Add EPC subrange_mapping feature bit for Endpoint Controllers that
     can map multiple independent inbound regions in a single BAR,
     implement subrange mapping, advertise it for DWC-based Endpoints,
     and add Endpoint selftests for it (Koichiro Den)

   - Make resizable BARs work for Endpoint multi-PF configurations;
     previously it only worked for PF 0 (Aksh Garg)

   - Fix Endpoint non-PF 0 support for BAR configuration, ATU mappings,
     and Address Match Mode (Aksh Garg)

   - Set up iATU when ECAM is enabled; previously IO and MEM outbound
     windows weren't programmed, and ECAM-related iATU entries weren't
     restored after suspend/resume, so config accesses failed (Krishna
     Chaitanya Chundru)

  Miscellaneous:

   - Use system_percpu_wq and WQ_PERCPU to explicitly request per-CPU
     work so WQ_UNBOUND can eventually be removed (Marco Crivellari)"

* tag 'pci-v7.0-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (176 commits)
  PCI/bwctrl: Disable BW controller on Intel P45 using a quirk
  PCI: Disable ACS SV for IDT 0x8090 switch
  PCI: Disable ACS SV for IDT 0x80b5 switch
  PCI: Cache ACS Capabilities register
  PCI: Enable ACS after configuring IOMMU for OF platforms
  PCI: Add ACS quirk for Pericom PI7C9X2G404 switches [12d8:b404]
  PCI: Add ACS quirk for Qualcomm Hamoa & Glymur
  PCI: Use device_lock_assert() to verify device lock is held
  PCI: Use lockdep_assert_held(pci_bus_sem) to verify lock is held
  PCI: Fix pci_slot_lock () device locking
  PCI: Fix pci_slot_trylock() error handling
  PCI: Mark Nvidia GB10 to avoid bus reset
  PCI: Mark ASM1164 SATA controller to avoid bus reset
  PCI: host-generic: Avoid reporting incorrect 'missing reg property' error
  PCI/PME: Replace RMW of Root Status register with direct write
  PCI/AER: Clear stale errors on reporting agents upon probe
  PCI: Don't claim disabled bridge windows
  PCI: rzg3s-host: Fix device node reference leak in rzg3s_pcie_host_parse_port()
  PCI: dwc: Fix missing iATU setup when ECAM is enabled
  PCI: dwc: Clean up iATU index usage in dw_pcie_iatu_setup()
  ...
2026-02-11 17:20:38 -08:00
Linus Torvalds
db9571a661 Merge tag 'printk-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:

 - Check all mandatory callbacks when registering nbcon consoles

 - Fix some compiler warnings

* tag 'printk-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  vsnprintf: drop __printf() attributes on binary printing functions
  printf: convert test_hashed into macro
  printk: nbcon: Check for device_{lock,unlock} callbacks
2026-02-11 14:36:47 -08:00
Linus Torvalds
41f1a08645 Merge tag 'kbuild-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild/Kconfig updates from Nathan Chancellor:
 "Kbuild:

   - Drop '*_probe' pattern from modpost section check allowlist, which
     hid legitimate warnings (Johan Hovold)

   - Disable -Wtype-limits altogether, instead of enabling at W=2
     (Vincent Mailhol)

   - Improve UAPI testing to skip testing headers that require a libc
     when CONFIG_CC_CAN_LINK is not set, opening up testing of headers
     with no libc dependencies to more environments (Thomas Weißschuh)

   - Update gendwarfksyms documentation with required dependencies
     (Jihan LIN)

   - Reject invalid LLVM= values to avoid unintentionally falling back
     to system toolchain (Thomas Weißschuh)

   - Add a script to help run the kernel build process in a container
     for consistent environments and testing (Guillaume Tucker)

   - Simplify kallsyms by getting rid of the relative base (Ard
     Biesheuvel)

   - Performance and usability improvements to scripts/make_fit.py
     (Simon Glass)

   - Minor various clean ups and fixes

  Kconfig:

   - Move XPM icons to individual files, clearing up GTK deprecation
     warnings (Rostislav Krasny)

   - Support

        depends on FOO if BAR

     as syntactic sugar for

        depends on FOO || !BAR

     (Nicolas Pitre, Graham Roff)

   - Refactor merge_config.sh to use awk over shell/sed/grep,
     dramatically speeding up processing large number of config
     fragments (Anders Roxell, Mikko Rapeli)"

* tag 'kbuild-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux: (39 commits)
  kbuild: remove dependency of run-command on config
  scripts/make_fit: Compress dtbs in parallel
  scripts/make_fit: Support a few more parallel compressors
  kbuild: Support a FIT_EXTRA_ARGS environment variable
  scripts/make_fit: Move dtb processing into a function
  scripts/make_fit: Support an initial ramdisk
  scripts/make_fit: Speed up operation
  rust: kconfig: Don't require RUST_IS_AVAILABLE for rustc-option
  MAINTAINERS: Add scripts/install.sh into Kbuild entry
  modpost: Amend ppc64 save/restfpr symnames for -Os build
  MIPS: tools: relocs: Ship a definition of R_MIPS_PC32
  streamline_config.pl: remove superfluous exclamation mark
  kbuild: dummy-tools: Add python3
  scripts: kconfig: merge_config.sh: warn on duplicate input files
  scripts: kconfig: merge_config.sh: use awk in checks too
  scripts: kconfig: merge_config.sh: refactor from shell/sed/grep to awk
  kallsyms: Get rid of kallsyms relative base
  mips: Add support for PC32 relocations in vmlinux
  Documentation: dev-tools: add container.rst page
  scripts: add tool to run containerized builds
  ...
2026-02-11 13:40:35 -08:00
Linus Torvalds
38ef046544 Merge tag 'sched_ext-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:

 - Move C example schedulers back from the external scx repo to
   tools/sched_ext as the authoritative source. scx_userland and
   scx_pair are returning while scx_sdt (BPF arena-based task data
   management) is new. These schedulers will be dropped from the
   external repo.

 - Improve error reporting by adding scx_bpf_error() calls when DSQ
   creation fails across all in-tree schedulers

 - Avoid redundant irq_work_queue() calls in destroy_dsq() by only
   queueing when llist_add() indicates an empty list

 - Fix flaky init_enable_count selftest by properly synchronizing
   pre-forked children using a pipe instead of sleep()

* tag 'sched_ext-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  selftests/sched_ext: Fix init_enable_count flakiness
  tools/sched_ext: Fix data header access during free in scx_sdt
  tools/sched_ext: Add error logging for dsq creation failures in remaining schedulers
  tools/sched_ext: add arena based scheduler
  tools/sched_ext: add scx_pair scheduler
  tools/sched_ext: add scx_userland scheduler
  sched_ext: Add error logging for dsq creation failures
  sched_ext: Avoid multiple irq_work_queue() calls in destroy_dsq()
2026-02-11 13:35:24 -08:00
Linus Torvalds
ff661eeee2 Merge tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - cpuset changes:

    - Continue separating v1 and v2 implementations by moving more
      v1-specific logic into cpuset-v1.c

    - Improve partition handling. Sibling partitions are no longer
      invalidated on cpuset.cpus conflict, cpuset.cpus changes no longer
      fail in v2, and effective_xcpus computation is made consistent

    - Fix partition effective CPUs overlap that caused a warning on
      cpuset removal when sibling partitions shared CPUs

 - Increase the maximum cgroup subsystem count from 16 to 32 to
   accommodate future subsystem additions

 - Misc cleanups and selftest improvements including switching to
   css_is_online() helper, removing dead code and stale documentation
   references, using lockdep_assert_cpuset_lock_held() consistently, and
   adding polling helpers for asynchronously updated cgroup statistics

* tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
  cpuset: fix overlap of partition effective CPUs
  cgroup: increase maximum subsystem count from 16 to 32
  cgroup: Remove stale cpu.rt.max reference from documentation
  cpuset: replace direct lockdep_assert_held() with lockdep_assert_cpuset_lock_held()
  cgroup/cpuset: Move the v1 empty cpus/mems check to cpuset1_validate_change()
  cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict
  cgroup/cpuset: Don't fail cpuset.cpus change in v2
  cgroup/cpuset: Consistently compute effective_xcpus in update_cpumasks_hier()
  cgroup/cpuset: Streamline rm_siblings_excl_cpus()
  cpuset: remove dead code in cpuset-v1.c
  cpuset: remove v1-specific code from generate_sched_domains
  cpuset: separate generate_sched_domains for v1 and v2
  cpuset: move update_domain_attr_tree to cpuset_v1.c
  cpuset: add cpuset1_init helper for v1 initialization
  cpuset: add cpuset1_online_css helper for v1-specific operations
  cpuset: add lockdep_assert_cpuset_lock_held helper
  cpuset: Remove unnecessary checks in rebuild_sched_domains_locked
  cgroup: switch to css_is_online() helper
  selftests: cgroup: Replace sleep with cg_read_key_long_poll() for waiting on nr_dying_descendants
  selftests: cgroup: make test_memcg_sock robust against delayed sock stats
  ...
2026-02-11 13:20:50 -08:00
Linus Torvalds
9bdc64892d Merge tag 'wq-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:

 - Rework the rescuer to process work items one-by-one instead of
   slurping all pending work items in a single pass.

   As there is only one rescuer per workqueue, a single long-blocking
   work item could cause high latency for all tasks queued behind it,
   even after memory pressure is relieved and regular kworkers become
   available to service them.

 - Add CONFIG_BOOTPARAM_WQ_STALL_PANIC build-time option and
   workqueue.panic_on_stall_time parameter for time-based stall panic,
   giving systems more control over workqueue stall handling.

 - Replace BUG_ON() with panic() in the stall panic path for clearer
   intent and more informative output.

* tag 'wq-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: replace BUG_ON with panic in panic_on_wq_watchdog
  workqueue: add time-based panic for stalls
  workqueue: add CONFIG_BOOTPARAM_WQ_STALL_PANIC option
  workqueue: Process extra works in rescuer on memory pressure
  workqueue: Process rescuer work items one-by-one using a cursor
  workqueue: Make send_mayday() take a PWQ argument directly
2026-02-11 13:13:32 -08:00
Thomas Gleixner
1e83ccd592 sched/mmcid: Don't assume CID is CPU owned on mode switch
Shinichiro reported a KASAN UAF, which is actually an out of bounds access
in the MMCID management code.

   CPU0						CPU1
   						T1 runs in userspace
   T0: fork(T4) -> Switch to per CPU CID mode
         fixup() set MM_CID_TRANSIT on T1/CPU1
   T4 exit()
   T3 exit()
   T2 exit()
						T1 exit() switch to per task mode
						 ---> Out of bounds access.

As T1 has not scheduled after T0 set the TRANSIT bit, it exits with the
TRANSIT bit set. sched_mm_cid_remove_user() clears the TRANSIT bit in
the task and drops the CID, but it does not touch the per CPU storage.
That's functionally correct because a CID is only owned by the CPU when
the ONCPU bit is set, which is mutually exclusive with the TRANSIT flag.

Now sched_mm_cid_exit() assumes that the CID is CPU owned because the
prior mode was per CPU. It invokes mm_drop_cid_on_cpu() which clears the
not set ONCPU bit and then invokes clear_bit() with an insanely large
bit number because TRANSIT is set (bit 29).

Prevent that by actually validating that the CID is CPU owned in
mm_drop_cid_on_cpu().

Fixes: 007d84287c ("sched/mmcid: Drop per CPU CID immediately when switching to per task mode")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org
Closes: https://lore.kernel.org/aYsZrixn9b6s_2zL@shinmob
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-11 12:59:56 -08:00
Masami Hiramatsu (Google)
804c4a2209 tracing: Reset last_boot_info if ring buffer is reset
Commit 32dc004252 ("tracing: Reset last-boot buffers when reading
out all cpu buffers") resets the last_boot_info when user read out
all data via trace_pipe* files. But it is not reset when user
resets the buffer from other files. (e.g. write `trace` file)

Reset it when the corresponding ring buffer is reset too.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/177071302364.2293046.17895165659153977720.stgit@mhiramat.tok.corp.google.com
Fixes: 32dc004252 ("tracing: Reset last-boot buffers when reading out all cpu buffers")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-11 10:49:48 -05:00
Masami Hiramatsu (Google)
f844282dee tracing: Fix to set write permission to per-cpu buffer_size_kb
Since the per-cpu buffer_size_kb file is writable for changing
per-cpu ring buffer size, the file should have the write access
permission.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/177071301597.2293046.11683339475076917920.stgit@mhiramat.tok.corp.google.com
Fixes: 21ccc9cd72 ("tracing: Disable "other" permission bits in the tracefs files")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-11 10:48:44 -05:00
Petr Mladek
9abbecf408 Merge branch 'for-6.20' into for-linus 2026-02-11 10:14:35 +01:00
Linus Torvalds
192c015940 Merge tag 'powerpc-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates for 7.0

 - Implement masked user access

 - Add bpf support for internal only per-CPU instructions and inline the
   bpf_get_smp_processor_id() and bpf_get_current_task() functions

 - Fix pSeries MSI-X allocation failure when quota is exceeded

 - Fix recursive pci_lock_rescan_remove locking in EEH event handling

 - Support tailcalls with subprogs & BPF exceptions on 64bit

 - Extend "trusted" keys to support the PowerVM Key Wrapping Module
   (PKWM)

Thanks to Abhishek Dubey, Christophe Leroy, Gaurav Batra, Guangshuo Li,
Jarkko Sakkinen, Mahesh Salgaonkar, Mimi Zohar, Miquel Sabaté Solà, Nam
Cao, Narayana Murty N, Nayna Jain, Nilay Shroff, Puranjay Mohan, Saket
Kumar Bhaskar, Sourabh Jain, Srish Srinivasan, and Venkat Rao Bagalkote.

* tag 'powerpc-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (27 commits)
  powerpc/pseries: plpks: export plpks_wrapping_is_supported
  docs: trusted-encryped: add PKWM as a new trust source
  keys/trusted_keys: establish PKWM as a trusted source
  pseries/plpks: add HCALLs for PowerVM Key Wrapping Module
  pseries/plpks: expose PowerVM wrapping features via the sysfs
  powerpc/pseries: move the PLPKS config inside its own sysfs directory
  pseries/plpks: fix kernel-doc comment inconsistencies
  powerpc/smp: Add check for kcalloc() failure in parse_thread_groups()
  powerpc: kgdb: Remove OUTBUFMAX constant
  powerpc64/bpf: Additional NVR handling for bpf_throw
  powerpc64/bpf: Support exceptions
  powerpc64/bpf: Add arch_bpf_stack_walk() for BPF JIT
  powerpc64/bpf: Avoid tailcall restore from trampoline
  powerpc64/bpf: Support tailcalls with subprogs
  powerpc64/bpf: Moving tail_call_cnt to bottom of frame
  powerpc/eeh: fix recursive pci_lock_rescan_remove locking in EEH event handling
  powerpc/pseries: Fix MSI-X allocation failure when quota is exceeded
  powerpc/iommu: bypass DMA APIs for coherent allocations for pre-mapped memory
  powerpc64/bpf: Inline bpf_get_smp_processor_id() and bpf_get_current_task/_btf()
  powerpc64/bpf: Support internal-only MOV instruction to resolve per-CPU addrs
  ...
2026-02-10 21:46:12 -08:00
Breno Leitao
60325c27d3 printk: Add execution context (task name/CPU) to printk_info
Extend struct printk_info to include the task name, pid, and CPU
number where printk messages originate. This information is captured
at vprintk_store() time and propagated through printk_message to
nbcon_write_context, making it available to nbcon console drivers.

This is useful for consoles like netconsole that want to include
execution context in their output, allowing correlation of messages
with specific tasks and CPUs regardless of where the console driver
actually runs.

The feature is controlled by CONFIG_PRINTK_EXECUTION_CTX, which is
automatically selected by CONFIG_NETCONSOLE_DYNAMIC. When disabled,
the helper functions compile to no-ops with no overhead.

Suggested-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20260206-nbcon-v7-1-62bda69b1b41@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-10 19:51:56 -08:00