Introduce new DMA mapping functions dma_map_phys() and dma_unmap_phys()
that operate directly on physical addresses instead of page+offset
parameters. This provides a more efficient interface for drivers that
already have physical addresses available.
The new functions are implemented as the primary mapping layer, with
the existing dma_map_page_attrs()/dma_map_resource() and
dma_unmap_page_attrs()/dma_unmap_resource() functions converted to simple
wrappers around the phys-based implementations.
In case dma_map_page_attrs(), the struct page is converted to physical
address with help of page_to_phys() function and dma_map_resource()
provides physical address as is together with addition of DMA_ATTR_MMIO
attribute.
The old page-based API is preserved in mapping.c to ensure that existing
code won't be affected by changing EXPORT_SYMBOL to EXPORT_SYMBOL_GPL
variant for dma_*map_phys().
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/54cc52af91777906bbe4a386113437ba0bcfba9c.1757423202.git.leonro@nvidia.com
Make dma_map_page_attrs() and dma_map_page_attrs() respect
DMA_ATTR_MMIO.
DMA_ATR_MMIO makes the functions behave the same as
dma_(un)map_resource():
- No swiotlb is possible
- Legacy dma_ops arches use ops->map_resource()
- No kmsan
- No arch_dma_map_phys_direct()
The prior patches have made the internal functions called here
support DMA_ATTR_MMIO.
This is also preparation for turning dma_map_resource() into an inline
calling dma_map_phys(DMA_ATTR_MMIO) to consolidate the flows.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/3660e2c78ea409d6c483a215858fb3af52cd0ed3.1757423202.git.leonro@nvidia.com
Convert the DMA direct mapping functions to accept physical addresses
directly instead of page+offset parameters. The functions were already
operating on physical addresses internally, so this change eliminates
the redundant page-to-physical conversion at the API boundary.
The functions dma_direct_map_page() and dma_direct_unmap_page() are
renamed to dma_direct_map_phys() and dma_direct_unmap_phys() respectively,
with their calling convention changed from (struct page *page,
unsigned long offset) to (phys_addr_t phys).
Architecture-specific functions arch_dma_map_page_direct() and
arch_dma_unmap_page_direct() are similarly renamed to
arch_dma_map_phys_direct() and arch_dma_unmap_phys_direct().
The is_pci_p2pdma_page() checks are replaced with DMA_ATTR_MMIO checks
to allow integration with dma_direct_map_resource and dma_direct_map_phys()
is extended to support MMIO path either.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/bb15a22f76dc2e26683333ff54e789606cfbfcf0.1757423202.git.leonro@nvidia.com
Rename the IOMMU DMA mapping functions to better reflect their actual
calling convention. The functions iommu_dma_map_page() and
iommu_dma_unmap_page() are renamed to iommu_dma_map_phys() and
iommu_dma_unmap_phys() respectively, as they already operate on physical
addresses rather than page structures.
The calling convention changes from accepting (struct page *page,
unsigned long offset) to (phys_addr_t phys), which eliminates the need
for page-to-physical address conversion within the functions. This
renaming prepares for the broader DMA API conversion from page-based
to physical address-based mapping throughout the kernel.
All callers are updated to pass physical addresses directly, including
dma_map_page_attrs(), scatterlist mapping functions, and DMA page
allocation helpers. The change simplifies the code by removing the
page_to_phys() + offset calculation that was previously done inside
the IOMMU functions.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/ed172f95f8f57782beae04f782813366894e98df.1757423202.git.leonro@nvidia.com
Convert the DMA debug infrastructure from page-based to physical address-based
mapping as a preparation to rely on physical address for DMA mapping routines.
The refactoring renames debug_dma_map_page() to debug_dma_map_phys() and
changes its signature to accept a phys_addr_t parameter instead of struct page
and offset. Similarly, debug_dma_unmap_page() becomes debug_dma_unmap_phys().
A new dma_debug_phy type is introduced to distinguish physical address mappings
from other debug entry types. All callers throughout the codebase are updated
to pass physical addresses directly, eliminating the need for page-to-physical
conversion in the debug layer.
This refactoring eliminates the need to convert between page pointers and
physical addresses in the debug layer, making the code more efficient and
consistent with the DMA mapping API's physical address focus.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
[mszyprow: added a fixup]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/56d1a6769b68dfcbf8b26a75a7329aeb8e3c3b6a.1757423202.git.leonro@nvidia.com
Link: https://lore.kernel.org/all/20250910052618.GH341237@unreal/
When CONFIG_DMA_DIRECT_REMAP is enabled, atomic pool pages are
remapped via dma_common_contiguous_remap() using the supplied
pgprot. Currently, the mapping uses
pgprot_dmacoherent(PAGE_KERNEL), which leaves the memory encrypted
on systems with memory encryption enabled (e.g., ARM CCA Realms).
This can cause the DMA layer to fail or crash when accessing the
memory, as the underlying physical pages are not configured as
expected.
Fix this by requesting a decrypted mapping in the vmap() call:
pgprot_decrypted(pgprot_dmacoherent(PAGE_KERNEL))
This ensures that atomic pool memory is consistently mapped
unencrypted.
Cc: stable@vger.kernel.org
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250811181759.998805-1-sdonthineni@nvidia.com
Pull smp fixes from Borislav Petkov:
- Remove an obsolete comment and fix spelling
* tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu: Remove obsolete comment from takedown_cpu()
smp: Fix spelling in on_each_cpu_cond_mask()'s doc-comment
Pull irq fixes from Borislav Petkov:
- Fix a wrong ioremap size in mvebu-gicp
- Remove yet another compile-test case for a driver which needs an
additional dependency
- Fix a lock inversion scenario in the IRQ unit test suite
- Remove an impossible flag situation in gic-v5
- Do not iounmap resources in gic-v5 which are managed by devm
- Make sure stale, left-over interrupts in mvebu-gicp are cleared on
driver init
- Fix a reference counting mishap in msi-lib
- Fix a dereference-before-null-ptr-check case in the riscv-imsic
irqchip driver
* tag 'irq_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/mvebu-gicp: Use resource_size() for ioremap()
irqchip: Build IMX_MU_MSI only on ARM
genirq/test: Resolve irq lock inversion warnings
irqchip/gic-v5: Remove IRQD_RESEND_WHEN_IN_PROGRESS for ITS IRQs
irqchip/gic-v5: iwb: Fix iounmap probe failure path
irqchip/mvebu-gicp: Clear pending interrupts on init
irqchip/msi-lib: Fix fwnode refcount in msi_lib_irq_domain_select()
irqchip/riscv-imsic: Don't dereference before NULL pointer check
Pull locking fix from Borislav Petkov:
- Prevent a futex hash leak due to different mm lifetimes
* tag 'locking_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Move futex cleanup to __mmdrop()
Pull bpf fixes from Alexei Starovoitov:
- Fix memory leak of bpf_scc_info objects (Eduard Zingerman)
- Fix a regression in the 'perf' tool caused by moving UID filtering to
BPF (Ilya Leoshkevich)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
perf bpf-filter: Enable events manually
libbpf: Add the ability to suppress perf event enablement
bpf: Fix memory leak of bpf_scc_info objects
takedown_cpu() has a comment about "all preempt/rcu users must observe
!cpu_active()" which is kind of meaningless in this function. This
comment was originally introduced by commit 6acce3ef84 ("sched: Remove
get_online_cpus() usage") when _cpu_down() was setting cpu_active_mask
and synchronize_rcu()/synchronize_sched() were added after that.
Later commit 40190a78f8 ("sched/hotplug: Convert cpu_[in]active
notifiers to state machine") added a new CPUHP_AP_ACTIVE hotplug
state to set/clear cpu_active_mask. The following commit b2454caa89
("sched/hotplug: Move sync_rcu to be with set_cpu_active(false)")
move the synchronize_*() calls to sched_cpu_deactivate() associated
with the new hotplug state, but left the comment behind.
Remove this comment as it is no longer relevant in takedown_cpu().
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250729191232.664931-1-longman@redhat.com
irq_shutdown_and_deactivate() is normally called with the descriptor lock
held, and interrupts disabled. Nested a few levels down, it grabs the
global irq_resend_lock. Lockdep rightfully complains when interrupts are
not disabled:
CPU0 CPU1
---- ----
lock(irq_resend_lock);
local_irq_disable();
lock(&irq_desc_lock_class);
lock(irq_resend_lock);
<Interrupt>
lock(&irq_desc_lock_class);
...
_raw_spin_lock+0x2b/0x40
clear_irq_resend+0x14/0x70
irq_shutdown_and_deactivate+0x29/0x80
irq_shutdown_depth_test+0x1ce/0x600
kunit_try_run_case+0x90/0x120
Grab the descriptor lock and disable interrupts, to resolve the
problem.
Fixes: 66067c3c8a ("genirq: Add kunit tests for depth counts")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/all/aJJONEIoIiTSDMqc@google.com
Closes: https://lore.kernel.org/lkml/31a761e4-8f81-40cf-aaf5-d220ba11911c@roeck-us.net/
Pull Kbuild updates from Masahiro Yamada:
"This is the last pull request from me.
I'm grateful to have been able to continue as a maintainer for eight
years. From the next cycle, Nathan and Nicolas will maintain Kbuild.
- Fix a shortcut key issue in menuconfig
- Fix missing rebuild of kheaders
- Sort the symbol dump generated by gendwarfsyms
- Support zboot extraction in scripts/extract-vmlinux
- Migrate gconfig to GTK 3
- Add TAR variable to allow overriding the default tar command
- Hand over Kbuild maintainership"
* tag 'kbuild-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (92 commits)
MAINTAINERS: hand over Kbuild maintenance
kheaders: make it possible to override TAR
kbuild: userprogs: use correct linker when mixing clang and GNU ld
kconfig: lxdialog: replace strcpy() with strncpy() in inputbox.c
kconfig: lxdialog: replace strcpy with snprintf in print_autowrap
kconfig: gconf: refactor text_insert_help()
kconfig: gconf: remove unneeded variable in text_insert_msg
kconfig: gconf: use hyphens in signals
kconfig: gconf: replace GtkImageMenuItem with GtkMenuItem
kconfig: gconf: Fix Back button behavior
kconfig: gconf: fix single view to display dependent symbols correctly
scripts: add zboot support to extract-vmlinux
gendwarfksyms: order -T symtypes output by name
gendwarfksyms: use preferred form of sizeof for allocation
kconfig: qconf: confine {begin,end}Group to constructor and destructor
kconfig: qconf: fix ConfigList::updateListAllforAll()
kconfig: add a function to dump all menu entries in a tree-like format
kconfig: gconf: show GTK version in About dialog
kconfig: gconf: replace GtkHPaned and GtkVPaned with GtkPaned
kconfig: gconf: replace GdkColor with GdkRGBA
...
Pull perf fixes from Thomas Gleixner:
"Perf fixes for perf_mmap() reference counting to prevent potential
reference count leaks which are caused by:
- VMA splits, which change the offset or size of a mapping, which
causes perf_mmap_close() to ignore the unmap or unmap the wrong
buffer.
- Several internal issues of perf_mmap(), which can cause reference
count leaks in the perf mmap, corrupt accounting or cause leaks in
perf drivers.
The main fix is to prevent VMA splits by implementing the
[may_]split() callback for vm operations.
The other issues are addressed by rearranging code, early returns on
failure and invocation of cleanups.
Also provide a selftest to validate the fixes.
The reference counting should be converted to refcount_t, but that
requires larger refactoring of the code and will be done once these
fixes are upstream"
* tag 'perf-fixes-27504' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git:
selftests/perf_events: Add a mmap() correctness test
perf/core: Prevent VMA split of buffer mappings
perf/core: Handle buffer mapping fail correctly in perf_mmap()
perf/core: Exit early on perf_mmap() fail
perf/core: Don't leak AUX buffer refcount on allocation failure
perf/core: Preserve AUX buffer allocation failure result
Commit 86cdd2fdc4 ("kheaders: make headers archive reproducible")
introduced a number of options specific to GNU tar to the `tar`
invocation in `gen_kheaders.sh` script. This causes the script to fail
to work on systems where `tar` is not GNU tar. This can occur e.g.
on recent Gentoo Linux installations that support using bsdtar from
libarchive instead.
Add a `TAR` make variable to make it possible to override the tar
executable used, e.g. by specifying:
make TAR=gtar
Link: https://bugs.gentoo.org/884061
Reported-by: Sam James <sam@gentoo.org>
Tested-by: Sam James <sam@gentoo.org>
Co-developed-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Michał Górny <mgorny@gentoo.org>
Signed-off-by: Sam James <sam@gentoo.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
The perf mmap code is careful about mmap()'ing the user page with the
ringbuffer and additionally the auxiliary buffer, when the event supports
it. Once the first mapping is established, subsequent mapping have to use
the same offset and the same size in both cases. The reference counting for
the ringbuffer and the auxiliary buffer depends on this being correct.
Though perf does not prevent that a related mapping is split via mmap(2),
munmap(2) or mremap(2). A split of a VMA results in perf_mmap_open() calls,
which take reference counts, but then the subsequent perf_mmap_close()
calls are not longer fulfilling the offset and size checks. This leads to
reference count leaks.
As perf already has the requirement for subsequent mappings to match the
initial mapping, the obvious consequence is that VMA splits, caused by
resizing of a mapping or partial unmapping, have to be prevented.
Implement the vm_operations_struct::may_split() callback and return
unconditionally -EINVAL.
That ensures that the mapping offsets and sizes cannot be changed after the
fact. Remapping to a different fixed address with the same size is still
possible as it takes the references for the new mapping and drops those of
the old mapping.
Fixes: 45bfb2e504 ("perf: Add AUX area to ring buffer for raw data streams")
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-27504
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: stable@vger.kernel.org
After successful allocation of a buffer or a successful attachment to an
existing buffer perf_mmap() tries to map the buffer read only into the page
table. If that fails, the already set up page table entries are zapped, but
the other perf specific side effects of that failure are not handled. The
calling code just cleans up the VMA and does not invoke perf_mmap_close().
This leaks reference counts, corrupts user->vm accounting and also results
in an unbalanced invocation of event::event_mapped().
Cure this by moving the event::event_mapped() invocation before the
map_range() call so that on map_range() failure perf_mmap_close() can be
invoked without causing an unbalanced event::event_unmapped() call.
perf_mmap_close() undoes the reference counts and eventually frees buffers.
Fixes: b709eb872e ("perf: map pages in advance")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: stable@vger.kernel.org
When perf_mmap() fails to allocate a buffer, it still invokes the
event_mapped() callback of the related event. On X86 this might increase
the perf_rdpmc_allowed reference counter. But nothing undoes this as
perf_mmap_close() is never called in this case, which causes another
reference count leak.
Return early on failure to prevent that.
Fixes: 1e0fb9ec67 ("perf: Add pmu callbacks to track event mapping and unmapping")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: stable@vger.kernel.org
Failure of the AUX buffer allocation leaks the reference count.
Set the reference count to 1 only when the allocation succeeds.
Fixes: 45bfb2e504 ("perf: Add AUX area to ring buffer for raw data streams")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: stable@vger.kernel.org
A recent overhaul sets the return value to 0 unconditionally after the
allocations, which causes reference count leaks and corrupts the user->vm
accounting.
Preserve the AUX buffer allocation failure return value, so that the
subsequent code works correctly.
Fixes: 0983593f32 ("perf/core: Lift event->mmap_mutex in perf_mmap()")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: stable@vger.kernel.org
Pull more MM updates from Andrew Morton:
"Significant patch series in this pull request:
- "mseal cleanups" (Lorenzo Stoakes)
Some mseal cleaning with no intended functional change.
- "Optimizations for khugepaged" (David Hildenbrand)
Improve khugepaged throughput by batching PTE operations for large
folios. This gain is mainly for arm64.
- "x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes" (Mike Rapoport)
A bugfix, additional debug code and cleanups to the execmem code.
- "mm/shmem, swap: bugfix and improvement of mTHP swap in" (Kairui Song)
Bugfixes, cleanups and performance improvememnts to the mTHP swapin
code"
* tag 'mm-stable-2025-08-03-12-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (38 commits)
mm: mempool: fix crash in mempool_free() for zero-minimum pools
mm: correct type for vmalloc vm_flags fields
mm/shmem, swap: fix major fault counting
mm/shmem, swap: rework swap entry and index calculation for large swapin
mm/shmem, swap: simplify swapin path and result handling
mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO
mm/shmem, swap: tidy up swap entry splitting
mm/shmem, swap: tidy up THP swapin checks
mm/shmem, swap: avoid redundant Xarray lookup during swapin
x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations
x86/kprobes: enable EXECMEM_ROX_CACHE for kprobes allocations
execmem: drop writable parameter from execmem_fill_trapping_insns()
execmem: add fallback for failures in vmalloc(VM_ALLOW_HUGE_VMAP)
execmem: move execmem_force_rw() and execmem_restore_rox() before use
execmem: rework execmem_cache_free()
execmem: introduce execmem_alloc_rw()
execmem: drop unused execmem_update_copy()
mm: fix a UAF when vma->mm is freed after vma->vm_refcnt got dropped
mm/rmap: add anon_vma lifetime debug check
mm: remove mm/io-mapping.c
...
Pull printk updates from Petr Mladek:
- Add new "hash_pointers=[auto|always|never]" boot parameter to force
the hashing even with "slab_debug" enabled
- Allow to stop CPU, after losing nbcon console ownership during
panic(), even without proper NMI
- Allow to use the printk kthread immediately even for the 1st
registered nbcon
- Compiler warning removal
* tag 'printk-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: nbcon: Allow reacquire during panic
printk: Allow to use the printk kthread immediately even for 1st nbcon
slab: Decouple slab_debug and no_hash_pointers
vsprintf: Use __diag macros to disable '-Wsuggest-attribute=format'
compiler-gcc.h: Introduce __diag_GCC_all
With the seqcount moved out of the group into a global psi_seq,
re-initializing the seqcount on group creation is causing seqcount
corruption.
Fixes: 570c8efd5e ("sched/psi: Optimize psi_group_change() cpu_clock() usage")
Reported-by: Chris Mason <clm@meta.com>
Suggested-by: Beata Michalska <beata.michalska@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull non-MM updates from Andrew Morton:
"Significant patch series in this pull request:
- "squashfs: Remove page->mapping references" (Matthew Wilcox) gets
us closer to being able to remove page->mapping
- "relayfs: misc changes" (Jason Xing) does some maintenance and
minor feature addition work in relayfs
- "kdump: crashkernel reservation from CMA" (Jiri Bohac) switches
us from static preallocation of the kdump crashkernel's working
memory over to dynamic allocation. So the difficulty of a-priori
estimation of the second kernel's needs is removed and the first
kernel obtains extra memory
- "generalize panic_print's dump function to be used by other
kernel parts" (Feng Tang) implements some consolidation and
rationalization of the various ways in which a failing kernel
splats information at the operator
* tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (80 commits)
tools/getdelays: add backward compatibility for taskstats version
kho: add test for kexec handover
delaytop: enhance error logging and add PSI feature description
samples: Kconfig: fix spelling mistake "instancess" -> "instances"
fat: fix too many log in fat_chain_add()
scripts/spelling.txt: add notifer||notifier to spelling.txt
xen/xenbus: fix typo "notifer"
net: mvneta: fix typo "notifer"
drm/xe: fix typo "notifer"
cxl: mce: fix typo "notifer"
KVM: x86: fix typo "notifer"
MAINTAINERS: add maintainers for delaytop
ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()
ucount: fix atomic_long_inc_below() argument type
kexec: enable CMA based contiguous allocation
stackdepot: make max number of pools boot-time configurable
lib/xxhash: remove unused functions
init/Kconfig: restore CONFIG_BROKEN help text
lib/raid6: update recov_rvv.c zero page usage
docs: update docs after introducing delaytop
...
Pull more tracing updates from Steven Rostedt:
- Remove unneeded goto out statements
Over time, the logic was restructured but left a "goto out" where the
out label simply did a "return ret;". Instead of jumping to this out
label, simply return immediately and remove the out label.
- Add guard(ring_buffer_nest)
Some calls to the tracing ring buffer can happen when the ring buffer
is already being written to at the same context (for example, a
trace_printk() in between a ring_buffer_lock_reserve() and a
ring_buffer_unlock_commit()).
In order to not trigger the recursion detection, these functions use
ring_buffer_nest_start() and ring_buffer_nest_end(). Create a guard()
for these functions so that their use cases can be simplified and not
need to use goto for the release.
- Clean up the tracing code with guard() and __free() logic
There were several locations that were prime candidates for using
guard() and __free() helpers. Switch them over to use them.
- Fix output of function argument traces for unsigned int values
The function tracer with "func-args" option set will record up to 6
argument registers and then use BTF to format them for human
consumption when the trace file is read. There are several arguments
that are "unsigned long" and even "unsigned int" that are either and
address or a mask. It is easier to understand if they were printed
using hexadecimal instead of decimal. The old method just printed all
non-pointer values as signed integers, which made it even worse for
unsigned integers.
For instance, instead of:
__local_bh_disable_ip(ip=-2127311112, cnt=256) <-handle_softirqs
show:
__local_bh_disable_ip(ip=0xffffffff8133cef8, cnt=0x100) <-handle_softirqs"
* tag 'trace-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Have unsigned int function args displayed as hexadecimal
ring-buffer: Convert ring_buffer_write() to use guard(preempt_notrace)
tracing: Use __free(kfree) in trace.c to remove gotos
tracing: Add guard() around locks and mutexes in trace.c
tracing: Add guard(ring_buffer_nest)
tracing: Remove unneeded goto out logic
Pull module updates from Daniel Gomez:
"This is a small set of changes for modules, primarily to extend module
users to use the module data structures in combination with the
already no-op stub module functions, even when support for modules is
disabled in the kernel configuration. This change follows the kernel's
coding style for conditional compilation and allows kunit code to drop
all CONFIG_MODULES ifdefs, which is also part of the changes. This
should allow others part of the kernel to do the same cleanup.
The remaining changes include a fix for module name length handling
which could potentially lead to the removal of an incorrect module,
and various cleanups"
* tag 'modules-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
module: Rename MAX_PARAM_PREFIX_LEN to __MODULE_NAME_LEN
tracing: Replace MAX_PARAM_PREFIX_LEN with MODULE_NAME_LEN
module: Restore the moduleparam prefix length check
module: Remove unnecessary +1 from last_unloaded_module::name size
module: Prevent silent truncation of module name in delete_module(2)
kunit: test: Drop CONFIG_MODULE ifdeffery
module: make structure definitions always visible
module: move 'struct module_use' to internal.h
The type of u argument of atomic_long_inc_below() should be long to avoid
unwanted truncation to int.
The patch fixes the wrong argument type of an internal function to
prevent unwanted argument truncation. It fixes an internal locking
primitive; it should not have any direct effect on userspace.
Mark said
: AFAICT there's no problem in practice because atomic_long_inc_below()
: is only used by inc_ucount(), and it looks like the value is
: constrained between 0 and INT_MAX.
:
: In inc_ucount() the limit value is taken from
: user_namespace::ucount_max[], and AFAICT that's only written by
: sysctls, to the table setup by setup_userns_sysctls(), where
: UCOUNT_ENTRY() limits the value between 0 and INT_MAX.
:
: This is certainly a cleanup, but there might be no functional issue in
: practice as above.
Link: https://lkml.kernel.org/r/20250721174610.28361-1-ubizjak@gmail.com
Fixes: f9c82a4ea8 ("Increase size of ucounts to atomic_long_t")
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: MengEn Sun <mengensun@tencent.com>
Cc: "Thomas Weißschuh" <linux@weissschuh.net>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When booting a new kernel with kexec_file, the kernel picks a target
location that the kernel should live at, then allocates random pages,
checks whether any of those patches magically happens to coincide with a
target address range and if so, uses them for that range.
For every page allocated this way, it then creates a page list that the
relocation code - code that executes while all CPUs are off and we are
just about to jump into the new kernel - copies to their final memory
location. We can not put them there before, because chances are pretty
good that at least some page in the target range is already in use by the
currently running Linux environment. Copying is happening from a single
CPU at RAM rate, which takes around 4-50 ms per 100 MiB.
All of this is inefficient and error prone.
To successfully kexec, we need to quiesce all devices of the outgoing
kernel so they don't scribble over the new kernel's memory. We have seen
cases where that does not happen properly (*cough* GIC *cough*) and hence
the new kernel was corrupted. This started a month long journey to root
cause failing kexecs to eventually see memory corruption, because the new
kernel was corrupted severely enough that it could not emit output to tell
us about the fact that it was corrupted. By allocating memory for the
next kernel from a memory range that is guaranteed scribbling free, we can
boot the next kernel up to a point where it is at least able to detect
corruption and maybe even stop it before it becomes severe. This
increases the chance for successful kexecs.
Since kexec got introduced, Linux has gained the CMA framework which can
perform physically contiguous memory mappings, while keeping that memory
available for movable memory when it is not needed for contiguous
allocations. The default CMA allocator is for DMA allocations.
This patch adds logic to the kexec file loader to attempt to place the
target payload at a location allocated from CMA. If successful, it uses
that memory range directly instead of creating copy instructions during
the hot phase. To ensure that there is a safety net in case anything goes
wrong with the CMA allocation, it also adds a flag for user space to force
disable CMA allocations.
Using CMA allocations has two advantages:
1) Faster by 4-50 ms per 100 MiB. There is no more need to copy in the
hot phase.
2) More robust. Even if by accident some page is still in use for DMA,
the new kernel image will be safe from that access because it resides
in a memory region that is considered allocated in the old kernel and
has a chance to reinitialize that component.
Link: https://lkml.kernel.org/r/20250610085327.51817-1-graf@amazon.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Acked-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Zhongkun He <hezhongkun.hzk@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
env->scc_info array contains references to bpf_scc_info objects
allocated lazily in verifier.c:scc_visit_alloc().
env->scc_cnt was supposed to track env->scc_info array size
in order to free referenced objects in verifier.c:free_states().
Fix initialization of env->scc_cnt that was omitted in
verifier.c:compute_scc().
To reproduce the bug:
- build with CONFIG_DEBUG_KMEMLEAK
- boot and load bpf program with loops, e.g.:
./veristat -q pyperf180.bpf.o
- initiate memleak scan and check results:
echo scan > /sys/kernel/debug/kmemleak
cat /sys/kernel/debug/kmemleak
Fixes: c9e31900b5 ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: Jens Axboe <axboe@kernel.dk>
Closes: https://lore.kernel.org/bpf/CAADnVQKXUWg9uRCPD5ebRXwN4dmBCRUFFM7kN=GxymYz3zU25A@mail.gmail.com/T/
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250801232330.1800436-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull bpf fixes from Alexei Starovoitov:
- Fix kCFI failures in JITed BPF code on arm64 (Sami Tolvanen, Puranjay
Mohan, Mark Rutland, Maxwell Bland)
- Disallow tail calls between BPF programs that use different cgroup
local storage maps to prevent out-of-bounds access (Daniel Borkmann)
- Fix unaligned access in flow_dissector and netfilter BPF programs
(Paul Chaignon)
- Avoid possible use of uninitialized mod_len in libbpf (Achill
Gilgenast)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Test for unaligned flow_dissector ctx access
bpf: Improve ctx access verifier error message
bpf: Check netfilter ctx accesses are aligned
bpf: Check flow_dissector ctx accesses are aligned
arm64/cfi,bpf: Support kCFI + BPF on arm64
cfi: Move BPF CFI types and helpers to generic code
cfi: add C CFI type macro
libbpf: Avoid possible use of uninitialized mod_len
bpf: Fix oob access in cgroup local storage
bpf: Move cgroup iterator helpers to bpf.h
bpf: Move bpf map owner out of common struct
bpf: Add cookie object to bpf maps
Most function arguments that are passed in as unsigned int or unsigned
long are better displayed as hexadecimal than normal integer. For example,
the functions:
static void __create_object(unsigned long ptr, size_t size,
int min_count, gfp_t gfp, unsigned int objflags);
static bool stack_access_ok(struct unwind_state *state, unsigned long _addr,
size_t len);
void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);
Show up in the trace as:
__create_object(ptr=-131387050520576, size=4096, min_count=1, gfp=3264, objflags=0) <-kmem_cache_alloc_noprof
stack_access_ok(state=0xffffc9000233fc98, _addr=-60473102566256, len=8) <-unwind_next_frame
__local_bh_disable_ip(ip=-2127311112, cnt=256) <-handle_softirqs
Instead, by displaying unsigned as hexadecimal, they look more like this:
__create_object(ptr=0xffff8881028d2080, size=0x280, min_count=1, gfp=0x82820, objflags=0x0) <-kmem_cache_alloc_node_noprof
stack_access_ok(state=0xffffc90000003938, _addr=0xffffc90000003930, len=0x8) <-unwind_next_frame
__local_bh_disable_ip(ip=0xffffffff8133cef8, cnt=0x100) <-handle_softirqs
Which is much easier to understand as most unsigned longs are usually just
pointers. Even the "unsigned int cnt" in __local_bh_disable_ip() looks
better as hexadecimal as a lot of flags are passed as unsigned.
Changes since v2: https://lore.kernel.org/20250801111453.01502861@gandalf.local.home
- Use btf_int_encoding() instead of open coding it (Martin KaFai Lau)
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Link: https://lore.kernel.org/20250801165601.7770d65c@gandalf.local.home
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull virtio updates from Michael Tsirkin:
- vhost can now support legacy threading if enabled in Kconfig
- vsock memory allocation strategies for large buffers have been
improved, reducing pressure on kmalloc
- vhost now supports the in-order feature. guest bits missed the merge
window.
- fixes, cleanups all over the place
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (30 commits)
vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers
vsock/virtio: Rename virtio_vsock_skb_rx_put()
vhost/vsock: Allocate nonlinear SKBs for handling large receive buffers
vsock/virtio: Move SKB allocation lower-bound check to callers
vsock/virtio: Rename virtio_vsock_alloc_skb()
vsock/virtio: Resize receive buffers so that each SKB fits in a 4K page
vsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put()
vsock/virtio: Validate length in packet header before skb_put()
vhost/vsock: Avoid allocating arbitrarily-sized SKBs
vhost_net: basic in_order support
vhost: basic in order support
vhost: fail early when __vhost_add_used() fails
vhost: Reintroduce kthread API and add mode selection
vdpa: Fix IDR memory leak in VDUSE module exit
vdpa/mlx5: Fix release of uninitialized resources on error path
vhost-scsi: Fix check for inline_sg_cnt exceeding preallocated limit
virtio: virtio_dma_buf: fix missing parameter documentation
vhost: Fix typos
vhost: vringh: Remove unused functions
vhost: vringh: Remove unused iotlb functions
...
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Allow built-in drivers, not just modular drivers, to use async
initial probing (Lukas Wunner)
- Support Immediate Readiness even on devices with no PM Capability
(Sean Christopherson)
- Consolidate definition of PCIE_RESET_CONFIG_WAIT_MS (100ms), the
required delay between a reset and sending config requests to a
device (Niklas Cassel)
- Add pci_is_display() to check for "Display" base class and use it
in ALSA hda, vfio, vga_switcheroo, vt-d (Mario Limonciello)
- Allow 'isolated PCI functions' (multi-function devices without a
function 0) for LoongArch, similar to s390 and jailhouse (Huacai
Chen)
Power control:
- Add ability to enable optional slot clock for cases where the PCIe
host controller and the slot are supplied by different clocks
(Marek Vasut)
PCIe native device hotplug:
- Fix runtime PM ref imbalance on Hot-Plug Capable ports caused by
misinterpreting a config read failure after a device has been
removed (Lukas Wunner)
- Avoid creating a useless PCIe port service device for pciehp if the
slot is handled by the ACPI hotplug driver (Lukas Wunner)
- Ignore ACPI hotplug slots when calculating depth of pciehp hotplug
ports (Lukas Wunner)
Virtualization:
- Save VF resizable BAR state and restore it after reset (Michał
Winiarski)
- Allow IOV resources (VF BARs) to be resized (Michał Winiarski)
- Add pci_iov_vf_bar_set_size() so drivers can control VF BAR size
(Michał Winiarski)
Endpoint framework:
- Add RC-to-EP doorbell support using platform MSI controller,
including a test case (Frank Li)
- Allow BAR assignment via configfs so platforms have flexibility in
determining BAR usage (Jerome Brunet)
Native PCIe controller drivers:
- Convert amazon,al-alpine-v[23]-pcie, apm,xgene-pcie,
axis,artpec6-pcie, marvell,armada-3700-pcie, st,spear1340-pcie to
DT schema format (Rob Herring)
- Use dev_fwnode() instead of of_fwnode_handle() to remove OF
dependency in altera (fixes an unused variable), designware-host,
mediatek, mediatek-gen3, mobiveil, plda, xilinx, xilinx-dma,
xilinx-nwl (Jiri Slaby, Arnd Bergmann)
- Convert aardvark, altera, brcmstb, designware-host, iproc,
mediatek, mediatek-gen3, mobiveil, plda, rcar-host, vmd, xilinx,
xilinx-dma, xilinx-nwl from using pci_msi_create_irq_domain() to
using msi_create_parent_irq_domain() instead; this makes the
interrupt controller per-PCI device, allows dynamic allocation of
vectors after initialization, and allows support of IMS (Nam Cao)
APM X-Gene PCIe controller driver:
- Rewrite MSI handling to MSI CPU affinity, drop useless CPU hotplug
bits, use device-managed memory allocations, and clean things up
(Marc Zyngier)
- Probe xgene-msi as a standard platform driver rather than a
subsys_initcall (Marc Zyngier)
Broadcom STB PCIe controller driver:
- Add optional DT 'num-lanes' property and if present, use it to
override the Maximum Link Width advertised in Link Capabilities
(Jim Quinlan)
Cadence PCIe controller driver:
- Use PCIe Message routing types from the PCI core rather than
defining private ones (Hans Zhang)
Freescale i.MX6 PCIe controller driver:
- Add IMX8MQ_EP third 64-bit BAR in epc_features (Richard Zhu)
- Add IMX8MM_EP and IMX8MP_EP fixed 256-byte BAR 4 in epc_features
(Richard Zhu)
- Configure LUT for MSI/IOMMU in Endpoint mode so Root Complex can
trigger doorbel on Endpoint (Frank Li)
- Remove apps_reset (LTSSM_EN) from
imx_pcie_{assert,deassert}_core_reset(), which fixes a hotplug
regression on i.MX8MM (Richard Zhu)
- Delay Endpoint link start until configfs 'start' written (Richard
Zhu)
Intel VMD host bridge driver:
- Add Intel Panther Lake (PTL)-H/P/U Vendor ID (George D Sworo)
Qualcomm PCIe controller driver:
- Add DT binding and driver support for SA8255p, which supports ECAM
for Configuration Space access (Mayank Rana)
- Update DT binding and driver to describe PHYs and per-Root Port
resets in a Root Port stanza and deprecate describing them in the
host bridge; this makes it possible to support multiple Root Ports
in the future (Krishna Chaitanya Chundru)
- Add Qualcomm QCS615 to SM8150 DT binding (Ziyue Zhang)
- Add Qualcomm QCS8300 to SA8775p DT binding (Ziyue Zhang)
- Drop TBU and ref clocks from Qualcomm SM8150 and SC8180x DT
bindings (Konrad Dybcio)
- Document 'link_down' reset in Qualcomm SA8775P DT binding (Ziyue
Zhang)
- Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
(Niklas Cassel)
Rockchip PCIe controller driver:
- Drop unused PCIe Message routing and code definitions (Hans Zhang)
- Remove several unused header includes (Hans Zhang)
- Use standard PCIe config register definitions instead of
rockchip-specific redefinitions (Geraldo Nascimento)
- Set Target Link Speed to 5.0 GT/s before retraining so we have a
chance to train at a higher speed (Geraldo Nascimento)
Rockchip DesignWare PCIe controller driver:
- Prevent race between link training and register update via DBI by
inhibiting link training after hot reset and link down (Wilfred
Mallawa)
- Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
(Niklas Cassel)
Sophgo PCIe controller driver:
- Add DT binding and driver for Sophgo SG2044 PCIe controller driver
in Root Complex mode (Inochi Amaoto)
Synopsys DesignWare PCIe controller driver:
- Add required PCIE_RESET_CONFIG_WAIT_MS after waiting for Link up on
Ports that support > 5.0 GT/s. Slower Ports still rely on the
not-quite-correct PCIE_LINK_WAIT_SLEEP_MS 90ms default delay while
waiting for the Link (Niklas Cassel)"
* tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (116 commits)
dt-bindings: PCI: qcom,pcie-sa8775p: Document 'link_down' reset
dt-bindings: PCI: Remove 83xx-512x-pci.txt
dt-bindings: PCI: Convert amazon,al-alpine-v[23]-pcie to DT schema
dt-bindings: PCI: Convert marvell,armada-3700-pcie to DT schema
dt-bindings: PCI: Convert apm,xgene-pcie to DT schema
dt-bindings: PCI: Convert axis,artpec6-pcie to DT schema
dt-bindings: PCI: Convert st,spear1340-pcie to DT schema
PCI: Move is_pciehp check out of pciehp_is_native()
PCI: pciehp: Use is_pciehp instead of is_hotplug_bridge
PCI/portdrv: Use is_pciehp instead of is_hotplug_bridge
PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports
selftests: pci_endpoint: Add doorbell test case
misc: pci_endpoint_test: Add doorbell test case
PCI: endpoint: pci-epf-test: Add doorbell test support
PCI: endpoint: Add pci_epf_align_inbound_addr() helper for inbound address alignment
PCI: endpoint: pci-ep-msi: Add checks for MSI parent and mutability
PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
PCI: dwc: Add Sophgo SG2044 PCIe controller driver in Root Complex mode
PCI: vmd: Switch to msi_create_parent_irq_domain()
PCI: vmd: Convert to lock guards
...
Some calls to the tracing ring buffer can happen when the ring buffer is
already being written to by the same context (for example, a
trace_printk() in between a ring_buffer_lock_reserve() and a
ring_buffer_unlock_commit()).
In order to not trigger the recursion detection, these functions use
ring_buffer_nest_start() and ring_buffer_nest_end(). Create a guard() for
these functions so that their use cases can be simplified and not need to
use goto for the release.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203857.710501021@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>