linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-03-22 23:46:49 +08:00

Author	SHA1	Message	Date
Linus Torvalds	c574fb2ed7	Merge tag 'locking-futex-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull futex updates from Thomas Gleixner: "A set of updates for futexes and related selftests: - Plug the ptrace_may_access() race against a concurrent exec() which allows to pass the check before the target's process transition in exec() by taking a read lock on signal->ext_update_lock. - A large set of cleanups and enhancement to the futex selftests. The bulk of changes is the conversion to the kselftest harness" * tag 'locking-futex-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) selftest/futex: Fix spelling mistake "boundarie" -> "boundary" selftests/futex: Remove logging.h file selftests/futex: Drop logging.h include from futex_numa selftests/futex: Refactor futex_numa_mpol with kselftest_harness.h selftests/futex: Refactor futex_priv_hash with kselftest_harness.h selftests/futex: Refactor futex_waitv with kselftest_harness.h selftests/futex: Refactor futex_requeue with kselftest_harness.h selftests/futex: Refactor futex_wait with kselftest_harness.h selftests/futex: Refactor futex_wait_private_mapped_file with kselftest_harness.h selftests/futex: Refactor futex_wait_unitialized_heap with kselftest_harness.h selftests/futex: Refactor futex_wait_wouldblock with kselftest_harness.h selftests/futex: Refactor futex_wait_timeout with kselftest_harness.h selftests/futex: Refactor futex_requeue_pi_signal_restart with kselftest_harness.h selftests/futex: Refactor futex_requeue_pi_mismatched_ops with kselftest_harness.h selftests/futex: Refactor futex_requeue_pi with kselftest_harness.h selftests: kselftest: Create ksft_print_dbg_msg() futex: Don't leak robust_list pointer on exec race selftest/futex: Compile also with libnuma < 2.0.16 selftest/futex: Reintroduce "Memory out of range" numa_mpol's subtest selftest/futex: Make the error check more precise for futex_numa_mpol ...	2025-09-30 16:07:10 -07:00
Linus Torvalds	d8de3685f1	Merge tag 'smp-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull smp doc fixlet from Thomas Gleixner: "An update of the stale smp_call_function_many() documentation to bring it back in sync with the actual implementation" * tag 'smp-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: smp: Fix up and expand the smp_call_function_many() kerneldoc	2025-09-30 16:04:52 -07:00
Linus Torvalds	3b2074c77d	Merge tag 'irq-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq core updates from Thomas Gleixner: "A set of updates for the interrupt core subsystem: - Introduce irq_chip_[startup\|shutdown]_parent() to prepare for addressing a few short comings in the PCI/MSI interrupt subsystem. It allows to utilize the interrupt chip startup/shutdown callbacks for initializing the interrupt chip hierarchy properly on certain RISCV implementations and provides a mechanism to reduce the overhead of masking and unmasking PCI/MSI interrupts during operation when the underlying MSI provider can mask the interrupt. The actual usage comes with the interrupt driver pull request. - Add generic error handling for devm_request__irq() This allows to remove the zoo of random error printk's all over the usage sites. - Add a mechanism to warn about long-running interrupt handlers Long running interrupt handlers can introduce latencies and tracking them down is a tedious task. The tracking has to be enabled with a threshold on the kernel command line and utilizes a static branch to remove the overhead when disabled. - Update and extend the selftests which validate the CPU hotplug interrupt migration logic - Allow dropping the per CPU softirq lock on PREEMPT_RT kernels, which causes contention and latencies all over the place. The serialization requirements have been pushed down into the actual affected usage sites already. - The usual small cleanups and improvements" tag 'irq-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT softirq: Provide a handshake for canceling tasklets via polling genirq/test: Ensure CPU 1 is online for hotplug test genirq/test: Drop CONFIG_GENERIC_IRQ_MIGRATION assumptions genirq/test: Depend on SPARSE_IRQ genirq/test: Fail early if interrupt request fails genirq/test: Factor out fake-virq setup genirq/test: Select IRQ_DOMAIN genirq/test: Fix depth tests on architectures with NOREQUEST by default. genirq: Add support for warning on long-running interrupt handlers genirq/devres: Add error handling in devm_request_*_irq() genirq: Add irq_chip_(startup/shutdown)_parent() genirq: Remove GENERIC_IRQ_LEGACY	2025-09-30 15:55:25 -07:00
Linus Torvalds	1d17e808cf	Merge tag 'core-rseq-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull rseq updates from Thomas Gleixner: "Two fixes for RSEQ: - Protect the event mask modification against the membarrier() IPI as otherwise the RmW operation is unprotected and events might be lost - Fix the weak symbol reference in rseq selftests The current weak RSEQ symbols definitions which were added to allow static linkage are not working correctly as they effectively re-define the glibc symbols leading to multiple versions of the symbols when compiled with -fno-common. Mark them as 'extern' to convert them from weak symbol definitions to weak symbol references. That works with static and dynamic linkage independent of -fcommon and -fno-common" * tag 'core-rseq-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq/selftests: Use weak symbol reference, not definition, to link with glibc rseq: Protect event mask against membarrier IPI	2025-09-30 15:06:33 -07:00
Linus Torvalds	e4dcbdff11	Merge tag 'perf-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Core perf code updates: - Convert mmap() related reference counts to refcount_t. This is in reaction to the recently fixed refcount bugs, which could have been detected earlier and could have mitigated the bug somewhat (Thomas Gleixner, Peter Zijlstra) - Clean up and simplify the callchain code, in preparation for sframes (Steven Rostedt, Josh Poimboeuf) Uprobes updates: - Add support to optimize usdt probes on x86-64, which gives a substantial speedup (Jiri Olsa) - Cleanups and fixes on x86 (Peter Zijlstra) PMU driver updates: - Various optimizations and fixes to the Intel PMU driver (Dapeng Mi) Misc cleanups and fixes: - Remove redundant __GFP_NOWARN (Qianfeng Rong)" * tag 'perf-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits) selftests/bpf: Fix uprobe_sigill test for uprobe syscall error value uprobes/x86: Return error from uprobe syscall when not called from trampoline perf: Skip user unwind if the task is a kernel thread perf: Simplify get_perf_callchain() user logic perf: Use current->flags & PF_KTHREAD\|PF_USER_WORKER instead of current->mm == NULL perf: Have get_perf_callchain() return NULL if crosstask and user are set perf: Remove get_perf_callchain() init_nr argument perf/x86: Print PMU counters bitmap in x86_pmu_show_pmu_cap() perf/x86/intel: Add ICL_FIXED_0_ADAPTIVE bit into INTEL_FIXED_BITS_MASK perf/x86/intel: Change macro GLOBAL_CTRL_EN_PERF_METRICS to BIT_ULL(48) perf/x86: Add PERF_CAP_PEBS_TIMING_INFO flag perf/x86/intel: Fix IA32_PMC_x_CFG_B MSRs access error perf/x86/intel: Use early_initcall() to hook bts_init() uprobes: Remove redundant __GFP_NOWARN selftests/seccomp: validate uprobe syscall passes through seccomp seccomp: passthrough uprobe systemcall without filtering selftests/bpf: Fix uprobe syscall shadow stack test selftests/bpf: Change test_uretprobe_regs_change for uprobe and uretprobe selftests/bpf: Add uprobe_regs_equal test selftests/bpf: Add optimized usdt variant for basic usdt test ...	2025-09-30 11:11:21 -07:00
Linus Torvalds	6c7340a7a8	Merge tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core scheduler changes: - Make migrate_{en,dis}able() inline, to improve performance (Menglong Dong) - Move STDL_INIT() functions out-of-line (Peter Zijlstra) - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra) Fair scheduling: - Defer throttling to when tasks exit to user-space, to reduce the chance & impact of throttle-preemption with held locks and other resources (Aaron Lu, Valentin Schneider) - Get rid of sched_domains_curr_level hack for tl->cpumask(), as the warning was getting triggered on certain topologies (Peter Zijlstra) Misc cleanups & fixes: - Header cleanups (Menglong Dong) - Fix race in push_dl_task() (Harshit Agarwal)" * tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix some typos in include/linux/preempt.h sched: Make migrate_{en,dis}able() inline rcu: Replace preempt.h with sched.h in include/linux/rcupdate.h arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c sched/fair: Do not balance task to a throttled cfs_rq sched/fair: Do not special case tasks in throttled hierarchy sched/fair: update_cfs_group() for throttled cfs_rqs sched/fair: Propagate load for throttled cfs_rq sched/fair: Get rid of throttled_lb_pair() sched/fair: Task based throttle time accounting sched/fair: Switch to task based throttle model sched/fair: Implement throttle task work and related helpers sched/fair: Add related data structure for task based throttle sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig sched: Move STDL_INIT() functions out-of-line sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask() sched/deadline: Fix race in push_dl_task()	2025-09-30 10:35:11 -07:00
Linus Torvalds	755fa5b4fb	Merge tag 'cgroup-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Extensive cpuset code cleanup and refactoring work with no functional changes: CPU mask computation logic refactoring, introducing new helpers, removing redundant code paths, and improving error handling for better maintainability. - A few bug fixes to cpuset including fixes for partition creation failures when isolcpus is in use, missing error returns, and null pointer access prevention in free_tmpmasks(). - Core cgroup changes include replacing the global percpu_rwsem with per-threadgroup rwsem when writing to cgroup.procs for better scalability, workqueue conversions to use WQ_PERCPU and system_percpu_wq to prepare for workqueue default switching from percpu to unbound, and removal of unused code including the post_attach callback. - New cgroup.stat.local time accounting feature that tracks frozen time duration. - Misc changes including selftests updates (new freezer time tests and backward compatibility fixes), documentation sync, string function safety improvements, and 64-bit division fixes. * tag 'cgroup-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (39 commits) cpuset: remove is_prs_invalid helper cpuset: remove impossible warning in update_parent_effective_cpumask cpuset: remove redundant special case for null input in node mask update cpuset: fix missing error return in update_cpumask cpuset: Use new excpus for nocpu error check when enabling root partition cpuset: fix failure to enable isolated partition when containing isolcpus Documentation: cgroup-v2: Sync manual toctree cpuset: use partition_cpus_change for setting exclusive cpus cpuset: use parse_cpulist for setting cpus.exclusive cpuset: introduce partition_cpus_change cpuset: refactor cpus_allowed_validate_change cpuset: refactor out validate_partition cpuset: introduce cpus_excl_conflict and mems_excl_conflict helpers cpuset: refactor CPU mask buffer parsing logic cpuset: Refactor exclusive CPU mask computation logic cpuset: change return type of is_partition_[in]valid to bool cpuset: remove unused assignment to trialcs->partition_root_state cpuset: move the root cpuset write check earlier cgroup/cpuset: Remove redundant rcu_read_lock/unlock() in spin_lock cgroup: Remove redundant rcu_read_lock/unlock() in spin_lock ...	2025-09-30 09:55:41 -07:00
Linus Torvalds	77fc3f6696	Merge tag 'wq-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue updates from Tejun Heo: - WQ_PERCPU was added to remaining alloc_workqueue() users and system_wq usage was replaced with system_percpu_wq and system_unbound_wq with system_dfl_wq. These are equivalent conversions with no functional changes, preparing for switching default to unbound workqueues from percpu. - A handshake mechanism was added for canceling BH workers to avoid live lock scenarios under PREEMPT_RT. - Unnecessary rcu_read_lock/unlock() calls were dropped in wq_watchdog_timer_fn() and workqueue_congested(). - Documentation was fixed to resolve texinfodocs warnings. * tag 'wq-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix texinfodocs warning for WQ_* flags reference workqueue: WQ_PERCPU added to alloc_workqueue users workqueue: replace use of system_wq with system_percpu_wq workqueue: replace use of system_unbound_wq with system_dfl_wq workqueue: Provide a handshake for canceling BH workers workqueue: Remove rcu_read_lock/unlock() in wq_watchdog_timer_fn() workqueue: Remove redundant rcu_read_lock/unlock() in workqueue_congested()	2025-09-30 09:31:09 -07:00
Linus Torvalds	a23cd25bae	Merge tag 'sched_ext-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Code organization cleanup. Separate internal types and accessors to ext_internal.h to reduce the size of ext.c and improve maintainability. - Prepare for cgroup sub-scheduler support by adding @sch parameter to various functions and helpers, reorganizing scheduler instance handling, and dropping obsolete helpers like scx_kf_exit() and kf_cpu_valid(). - Add new scx_bpf_cpu_curr() and scx_bpf_locked_rq() BPF helpers to provide safer access patterns with proper RCU protection. scx_bpf_cpu_rq() is deprecated with warnings due to potential race conditions. - Improve debugging with migration-disabled counter in error state dumps, SCX_EFLAG_INITIALIZED flag, bitfields for warning flags, and other enhancements to help diagnose issues. - Use cgroup_lock/unlock() for cgroup synchronization instead of scx_cgroup_rwsem based synchronization. This is simpler and allows enable/disable paths to synchronize against cgroup changes independent of the CPU controller. - rhashtable_lookup() replacement to avoid redundant RCU locking was reverted due to RCU usage warnings. Will be redone once rhashtable is updated to use rcu_dereference_all(). - Other misc updates and fixes including bypass handling improvements, scx_task_iter_relock() improvements, tools/sched_ext updates, and compatibility helpers. * tag 'sched_ext-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (28 commits) Revert "sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()" sched_ext: Misc updates around scx_sched instance pointer sched_ext: Drop scx_kf_exit() and scx_kf_error() sched_ext: Add the @sch parameter to scx_dsq_insert_preamble/commit() sched_ext: Drop kf_cpu_valid() sched_ext: Add the @sch parameter to ext_idle helpers sched_ext: Add the @sch parameter to __bstr_format() sched_ext: Separate out scx_kick_cpu() and add @sch to it tools/sched_ext: scx_qmap: Make debug output quieter by default sched_ext: Make qmap dump operation non-destructive sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init() sched_ext: Use bitfields for boolean warning flags sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq() sched_ext: Improve SCX_KF_DISPATCH comment sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() sched_ext: Verify RCU protection in scx_bpf_cpu_curr() sched_ext: Add migration-disabled counter to error state dump sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warning tools/sched_ext: Add compat helper for scx_bpf_cpu_curr() sched_ext: deprecation warn for scx_bpf_cpu_rq() ...	2025-09-30 09:05:07 -07:00
Linus Torvalds	56a0810d8c	Merge tag 'audit-pr-20250926' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: - Proper audit support for multiple LSMs As the audit subsystem predated the work to enable multiple LSMs, some additional work was needed to support logging the different LSM labels for the subjects/tasks and objects on the system. Casey's patches add new auxillary records for subjects and objects that convey the additional labels. - Ensure fanotify audit events are always generated Generally speaking security relevant subsystems always generate audit events, unless explicitly ignored. However, up to this point fanotify events had been ignored by default, but starting with this pull request fanotify follows convention and generates audit events by default. - Replace an instance of strcpy() with strscpy() - Minor indentation, style, and comment fixes * tag 'audit-pr-20250926' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: fix skb leak when audit rate limit is exceeded audit: init ab->skb_list earlier in audit_buffer_alloc() audit: add record for multiple object contexts audit: add record for multiple task security contexts lsm: security_lsmblob_to_secctx module selection audit: create audit_stamp structure audit: add a missing tab audit: record fanotify event regardless of presence of rules audit: fix typo in auditfilter.c comment audit: Replace deprecated strcpy() with strscpy() audit: fix indentation in audit_log_exit()	2025-09-30 08:22:16 -07:00
Linus Torvalds	417552999d	Merge tag 'powerpc-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Madhavan Srinivasan: - powerpc support for BPF arena and arena atomics - Patches to switch to msi parent domain (per-device MSI domains) - Add a lock contention tracepoint in the queued spinlock slowpath - Fixes for underflow in pseries/powernv msi and pci paths - Switch from legacy-of-mm-gpiochip dependency to platform driver - Fixes for handling TLB misses - Introduce support for powerpc papr-hvpipe - Add vpa-dtl PMU driver for pseries platform - Misc fixes and cleanups Thanks to Aboorva Devarajan, Aditya Bodkhe, Andrew Donnellan, Athira Rajeev, Cédric Le Goater, Christophe Leroy, Erhard Furtner, Gautam Menghani, Geert Uytterhoeven, Haren Myneni, Hari Bathini, Joe Lawrence, Kajol Jain, Kienan Stewart, Linus Walleij, Mahesh Salgaonkar, Nam Cao, Nicolas Schier, Nysal Jan K.A., Ritesh Harjani (IBM), Ruben Wauters, Saket Kumar Bhaskar, Shashank MS, Shrikanth Hegde, Tejas Manhas, Thomas Gleixner, Thomas Huth, Thorsten Blum, Tyrel Datwyler, and Venkat Rao Bagalkote. * tag 'powerpc-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (49 commits) powerpc/pseries: Define __u{8,32} types in papr_hvpipe_hdr struct genirq/msi: Remove msi_post_free() powerpc/perf/vpa-dtl: Add documentation for VPA dispatch trace log PMU powerpc/perf/vpa-dtl: Handle the writing of perf record when aux wake up is needed powerpc/perf/vpa-dtl: Add support to capture DTL data in aux buffer powerpc/perf/vpa-dtl: Add support to setup and free aux buffer for capturing DTL data docs: ABI: sysfs-bus-event_source-devices-vpa-dtl: Document sysfs event format entries for vpa_dtl pmu powerpc/vpa_dtl: Add interface to expose vpa dtl counters via perf powerpc/time: Expose boot_tb via accessor powerpc/32: Remove PAGE_KERNEL_TEXT to fix startup failure powerpc/fprobe: fix updated fprobe for function-graph tracer powerpc/ftrace: support CONFIG_FUNCTION_GRAPH_RETVAL powerpc64/modules: replace stub allocation sentinel with an explicit counter powerpc64/modules: correctly iterate over stubs in setup_ftrace_ool_stubs powerpc/ftrace: ensure ftrace record ops are always set for NOPs powerpc/603: Really copy kernel PGD entries into all PGDIRs powerpc/8xx: Remove left-over instruction and comments in DataStoreTLBMiss handler powerpc/pseries: HVPIPE changes to support migration powerpc/pseries: Enable hvpipe with ibm,set-system-parameter RTAS powerpc/pseries: Enable HVPIPE event message interrupt ...	2025-09-29 19:28:50 -07:00
Linus Torvalds	feafee2845	Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "There's good stuff across the board, including some nice mm improvements for CPUs with the 'noabort' BBML2 feature and a clever patch to allow ptdump to play nicely with block mappings in the vmalloc area. Confidential computing: - Add support for accepting secrets from firmware (e.g. ACPI CCEL) and mapping them with appropriate attributes. CPU features: - Advertise atomic floating-point instructions to userspace - Extend Spectre workarounds to cover additional Arm CPU variants - Extend list of CPUs that support break-before-make level 2 and guarantee not to generate TLB conflict aborts for changes of mapping granularity (BBML2_NOABORT) - Add GCS support to our uprobes implementation. Documentation: - Remove bogus SME documentation concerning register state when entering/exiting streaming mode. Entry code: - Switch over to the generic IRQ entry code (GENERIC_IRQ_ENTRY) - Micro-optimise syscall entry path with a compiler branch hint. Memory management: - Enable huge mappings in vmalloc space even when kernel page-table dumping is enabled - Tidy up the types used in our early MMU setup code - Rework rodata= for closer parity with the behaviour on x86 - For CPUs implementing BBML2_NOABORT, utilise block mappings in the linear map even when rodata= applies to virtual aliases - Don't re-allocate the virtual region between '_text' and '_stext', as doing so confused tools parsing /proc/vmcore. Miscellaneous: - Clean-up Kconfig menuconfig text for architecture features - Avoid redundant bitmap_empty() during determination of supported SME vector lengths - Re-enable warnings when building the 32-bit vDSO object - Avoid breaking our eggs at the wrong end. Perf and PMUs: - Support for v3 of the Hisilicon L3C PMU - Support for Hisilicon's MN and NoC PMUs - Support for Fujitsu's Uncore PMU - Support for SPE's extended event filtering feature - Preparatory work to enable data source filtering in SPE - Support for multiple lanes in the DWC PCIe PMU - Support for i.MX94 in the IMX DDR PMU driver - MAINTAINERS update (Thank you, Yicong) - Minor driver fixes (PERF_IDX2OFF() overflow, CMN register offsets). Selftests: - Add basic LSFE check to the existing hwcaps test - Support nolibc in GCS tests - Extend SVE ptrace test to pass unsupported regsets and invalid vector lengths - Minor cleanups (typos, cosmetic changes). System registers: - Fix ID_PFR1_EL1 definition - Fix incorrect signedness of some fields in ID_AA64MMFR4_EL1 - Sync TCR_EL1 definition with the latest Arm ARM (L.b) - Be stricter about the input fed into our AWK sysreg generator script - Typo fixes and removal of redundant definitions. ACPI, EFI and PSCI: - Decouple Arm's "Software Delegated Exception Interface" (SDEI) support from the ACPI GHES code so that it can be used by platforms booted with device-tree - Remove unnecessary per-CPU tracking of the FPSIMD state across EFI runtime calls - Fix a node refcount imbalance in the PSCI device-tree code. CPU Features: - Ensure register sanitisation is applied to fields in ID_AA64MMFR4 - Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM guests can reliably query the underlying CPU types from the VMM - Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes to our context-switching, signal handling and ptrace code" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits) arm64: cpufeature: Remove duplicate asm/mmu.h header arm64: Kconfig: Make CPU_BIG_ENDIAN depend on BROKEN perf/dwc_pcie: Fix use of uninitialized variable arm/syscalls: mark syscall invocation as likely in invoke_syscall Documentation: hisi-pmu: Add introduction to HiSilicon V3 PMU Documentation: hisi-pmu: Fix of minor format error drivers/perf: hisi: Add support for L3C PMU v3 drivers/perf: hisi: Refactor the event configuration of L3C PMU drivers/perf: hisi: Extend the field of tt_core drivers/perf: hisi: Extract the event filter check of L3C PMU drivers/perf: hisi: Simplify the probe process of each L3C PMU version drivers/perf: hisi: Export hisi_uncore_pmu_isr() drivers/perf: hisi: Relax the event ID check in the framework perf: Fujitsu: Add the Uncore PMU driver arm64: map [_text, _stext) virtual address range non-executable+read-only arm64/sysreg: Update TCR_EL1 register arm64: Enable vmalloc-huge with ptdump arm64: cpufeature: add Neoverse-V3AE to BBML2 allow list arm64: errata: Apply workarounds for Neoverse-V3AE arm64: cputype: Add Neoverse-V3AE definitions ...	2025-09-29 18:48:39 -07:00
Linus Torvalds	a5ba183bde	Merge tag 'hardening-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "One notable addition is the creation of the 'transitional' keyword for kconfig so CONFIG renaming can go more smoothly. This has been a long-standing deficiency, and with the renaming of CONFIG_CFI_CLANG to CONFIG_CFI (since GCC will soon have KCFI support), this came up again. The breadth of the diffstat is mainly this renaming. - Clean up usage of TRAILING_OVERLAP() (Gustavo A. R. Silva) - lkdtm: fortify: Fix potential NULL dereference on kmalloc failure (Junjie Cao) - Add str_assert_deassert() helper (Lad Prabhakar) - gcc-plugins: Remove TODO_verify_il for GCC >= 16 - kconfig: Fix BrokenPipeError warnings in selftests - kconfig: Add transitional symbol attribute for migration support - kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI" * tag 'hardening-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: lib/string_choices: Add str_assert_deassert() helper kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI kconfig: Add transitional symbol attribute for migration support kconfig: Fix BrokenPipeError warnings in selftests gcc-plugins: Remove TODO_verify_il for GCC >= 16 stddef: Introduce __TRAILING_OVERLAP() stddef: Remove token-pasting in TRAILING_OVERLAP() lkdtm: fortify: Fix potential NULL dereference on kmalloc failure	2025-09-29 17:48:27 -07:00
Linus Torvalds	a240a79d43	Merge tag 'seccomp-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp update from Kees Cook: - Fix race with WAIT_KILLABLE_RECV (Johannes Nixdorf) * tag 'seccomp-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: selftests/seccomp: Add a test for the WAIT_KILLABLE_RECV fast reply race seccomp: Fix a race with WAIT_KILLABLE_RECV if the tracer replies too fast	2025-09-29 17:44:09 -07:00
Linus Torvalds	449c2b302c	Merge tag 'vfs-6.18-rc1.async' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs async directory updates from Christian Brauner: "This contains further preparatory changes for the asynchronous directory locking scheme: - Add lookup_one_positive_killable() which allows overlayfs to perform lookup that won't block on a fatal signal - Unify the mount idmap handling in struct renamedata as a rename can only happen within a single mount - Introduce kern_path_parent() for audit which sets the path to the parent and returns a dentry for the target without holding any locks on return - Rename kern_path_locked() as it is only used to prepare for the removal of an object from the filesystem: kern_path_locked() => start_removing_path() kern_path_create() => start_creating_path() user_path_create() => start_creating_user_path() user_path_locked_at() => start_removing_user_path_at() done_path_create() => end_creating_path() NA => end_removing_path()" * tag 'vfs-6.18-rc1.async' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: debugfs: rename start_creating() to debugfs_start_creating() VFS: rename kern_path_locked() and related functions. VFS/audit: introduce kern_path_parent() for audit VFS: unify old_mnt_idmap and new_mnt_idmap in renamedata VFS: discard err2 in filename_create() VFS/ovl: add lookup_one_positive_killable()	2025-09-29 11:55:15 -07:00
Linus Torvalds	18b19abc37	Merge tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace updates from Christian Brauner: "This contains a larger set of changes around the generic namespace infrastructure of the kernel. Each specific namespace type (net, cgroup, mnt, ...) embedds a struct ns_common which carries the reference count of the namespace and so on. We open-coded and cargo-culted so many quirks for each namespace type that it just wasn't scalable anymore. So given there's a bunch of new changes coming in that area I've started cleaning all of this up. The core change is to make it possible to correctly initialize every namespace uniformly and derive the correct initialization settings from the type of the namespace such as namespace operations, namespace type and so on. This leaves the new ns_common_init() function with a single parameter which is the specific namespace type which derives the correct parameters statically. This also means the compiler will yell as soon as someone does something remotely fishy. The ns_common_init() addition also allows us to remove ns_alloc_inum() and drops any special-casing of the initial network namespace in the network namespace initialization code that Linus complained about. Another part is reworking the reference counting. The reference counting was open-coded and copy-pasted for each namespace type even though they all followed the same rules. This also removes all open accesses to the reference count and makes it private and only uses a very small set of dedicated helpers to manipulate them just like we do for e.g., files. In addition this generalizes the mount namespace iteration infrastructure introduced a few cycles ago. As reminder, the vfs makes it possible to iterate sequentially and bidirectionally through all mount namespaces on the system or all mount namespaces that the caller holds privilege over. This allow userspace to iterate over all mounts in all mount namespaces using the listmount() and statmount() system call. Each mount namespace has a unique identifier for the lifetime of the systems that is exposed to userspace. The network namespace also has a unique identifier working exactly the same way. This extends the concept to all other namespace types. The new nstree type makes it possible to lookup namespaces purely by their identifier and to walk the namespace list sequentially and bidirectionally for all namespace types, allowing userspace to iterate through all namespaces. Looking up namespaces in the namespace tree works completely locklessly. This also means we can move the mount namespace onto the generic infrastructure and remove a bunch of code and members from struct mnt_namespace itself. There's a bunch of stuff coming on top of this in the future but for now this uses the generic namespace tree to extend a concept introduced first for pidfs a few cycles ago. For a while now we have supported pidfs file handles for pidfds. This has proven to be very useful. This extends the concept to cover namespaces as well. It is possible to encode and decode namespace file handles using the common name_to_handle_at() and open_by_handle_at() apis. As with pidfs file handles, namespace file handles are exhaustive, meaning it is not required to actually hold a reference to nsfs in able to decode aka open_by_handle_at() a namespace file handle. Instead the FD_NSFS_ROOT constant can be passed which will let the kernel grab a reference to the root of nsfs internally and thus decode the file handle. Namespaces file descriptors can already be derived from pidfds which means they aren't subject to overmount protection bugs. IOW, it's irrelevant if the caller would not have access to an appropriate /proc/<pid>/ns/ directory as they could always just derive the namespace based on a pidfd already. It has the same advantage as pidfds. It's possible to reliably and for the lifetime of the system refer to a namespace without pinning any resources and to compare them trivially. Permission checking is kept simple. If the caller is located in the namespace the file handle refers to they are able to open it otherwise they must hold privilege over the owning namespace of the relevant namespace. The namespace file handle layout is exposed as uapi and has a stable and extensible format. For now it simply contains the namespace identifier, the namespace type, and the inode number. The stable format means that userspace may construct its own namespace file handles without going through name_to_handle_at() as they are already allowed for pidfs and cgroup file handles" * tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits) ns: drop assert ns: move ns type into struct ns_common nstree: make struct ns_tree private ns: add ns_debug() ns: simplify ns_common_init() further cgroup: add missing ns_common include ns: use inode initializer for initial namespaces selftests/namespaces: verify initial namespace inode numbers ns: rename to __ns_ref nsfs: port to ns_ref_() helpers net: port to ns_ref_() helpers uts: port to ns_ref_() helpers ipv4: use check_net() net: use check_net() net-sysfs: use check_net() user: port to ns_ref_() helpers time: port to ns_ref_() helpers pid: port to ns_ref_() helpers ipc: port to ns_ref_() helpers cgroup: port to ns_ref_() helpers ...	2025-09-29 11:20:29 -07:00
Linus Torvalds	722df25ddf	Merge tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull copy_process updates from Christian Brauner: "This contains the changes to enable support for clone3() on nios2 which apparently is still a thing. The more exciting part of this is that it cleans up the inconsistency in how the 64-bit flag argument is passed from copy_process() into the various other copy_() helpers" [ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ] tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nios2: implement architecture-specific portion of sys_clone3 arch: copy_thread: pass clone_flags as u64 copy_process: pass clone_flags as u64 across calltree copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)	2025-09-29 10:36:50 -07:00
Linus Torvalds	e571372101	Merge tag 'vfs-6.18-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: "This just contains a few changes to pid_nr_ns() to make it more robust and cleans up or improves a few users that ab- or misuse it currently" * tag 'vfs-6.18-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pid: change task_state() to use task_ppid_nr_ns() pid: change bacct_add_tsk() to use task_ppid_nr_ns() pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers pid: Add a judgment for ns null in pid_nr_ns	2025-09-29 10:02:35 -07:00
Linus Torvalds	b7ce6fa90f	Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add "initramfs_options" parameter to set initramfs mount options. This allows to add specific mount options to the rootfs to e.g., limit the memory size - Add RWF_NOSIGNAL flag for pwritev2() Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets - Allow to pass pid namespace as procfs mount option Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed) This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs) But this requires forking off a helper process because you cannot just one-shot this using mount(2) * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process) While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate Things would be much easier if there was a way for userspace to just specify the pidns they want. So this pull request contains changes to implement a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); Cleanups: - Remove the last references to EXPORT_OP_ASYNC_LOCK - Make file_remove_privs_flags() static - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used - Use try_cmpxchg() in start_dir_add() - Use try_cmpxchg() in sb_init_done_wq() - Replace offsetof() with struct_size() in ioctl_file_dedupe_range() - Remove vfs_ioctl() export - Replace rwlock() with spinlock in epoll code as rwlock causes priority inversion on preempt rt kernels - Make ns_entries in fs/proc/namespaces const - Use a switch() statement() in init_special_inode() just like we do in may_open() - Use struct_size() in dir_add() in the initramfs code - Use str_plural() in rd_load_image() - Replace strcpy() with strscpy() in find_link() - Rename generic_delete_inode() to inode_just_drop() and generic_drop_inode() to inode_generic_drop() - Remove unused arguments from fcntl_{g,s}et_rw_hint() Fixes: - Document @name parameter for name_contains_dotdot() helper - Fix spelling mistake - Always return zero from replace_fd() instead of the file descriptor number - Limit the size for copy_file_range() in compat mode to prevent a signed overflow - Fix debugfs mount options not being applied - Verify the inode mode when loading it from disk in minixfs - Verify the inode mode when loading it from disk in cramfs - Don't trigger automounts with RESOLVE_NO_XDEV If openat2() was called with RESOLVE_NO_XDEV it didn't traverse through automounts, but could still trigger them - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in tracepoints - Fix unused variable warning in rd_load_image() on s390 - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD - Use ns_capable_noaudit() when determining net sysctl permissions - Don't call path_put() under namespace semaphore in listmount() and statmount()" * tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) fcntl: trim arguments listmount: don't call path_put() under namespace semaphore statmount: don't call path_put() under namespace semaphore pid: use ns_capable_noaudit() when determining net sysctl permissions fs: rename generic_delete_inode() and generic_drop_inode() init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD initramfs: Replace strcpy() with strscpy() in find_link() initrd: Use str_plural() in rd_load_image() initramfs: Use struct_size() helper to improve dir_add() initrd: Fix unused variable warning in rd_load_image() on s390 fs: use the switch statement in init_special_inode() fs/proc/namespaces: make ns_entries const filelock: add FL_RECLAIM to show_fl_flags() macro eventpoll: Replace rwlock with spinlock selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts ...	2025-09-29 09:03:07 -07:00
Linus Torvalds	8f9736633f	Merge tag 'trace-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix buffer overflow in osnoise_cpu_write() The allocated buffer to read user space did not add a nul terminating byte after copying from user the string. It then reads the string, and if user space did not add a nul byte, the read will continue beyond the string. Add a nul terminating byte after reading the string. - Fix missing check for lockdown on tracing There's a path from kprobe events or uprobe events that can update the tracing system even if lockdown on tracing is activate. Add a check in the dynamic event path. - Add a recursion check for the function graph return path Now that fprobes can hook to the function graph tracer and call different code between the entry and the exit, the exit code may now call functions that are not called in entry. This means that the exit handler can possibly trigger recursion that is not caught and cause the system to crash. Add the same recursion checks in the function exit handler as exists in the entry handler path. * tag 'trace-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: fgraph: Protect return handler from recursion loop tracing: dynevent: Add a missing lockdown check on dynevent tracing/osnoise: Fix slab-out-of-bounds in _parse_integer_limit()	2025-09-28 10:26:35 -07:00
Masami Hiramatsu (Google)	0db0934e7f	tracing: fgraph: Protect return handler from recursion loop function_graph_enter_regs() prevents itself from recursion by ftrace_test_recursion_trylock(), but __ftrace_return_to_handler(), which is called at the exit, does not prevent such recursion. Therefore, while it can prevent recursive calls from fgraph_ops::entryfunc(), it is not able to prevent recursive calls to fgraph from fgraph_ops::retfunc(), resulting in a recursive loop. This can lead an unexpected recursion bug reported by Menglong. is_endbr() is called in __ftrace_return_to_handler -> fprobe_return -> kprobe_multi_link_exit_handler -> is_endbr. To fix this issue, acquire ftrace_test_recursion_trylock() in the __ftrace_return_to_handler() after unwind the shadow stack to mark this section must prevent recursive call of fgraph inside user-defined fgraph_ops::retfunc(). This is essentially a fix to commit `4346ba1604` ("fprobe: Rewrite fprobe on function-graph tracer"), because before that fgraph was only used from the function graph tracer. Fprobe allowed user to run any callbacks from fgraph after that commit. Reported-by: Menglong Dong <menglong8.dong@gmail.com> Closes: https://lore.kernel.org/all/20250918120939.1706585-1-dongml2@chinatelecom.cn/ Fixes: `4346ba1604` ("fprobe: Rewrite fprobe on function-graph tracer") Cc: stable@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/175852292275.307379.9040117316112640553.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Menglong Dong <menglong8.dong@gmail.com> Acked-by: Menglong Dong <menglong8.dong@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-09-27 09:04:05 -04:00
Linus Torvalds	083fc6d7fa	Merge tag 'sched-urgent-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Fix two dl_server regressions: a race that can end up leaving the dl_server stuck, and a dl_server throttling bug causing lag to fair tasks" * tag 'sched-urgent-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix dl_server behaviour sched/deadline: Fix dl_server getting stuck	2025-09-26 12:30:23 -07:00
Linus Torvalds	2cea0ed979	Merge tag 'locking-urgent-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "Fix a PI-futexes race, and fix a copy_process() futex cleanup bug" * tag 'locking-urgent-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Use correct exit on failure from futex_hash_allocate_default() futex: Prevent use-after-free during requeue-PI	2025-09-26 12:28:32 -07:00
Linus Torvalds	93a2744561	Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio fixes from Michael Tsirkin: "virtio,vhost: last minute fixes More small fixes. Most notably this fixes crashes and hangs in vhost-net" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: MAINTAINERS, mailmap: Update address for Peter Hilber virtio_config: clarify output parameters uapi: vduse: fix typo in comment vhost: Take a reference on the task in struct vhost_task. vhost-net: flush batched before enabling notifications Revert "vhost/net: Defer TX queue re-enable until after sendmsg" vhost-net: unbreak busy polling vhost-scsi: fix argument order in tport allocation error message	2025-09-25 08:06:03 -07:00
Menglong Dong	378b770819	sched: Make migrate_{en,dis}able() inline For now, migrate_enable and migrate_disable are global, which makes them become hotspots in some case. Take BPF for example, the function calling to migrate_enable and migrate_disable in BPF trampoline can introduce significant overhead, and following is the 'perf top' of FENTRY's benchmark (./tools/testing/selftests/bpf/bench trig-fentry): 54.63% bpf_prog_2dcccf652aac1793_bench_trigger_fentry [k] bpf_prog_2dcccf652aac1793_bench_trigger_fentry 10.43% [kernel] [k] migrate_enable 10.07% bpf_trampoline_6442517037 [k] bpf_trampoline_6442517037 8.06% [kernel] [k] __bpf_prog_exit_recur 4.11% libc.so.6 [.] syscall 2.15% [kernel] [k] entry_SYSCALL_64 1.48% [kernel] [k] memchr_inv 1.32% [kernel] [k] fput 1.16% [kernel] [k] _copy_to_user 0.73% [kernel] [k] bpf_prog_test_run_raw_tp So in this commit, we make migrate_enable/migrate_disable inline to obtain better performance. The struct rq is defined internally in kernel/sched/sched.h, and the field "nr_pinned" is accessed in migrate_enable/migrate_disable, which makes it hard to make them inline. Alexei Starovoitov suggests to generate the offset of "nr_pinned" in [1], so we can define the migrate_enable/migrate_disable in include/linux/sched.h and access "this_rq()->nr_pinned" with "(void *)this_rq() + RQ_nr_pinned". The offset of "nr_pinned" is generated in include/generated/rq-offsets.h by kernel/sched/rq-offsets.c. Generally speaking, we move the definition of migrate_enable and migrate_disable to include/linux/sched.h from kernel/sched/core.c. The calling to __set_cpus_allowed_ptr() is leaved in ___migrate_enable(). The "struct rq" is not available in include/linux/sched.h, so we can't access the "runqueues" with this_cpu_ptr(), as the compilation will fail in this_cpu_ptr() -> raw_cpu_ptr() -> __verify_pcpu_ptr(): typeof((ptr) + 0) So we introduce the this_rq_raw() and access the runqueues with arch_raw_cpu_ptr/PERCPU_PTR directly. The variable "runqueues" is not visible in the kernel modules, and export it is not a good idea. As Peter Zijlstra advised in [2], we define and export migrate_enable/migrate_disable in kernel/sched/core.c too, and use them for the modules. Before this patch, the performance of BPF FENTRY is: fentry : 113.030 ± 0.149M/s fentry : 112.501 ± 0.187M/s fentry : 112.828 ± 0.267M/s fentry : 115.287 ± 0.241M/s After this patch, the performance of BPF FENTRY increases to: fentry : 143.644 ± 0.670M/s fentry : 149.764 ± 0.362M/s fentry : 149.642 ± 0.156M/s fentry : 145.263 ± 0.221M/s Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/bpf/CAADnVQ+5sEDKHdsJY5ZsfGDO_1SEhhQWHrt2SMBG5SYyQ+jt7w@mail.gmail.com/ [1] Link: https://lore.kernel.org/all/20250819123214.GH4067720@noisy.programming.kicks-ass.net/ [2]	2025-09-25 09:57:16 +02:00
Peter Zijlstra	a3a70caf79	sched/deadline: Fix dl_server behaviour John reported undesirable behaviour with the dl_server since commit: `cccb45d7c4` ("sched/deadline: Less agressive dl_server handling"). When starving fair tasks on purpose (starting spinning FIFO tasks), his fair workload, which often goes (briefly) idle, would delay fair invocations for a second, running one invocation per second was both unexpected and terribly slow. The reason this happens is that when dl_se->server_pick_task() returns NULL, indicating no runnable tasks, it would yield, pushing any later jobs out a whole period (1 second). Instead simply stop the server. This should restore behaviour in that a later wakeup (which restarts the server) will be able to continue running (subject to the CBS wakeup rules). Notably, this does not re-introduce the behaviour `cccb45d7c4` set out to solve, any start/stop cycle is naturally throttled by the timer period (no active cancel). Fixes: `cccb45d7c4` ("sched/deadline: Less agressive dl_server handling") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: John Stultz <jstultz@google.com>	2025-09-25 09:51:50 +02:00
Peter Zijlstra	4ae8d9aa9f	sched/deadline: Fix dl_server getting stuck John found it was easy to hit lockup warnings when running locktorture on a 2 CPU VM, which he bisected down to: commit `cccb45d7c4` ("sched/deadline: Less agressive dl_server handling"). While debugging it seems there is a chance where we end up with the dl_server dequeued, with dl_se->dl_server_active. This causes dl_server_start() to return without enqueueing the dl_server, thus it fails to run when RT tasks starve the cpu. When this happens, dl_server_timer() catches the '!dl_se->server_has_tasks(dl_se)' case, which then calls replenish_dl_entity() and dl_server_stopped() and finally return HRTIMER_NO_RESTART. This ends in no new timer and also no enqueue, leaving the dl_server 'dead', allowing starvation. What should have happened is for the bandwidth timer to start the zero-laxity timer, which in turn would enqueue the dl_server and cause dl_se->server_pick_task() to be called -- which will stop the dl_server if no fair tasks are observed for a whole period. IOW, it is totally irrelevant if there are fair tasks at the moment of bandwidth refresh. This removes all dl_se->server_has_tasks() users, so remove the whole thing. Fixes: `cccb45d7c4` ("sched/deadline: Less agressive dl_server handling") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: John Stultz <jstultz@google.com>	2025-09-25 09:51:50 +02:00
Christian Brauner	af075603f2	ns: drop assert Otherwise we warn when e.g., no namespaces are configured but the initial namespace for is still around. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-25 09:23:54 +02:00
Christian Brauner	4055526d35	ns: move ns type into struct ns_common It's misplaced in struct proc_ns_operations and ns->ops might be NULL if the namespace is compiled out but we still want to know the type of the namespace for the initial namespace struct. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-25 09:23:54 +02:00
Christian Brauner	10cdfcd37a	nstree: make struct ns_tree private Don't expose it directly. There's no need to do that. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-25 09:23:47 +02:00
Linus Torvalds	bf40f4b877	Merge tag 'probes-fixes-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - fprobe: Even if there is a memory allocation failure, try to remove the addresses recorded until then from the filter. Previously we just skipped it. - tracing: dynevent: Add a missing lockdown check on dynevent. This dynevent is the interface for all probe events. Thus if there is no check, any probe events can be added after lock down the tracefs. * tag 'probes-fixes-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: dynevent: Add a missing lockdown check on dynevent tracing: fprobe: Fix to remove recorded module addresses from filter	2025-09-24 19:17:07 -07:00
Kees Cook	23ef9d4397	kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI The kernel's CFI implementation uses the KCFI ABI specifically, and is not strictly tied to a particular compiler. In preparation for GCC supporting KCFI, rename CONFIG_CFI_CLANG to CONFIG_CFI (along with associated options). Use new "transitional" Kconfig option for old CONFIG_CFI_CLANG that will enable CONFIG_CFI during olddefconfig. Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20250923213422.1105654-3-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>	2025-09-24 14:29:14 -07:00
Will Deacon	4e4e36dce3	Merge branch 'for-next/uprobes' into for-next/core * for-next/uprobes: arm64: probes: Fix incorrect bl/blr address and register usage uprobes: uprobe_warn should use passed task arm64: Kconfig: Remove GCS restrictions on UPROBES arm64: uprobes: Add GCS support to uretprobes arm64: probes: Add GCS support to bl/blr/ret arm64: uaccess: Add additional userspace GCS accessors arm64: uaccess: Move existing GCS accessors definitions to gcs.h arm64: probes: Break ret out from bl/blr	2025-09-24 16:35:06 +01:00
Masami Hiramatsu (Google)	456c32e3c4	tracing: dynevent: Add a missing lockdown check on dynevent Since dynamic_events interface on tracefs is compatible with kprobe_events and uprobe_events, it should also check the lockdown status and reject if it is set. Link: https://lore.kernel.org/all/175824455687.45175.3734166065458520748.stgit@devnote2/ Fixes: `17911ff38a` ("tracing: Add locked_down checks to the open calls of files created for tracefs") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: stable@vger.kernel.org	2025-09-25 00:22:46 +09:00
Masami Hiramatsu (Google)	c539feff3c	tracing: fprobe: Fix to remove recorded module addresses from filter Even if there is a memory allocation failure in fprobe_addr_list_add(), there is a partial list of module addresses. So remove the recorded addresses from filter if exists. This also removes the redundant ret local variable. Fixes: `a3dc2983ca` ("tracing: fprobe: Cleanup fprobe hash when module unloading") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: stable@vger.kernel.org Reviewed-by: Menglong Dong <menglong8.dong@gmail.com>	2025-09-24 23:18:26 +09:00
Sebastian Andrzej Siewior	4ec3c15462	futex: Use correct exit on failure from futex_hash_allocate_default() copy_process() uses the wrong error exit path from futex_hash_allocate_default(). After exiting from futex_hash_allocate_default(), neither tasklist_lock nor siglock has been acquired. The exit label bad_fork_core_free unlocks both of these locks which is wrong. The next exit label, bad_fork_cancel_cgroup, is the correct exit. sched_cgroup_fork() did not allocate any resources that need to freed. Use bad_fork_cancel_cgroup on error exit from futex_hash_allocate_default(). Fixes: `7c4f75a21f` ("futex: Allow automatic allocation of process wide futex hash") Reported-by: syzbot+80cb3cc5c14fad191a10@syzkaller.appspotmail.com Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Closes: https://lore.kernel.org/all/68cb1cbd.050a0220.2ff435.0599.GAE@google.com	2025-09-24 09:20:02 +02:00
Tejun Heo	df10932ad7	Revert "sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()" This reverts commit `c8191ee8e6` which triggers the following suspicious RCU usage warning: [ 6.647598] ============================= [ 6.647603] WARNING: suspicious RCU usage [ 6.647605] 6.17.0-rc7-virtme #1 Not tainted [ 6.647608] ----------------------------- [ 6.647608] ./include/linux/rhashtable.h:602 suspicious rcu_dereference_check() usage! [ 6.647610] [ 6.647610] other info that might help us debug this: [ 6.647610] [ 6.647612] [ 6.647612] rcu_scheduler_active = 2, debug_locks = 1 [ 6.647613] 1 lock held by swapper/10/0: [ 6.647614] #0: ffff8b14bbb3cc98 (&rq->__lock){-.-.}-{2:2}, at: +raw_spin_rq_lock_nested+0x20/0x90 [ 6.647630] [ 6.647630] stack backtrace: [ 6.647633] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.17.0-rc7-virtme #1 +PREEMPT(full) [ 6.647643] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 6.647646] Sched_ext: beerland_1.0.2_g27d63fc3_x86_64_unknown_linux_gnu (enabled+all) [ 6.647648] Call Trace: [ 6.647652] <IRQ> [ 6.647655] dump_stack_lvl+0x78/0xe0 [ 6.647665] lockdep_rcu_suspicious+0x14a/0x1b0 [ 6.647672] __rhashtable_lookup.constprop.0+0x1d5/0x250 [ 6.647680] find_dsq_for_dispatch+0xbc/0x190 [ 6.647684] do_enqueue_task+0x25b/0x550 [ 6.647689] enqueue_task_scx+0x21d/0x360 [ 6.647692] ? trace_lock_acquire+0x22/0xb0 [ 6.647695] enqueue_task+0x2e/0xd0 [ 6.647698] ttwu_do_activate+0xa2/0x290 [ 6.647703] sched_ttwu_pending+0xfd/0x250 [ 6.647706] __flush_smp_call_function_queue+0x1cd/0x610 [ 6.647714] __sysvec_call_function_single+0x34/0x150 [ 6.647720] sysvec_call_function_single+0x6e/0x80 [ 6.647726] </IRQ> [ 6.647726] <TASK> [ 6.647727] asm_sysvec_call_function_single+0x1a/0x20 Reported-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 20:38:23 -10:00
Tejun Heo	ebfd5226ec	sched_ext: Merge branch 'for-6.17-fixes' into for-6.18 Pull sched_ext/for-6.17-fixes to receive: `55ed11b181` ("sched_ext: idle: Handle migration-disabled tasks in BPF code") which conflicts with the following commit in for-6.18: `2407bae23d` ("sched_ext: Add the @sch parameter to ext_idle helpers") The conflict is a simple context conflict which can be resolved by taking the updated parts from both commits. Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 09:10:20 -10:00
Tejun Heo	c0008a5632	sched_ext: Misc updates around scx_sched instance pointer In preparation for multiple scheduler support: - Add the @sch parameter to find_global_dsq() and refill_task_slice_dfl(). - Restructure scx_allow_ttwu_queue() and make it read scx_root into $sch. - Make RCU protection in scx_dsq_move() and scx_bpf_dsq_move_to_local() explicit. v2: Add scx_root -> sch conversion in scx_allow_ttwu_queue(). Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 09:03:26 -10:00
Tejun Heo	d4f7d86666	sched_ext: Drop scx_kf_exit() and scx_kf_error() The intention behind scx_kf_exit/error() was that when called from kfuncs, scx_kf_exit/error() would be able to implicitly determine the scx_sched instance being operated on and thus wouldn't need the @sch parameter passed in explicitly. This turned out to be unnecessarily complicated to implement and not have enough practical benefits. Replace scx_kf_exit/error() usages with scx_exit/error() which take an explicit @sch parameter. - Add the @sch parameter to scx_kf_allowed(), scx_kf_allowed_on_arg_tasks, mark_direct_dispatch() and other intermediate functions transitively. - In callers that don't already have @sch available, grab RCU, read $scx_root, verify it's not NULL and use it. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 09:03:26 -10:00
Tejun Heo	4d9553fee3	sched_ext: Add the @sch parameter to scx_dsq_insert_preamble/commit() In preparation for multiple scheduler support, add the @sch parameter to scx_dsq_insert_preamble/commit() and update the callers to read $scx_root and pass it in. The passed in @sch parameter is not used yet. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 09:03:26 -10:00
Tejun Heo	956f2b11a8	sched_ext: Drop kf_cpu_valid() The intention behind kf_cpu_valid() was that when called from kfuncs, kf_cpu_valid() would be able to implicitly determine the scx_sched instance being operated on and thus wouldn't need @sch passed in explicitly. This turned out to be unnecessarily complicated to implement and not have justifiable practical benefits. Replace kf_cpu_valid() usages with ops_cpu_valid() which takes explicit @sch. Callers which don't have $sch available in the context are updated to read $scx_root under RCU read lock, verify that it's not NULL and pass it in. scx_bpf_cpu_rq() is restructured to use guard(rcu)() instead of explicit rcu_read_[un]lock(). Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 09:03:26 -10:00
Tejun Heo	2407bae23d	sched_ext: Add the @sch parameter to ext_idle helpers In preparation for multiple scheduler support, add the @sch parameter to validate_node(), check_builtin_idle_enabled() and select_cpu_from_kfunc(), and update their callers to read $scx_root, verify that it's not NULL and pass it in. The passed in @sch parameter is not used yet. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 09:03:26 -10:00
Tejun Heo	fc6a93aa62	sched_ext: Add the @sch parameter to __bstr_format() In preparation for multiple scheduler support, add the @sch parameter to __bstr_format() and update the callers to read $scx_root, verify that it's not NULL and pass it in. The passed in @sch parameter is not used yet. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 09:03:26 -10:00
Tejun Heo	9fc687edf2	sched_ext: Separate out scx_kick_cpu() and add @sch to it In preparation for multiple scheduler support, separate out scx_kick_cpu() from scx_bpf_kick_cpu() and add the @sch parameter to it. scx_bpf_kick_cpu() now acquires an RCU read lock, reads $scx_root, and calls scx_kick_cpu() with it if non-NULL. The passed in @sch parameter is not used yet. Internal uses of scx_bpf_kick_cpu() are converted to scx_kick_cpu(). Where $sch is available, it's used. In the pick_task_scx() path where no associated scheduler can be identified, $scx_root is used directly. Note that $scx_root cannot be NULL in this case. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 09:03:26 -10:00
Tejun Heo	f3aec2adce	sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init() ops.exit() may be called even if the loading failed before ops.init() finishes successfully. This is because ops.exit() allows rich exit info communication. Add SCX_EFLAG_INITIALIZED flag to scx_exit_info.flags to indicate whether ops.init() finished successfully. This enables BPF schedulers to distinguish between exit scenarios and handle cleanup appropriately based on initialization state. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 09:03:26 -10:00
Tejun Heo	c7e739746d	sched_ext: Use bitfields for boolean warning flags Convert warned_zero_slice and warned_deprecated_rq in scx_sched struct to single-bit bitfields. While this doesn't reduce struct size immediately, it prepares for future bitfield additions. v2: Update patch description. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 09:03:26 -10:00
Tejun Heo	f75efc8f4c	sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq() task_can_run_on_remote_rq() takes @sch but it is using scx_root when incrementing SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which is inconsistent and gets in the way of implementing multiple scheduler support. Use @sch instead. As currently scx_root is the only possible scheduler instance, this doesn't cause any behavior changes. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 09:03:26 -10:00
Tejun Heo	c8191ee8e6	sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() The find_user_dsq() function is called from contexts that are already under RCU read lock protection. Switch from rhashtable_lookup_fast() to rhashtable_lookup() to avoid redundant RCU locking. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 09:03:25 -10:00
Andrea Righi	340de1f673	sched_ext: Verify RCU protection in scx_bpf_cpu_curr() scx_bpf_cpu_curr() has been introduced to retrieve the current task of a given runqueue, allowing schedulers to interact with that task. The kfunc assumes that it is always called in an RCU context, but this is not always guaranteed and some BPF schedulers can trigger the following warning: WARNING: suspicious RCU usage sched_ext: BPF scheduler "cosmos_1.0.2_gd0e71ca_x86_64_unknown_linux_gnu_debug" enabled 6.17.0-rc1 #1-NixOS Not tainted ----------------------------- kernel/sched/ext.c:6415 suspicious rcu_dereference_check() usage! ... Call Trace: <IRQ> dump_stack_lvl+0x6f/0xb0 lockdep_rcu_suspicious.cold+0x4e/0x96 scx_bpf_cpu_curr+0x7e/0x80 bpf_prog_c68b2b6b6b1b0ff8_sched_timerfn+0xce/0x1dc bpf_timer_cb+0x7b/0x130 __hrtimer_run_queues+0x1ea/0x380 hrtimer_run_softirq+0x8c/0xd0 handle_softirqs+0xc9/0x3b0 __irq_exit_rcu+0x96/0xc0 irq_exit_rcu+0xe/0x20 sysvec_apic_timer_interrupt+0x73/0x80 </IRQ> <TASK> To address this, mark the kfunc with KF_RCU_PROTECTED, so the verifier can enforce its usage only inside RCU-protected sections. Note: this also requires commit `1512231b6c` ("bpf: Enforce RCU protection for KF_RCU_PROTECTED"), currently in bpf-next, to enforce the proper KF_RCU_PROTECTED. Fixes: `20b158094a` ("sched_ext: Introduce scx_bpf_cpu_curr()") Cc: Christian Loehle <christian.loehle@arm.com> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 05:09:40 -10:00

1 2 3 4 5 ...

49286 Commits