linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-03-21 23:16:50 +08:00

Author	SHA1	Message	Date
Paolo Bonzini	94fe3e6515	Merge tag 'kvm-x86-generic-7.0-rc3' of https://github.com/kvm-x86/linux into HEAD KVM generic changes for 7.0 - Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from being unnecessary and confusing, triggered compiler warnings due to -Wflex-array-member-not-at-end. - Document that vcpu->mutex is take outside of kvm->slots_lock and kvm->slots_arch_lock, which is intentional and desirable despite being rather unintuitive.	2026-03-11 18:01:55 +01:00
Paolo Bonzini	40c2ffcac0	Merge tag 'kvm-riscv-fixes-7.0-1' of https://github.com/kvm-riscv/linux into HEAD KVM/riscv fixes for 7.0, take #1 - Prevent speculative out-of-bounds access using array_index_nospec() in APLIC interrupt handling, ONE_REG regiser access, AIA CSR access, float register access, and PMU counter access - Fix potential use-after-free issues in kvm_riscv_gstage_get_leaf(), kvm_riscv_aia_aplic_has_attr(), and kvm_riscv_aia_imsic_has_attr() - Fix potential null pointer dereference in kvm_riscv_vcpu_aia_rmw_topei() - Fix off-by-one array access in SBI PMU - Skip THP support check during dirty logging - Fix error code returned for Smstateen and Ssaia ONE_REG interface - Check host Ssaia extension when creating AIA irqchip	2026-03-11 18:01:03 +01:00
Paolo Bonzini	de353e3fcc	Merge tag 'kvmarm-fixes-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 7.0, take #2 - Fix a couple of low-severity bugs in our S2 fault handling path, affecting the recently introduced LS64 handling and the even more esoteric handling of hwpoison in a nested context - Address yet another syzkaller finding in the vgic initialisation, were we would end-up destroying an uninitialised vgic, with nasty consequences - Address an annoying case of pKVM failing to boot when some of the memblock regions that the host is faulting in are not page-aligned - Inject some sanity in the NV stage-2 walker by checking the limits against the advertised PA size, and correctly report the resulting faults - Drop an unnecessary ISB when emulating an EL2 S1 address translation	2026-03-11 18:00:54 +01:00
Zenghui Yu (Huawei)	3599c714c0	KVM: arm64: Remove the redundant ISB in __kvm_at_s1e2() We already have an ISB in __kvm_at() to make the address translation result visible to subsequent reads of PAR_EL1. Remove the redundant one right after it. Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev> Link: https://patch.msgid.link/20260306074422.47694-1-zenghui.yu@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-06 10:42:21 +00:00
Fuad Tabba	244acf1976	KVM: arm64: Fix vma_shift staleness on nested hwpoison path When user_mem_abort() handles a nested stage-2 fault, it truncates vma_pagesize to respect the guest's mapping size. However, the local variable vma_shift is never updated to match this new size. If the underlying host page turns out to be hardware poisoned, kvm_send_hwpoison_signal() is called with the original, larger vma_shift instead of the actual mapping size. This signals incorrect poison boundaries to userspace and breaks hugepage memory poison containment for nested VMs. Update vma_shift to match the truncated vma_pagesize when operating on behalf of a nested hypervisor. Fixes: `fd276e71d1` ("KVM: arm64: nv: Handle shadow stage 2 page faults") Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260304162222.836152-3-tabba@google.com [maz: simplified vma_shift assignment from the original patch] Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-06 10:42:02 +00:00
Anup Patel	c61ec3e8cc	RISC-V: KVM: Check host Ssaia extension when creating AIA irqchip The KVM user-space may create KVM AIA irqchip before checking VCPU Ssaia extension availability so KVM AIA irqchip must fail when host does not have Ssaia extension. Fixes: `89d01306e3` ("RISC-V: KVM: Implement device interface for AIA irqchip") Signed-off-by: Anup Patel <anup.patel@oss.qualcomm.com> Link: https://lore.kernel.org/r/20260120080013.2153519-4-anup.patel@oss.qualcomm.com Signed-off-by: Anup Patel <anup@brainfault.org>	2026-03-06 11:20:30 +05:30
Anup Patel	24433b2b5c	RISC-V: KVM: Fix error code returned for Ssaia ONE_REG Return -ENOENT for Ssaia ONE_REG when Ssaia is not enabled for a VCPU. This will make Ssaia ONE_REG error codes consistent with other ONE_REG interfaces of KVM RISC-V. Fixes: `2a88f38cd5` ("RISC-V: KVM: return ENOENT in *_one_reg() when reg is unknown") Signed-off-by: Anup Patel <anup.patel@oss.qualcomm.com> Link: https://lore.kernel.org/r/20260120080013.2153519-3-anup.patel@oss.qualcomm.com Signed-off-by: Anup Patel <anup@brainfault.org>	2026-03-06 11:20:30 +05:30
Anup Patel	45700a743a	RISC-V: KVM: Fix error code returned for Smstateen ONE_REG Return -ENOENT for Smstateen ONE_REG when: 1) Smstateen is not enabled for a VCPU 2) ONE_REG id is out of range This will make Smstateen ONE_REG error codes consistent with other ONE_REG interfaces of KVM RISC-V. Fixes: `c04913f2b5` ("RISCV: KVM: Add sstateen0 to ONE_REG") Signed-off-by: Anup Patel <anup.patel@oss.qualcomm.com> Link: https://lore.kernel.org/r/20260120080013.2153519-2-anup.patel@oss.qualcomm.com Signed-off-by: Anup Patel <anup@brainfault.org>	2026-03-06 11:20:30 +05:30
Lukas Gerlach	2dda6a9e09	KVM: riscv: Fix Spectre-v1 in PMU counter access Guest-controlled counter indices received via SBI ecalls are used to index into the PMC array. Sanitize them with array_index_nospec() to prevent speculative out-of-bounds access. Similar to x86 commit `13c5183a4e` ("KVM: x86: Protect MSR-based index computations in pmu.h from Spectre-v1/L1TF attacks"). Fixes: `8f0153ecd3` ("RISC-V: KVM: Add skeleton support for perf") Reviewed-by: Radim Krčmář <radim.krcmar@oss.qualcomm.com> Signed-off-by: Lukas Gerlach <lukas.gerlach@cispa.de> Link: https://lore.kernel.org/r/20260303-kvm-riscv-spectre-v1-v2-4-192caab8e0dc@cispa.de Signed-off-by: Anup Patel <anup@brainfault.org>	2026-03-06 11:20:30 +05:30
Lukas Gerlach	8f0c15c4b1	KVM: riscv: Fix Spectre-v1 in floating-point register access User-controlled indices are used to index into floating-point registers. Sanitize them with array_index_nospec() to prevent speculative out-of-bounds access. Reviewed-by: Radim Krčmář <radim.krcmar@oss.qualcomm.com> Signed-off-by: Lukas Gerlach <lukas.gerlach@cispa.de> Link: https://lore.kernel.org/r/20260303-kvm-riscv-spectre-v1-v2-3-192caab8e0dc@cispa.de Signed-off-by: Anup Patel <anup@brainfault.org>	2026-03-06 11:20:30 +05:30
Lukas Gerlach	ec87a82ca8	KVM: riscv: Fix Spectre-v1 in AIA CSR access User-controlled indices are used to access AIA CSR registers. Sanitize them with array_index_nospec() to prevent speculative out-of-bounds access. Similar to x86 commit `8c86405f60` ("KVM: x86: Protect ioapic_read_indirect() from Spectre-v1/L1TF attacks") and arm64 commit `41b87599c7` ("KVM: arm/arm64: vgic: fix possible spectre-v1 in vgic_get_irq()"). Reviewed-by: Radim Krčmář <radim.krcmar@oss.qualcomm.com> Signed-off-by: Lukas Gerlach <lukas.gerlach@cispa.de> Link: https://lore.kernel.org/r/20260303-kvm-riscv-spectre-v1-v2-2-192caab8e0dc@cispa.de Signed-off-by: Anup Patel <anup@brainfault.org>	2026-03-06 11:20:30 +05:30
Lukas Gerlach	f9e26fc325	KVM: riscv: Fix Spectre-v1 in ONE_REG register access User-controlled register indices from the ONE_REG ioctl are used to index into arrays of register values. Sanitize them with array_index_nospec() to prevent speculative out-of-bounds access. Reviewed-by: Radim Krčmář <radim.krcmar@oss.qualcomm.com> Signed-off-by: Lukas Gerlach <lukas.gerlach@cispa.de> Link: https://lore.kernel.org/r/20260303-kvm-riscv-spectre-v1-v2-1-192caab8e0dc@cispa.de Signed-off-by: Anup Patel <anup@brainfault.org>	2026-03-06 11:20:30 +05:30
Wang Yechao	b342166cbc	RISC-V: KVM: Skip THP support check during dirty logging When dirty logging is enabled, guest stage mappings are forced to PAGE_SIZE granularity. Changing the mapping page size at this point is incorrect. Fixes: `ed7ae7a34b` ("RISC-V: KVM: Transparent huge page support") Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20260226191231140_X1Juus7s2kgVlc0ZyW_K@zte.com.cn Signed-off-by: Anup Patel <anup@brainfault.org>	2026-03-06 11:20:30 +05:30
Jiakai Xu	7120a9d9e0	RISC-V: KVM: Fix potential UAF in kvm_riscv_aia_imsic_has_attr() The KVM_DEV_RISCV_AIA_GRP_APLIC branch of aia_has_attr() was identified to have a race condition with concurrent KVM_SET_DEVICE_ATTR ioctls, leading to a use-after-free bug. Upon analyzing the code, it was discovered that the KVM_DEV_RISCV_AIA_GRP_IMSIC branch of aia_has_attr() suffers from the same lack of synchronization. It invokes kvm_riscv_aia_imsic_has_attr() without holding dev->kvm->lock. While aia_has_attr() is running, a concurrent aia_set_attr() could call aia_init() under the dev->kvm->lock. If aia_init() fails, it may trigger kvm_riscv_vcpu_aia_imsic_cleanup(), which frees imsic_state. Without proper locking, kvm_riscv_aia_imsic_has_attr() could attempt to access imsic_state while it is being deallocated. Although this specific path has not yet been reported by a fuzzer, it is logically identical to the APLIC issue. Fix this by acquiring the dev->kvm->lock before calling kvm_riscv_aia_imsic_has_attr(), ensuring consistency with the locking pattern used for other AIA attribute groups. Fixes: `5463091a51` ("RISC-V: KVM: Expose IMSIC registers as attributes of AIA irqchip") Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn> Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20260304080804.2281721-1-xujiakai2025@iscas.ac.cn Signed-off-by: Anup Patel <anup@brainfault.org>	2026-03-06 11:20:30 +05:30
Jiakai Xu	721ead7757	RISC-V: KVM: Fix use-after-free in kvm_riscv_aia_aplic_has_attr() Fuzzer reports a KASAN use-after-free bug triggered by a race between KVM_HAS_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls on the AIA device. The root cause is that aia_has_attr() invokes kvm_riscv_aia_aplic_has_attr() without holding dev->kvm->lock, while a concurrent aia_set_attr() may call aia_init() under that lock. When aia_init() fails after kvm_riscv_aia_aplic_init() has succeeded, it calls kvm_riscv_aia_aplic_cleanup() in its fail_cleanup_imsics path, which frees both aplic_state and aplic_state->irqs. The concurrent has_attr path can then dereference the freed aplic->irqs in aplic_read_pending(): irqd = &aplic->irqs[irq]; /* UAF here */ KASAN report: BUG: KASAN: slab-use-after-free in aplic_read_pending arch/riscv/kvm/aia_aplic.c:119 [inline] BUG: KASAN: slab-use-after-free in aplic_read_pending_word arch/riscv/kvm/aia_aplic.c:351 [inline] BUG: KASAN: slab-use-after-free in aplic_mmio_read_offset arch/riscv/kvm/aia_aplic.c:406 Read of size 8 at addr ff600000ba965d58 by task 9498 Call Trace: aplic_read_pending arch/riscv/kvm/aia_aplic.c:119 [inline] aplic_read_pending_word arch/riscv/kvm/aia_aplic.c:351 [inline] aplic_mmio_read_offset arch/riscv/kvm/aia_aplic.c:406 kvm_riscv_aia_aplic_has_attr arch/riscv/kvm/aia_aplic.c:566 aia_has_attr arch/riscv/kvm/aia_device.c:469 allocated by task 9473: kvm_riscv_aia_aplic_init arch/riscv/kvm/aia_aplic.c:583 aia_init arch/riscv/kvm/aia_device.c:248 [inline] aia_set_attr arch/riscv/kvm/aia_device.c:334 freed by task 9473: kvm_riscv_aia_aplic_cleanup arch/riscv/kvm/aia_aplic.c:644 aia_init arch/riscv/kvm/aia_device.c:292 [inline] aia_set_attr arch/riscv/kvm/aia_device.c:334 Fix this race by acquiring dev->kvm->lock in aia_has_attr() before calling kvm_riscv_aia_aplic_has_attr(), consistent with the locking pattern used in aia_get_attr() and aia_set_attr(). Fixes: `289a007b98` ("RISC-V: KVM: Expose APLIC registers as attributes of AIA irqchip") Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com> Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20260302132703.1721415-1-xujiakai2025@iscas.ac.cn Signed-off-by: Anup Patel <anup@brainfault.org>	2026-03-06 11:20:30 +05:30
Radim Krčmář	5c1bb07871	RISC-V: KVM: fix off-by-one array access in SBI PMU The indexed array only has RISCV_KVM_MAX_COUNTERS elements. The out-of-bound access could have been performed by a guest, but it could only access another guest accessible data. Fixes: `8f0153ecd3` ("RISC-V: KVM: Add skeleton support for perf") Signed-off-by: Radim Krčmář <radim.krcmar@oss.qualcomm.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20260227134617.23378-1-radim.krcmar@oss.qualcomm.com Signed-off-by: Anup Patel <anup@brainfault.org>	2026-03-06 11:20:30 +05:30
Jiakai Xu	c28eb189e4	RISC-V: KVM: Fix null pointer dereference in kvm_riscv_vcpu_aia_rmw_topei() kvm_riscv_vcpu_aia_rmw_topei() assumes that the per-vCPU IMSIC state has been initialized once AIA is reported as available and initialized at the VM level. This assumption does not always hold. Under fuzzed ioctl sequences, a guest may access the IMSIC TOPEI CSR before the vCPU IMSIC state is set up. In this case, vcpu->arch.aia_context.imsic_state is still NULL, and the TOPEI RMW path dereferences it unconditionally, leading to a host kernel crash. The crash manifests as: Unable to handle kernel paging request at virtual address dfffffff0000000e ... kvm_riscv_vcpu_aia_imsic_rmw arch/riscv/kvm/aia_imsic.c:909 kvm_riscv_vcpu_aia_rmw_topei arch/riscv/kvm/aia.c:231 csr_insn arch/riscv/kvm/vcpu_insn.c:208 system_opcode_insn arch/riscv/kvm/vcpu_insn.c:281 kvm_riscv_vcpu_virtual_insn arch/riscv/kvm/vcpu_insn.c:355 kvm_riscv_vcpu_exit arch/riscv/kvm/vcpu_exit.c:230 kvm_arch_vcpu_ioctl_run arch/riscv/kvm/vcpu.c:1008 ... Fix this by explicitly checking whether the vCPU IMSIC state has been initialized before handling TOPEI CSR accesses. If not, forward the CSR emulation to user space. Fixes: `db8b7e97d6` ("RISC-V: KVM: Add in-kernel virtualization of AIA IMSIC") Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn> Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com> Reviewed-by: Nutty Liu <nutty.liu@hotmail.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20260226085119.643295-1-xujiakai2025@iscas.ac.cn Signed-off-by: Anup Patel <anup@brainfault.org>	2026-03-06 11:20:30 +05:30
Jiakai Xu	dec9ed9944	RISC-V: KVM: Fix use-after-free in kvm_riscv_gstage_get_leaf() While fuzzing KVM on RISC-V, a use-after-free was observed in kvm_riscv_gstage_get_leaf(), where ptep_get() dereferences a freed gstage page table page during gfn unmap. The crash manifests as: use-after-free in ptep_get include/linux/pgtable.h:340 [inline] use-after-free in kvm_riscv_gstage_get_leaf arch/riscv/kvm/gstage.c:89 Call Trace: ptep_get include/linux/pgtable.h:340 [inline] kvm_riscv_gstage_get_leaf+0x2ea/0x358 arch/riscv/kvm/gstage.c:89 kvm_riscv_gstage_unmap_range+0xf0/0x308 arch/riscv/kvm/gstage.c:265 kvm_unmap_gfn_range+0x168/0x1fc arch/riscv/kvm/mmu.c:256 kvm_mmu_unmap_gfn_range virt/kvm/kvm_main.c:724 [inline] page last free pid 808 tgid 808 stack trace: kvm_riscv_mmu_free_pgd+0x1b6/0x26a arch/riscv/kvm/mmu.c:457 kvm_arch_flush_shadow_all+0x1a/0x24 arch/riscv/kvm/mmu.c:134 kvm_flush_shadow_all virt/kvm/kvm_main.c:344 [inline] The UAF is caused by gstage page table walks running concurrently with gstage pgd teardown. In particular, kvm_unmap_gfn_range() can traverse gstage page tables while kvm_arch_flush_shadow_all() frees the pgd, leading to use-after-free of page table pages. Fix the issue by serializing gstage unmap and pgd teardown with kvm->mmu_lock. Holding mmu_lock ensures that gstage page tables remain valid for the duration of unmap operations and prevents concurrent frees. This matches existing RISC-V KVM usage of mmu_lock to protect gstage map/unmap operations, e.g. kvm_riscv_mmu_iounmap. Fixes: `dd82e35638` ("RISC-V: KVM: Factor-out g-stage page table management") Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn> Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20260202040059.1801167-1-xujiakai2025@iscas.ac.cn Signed-off-by: Anup Patel <anup@brainfault.org>	2026-03-06 11:20:30 +05:30
Lukas Gerlach	8565617a85	KVM: riscv: Fix Spectre-v1 in APLIC interrupt handling Guests can control IRQ indices via MMIO. Sanitize them with array_index_nospec() to prevent speculative out-of-bounds access to the aplic->irqs[] array. Similar to arm64 commit `41b87599c7` ("KVM: arm/arm64: vgic: fix possible spectre-v1 in vgic_get_irq()") and x86 commit `8c86405f60` ("KVM: x86: Protect ioapic_read_indirect() from Spectre-v1/L1TF attacks"). Fixes: `74967aa208` ("RISC-V: KVM: Add in-kernel emulation of AIA APLIC") Signed-off-by: Lukas Gerlach <lukas.gerlach@cispa.de> Reviewed-by: Nutty Liu <nutty.liu@hotmail.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20260116095731.24555-1-lukas.gerlach@cispa.de Signed-off-by: Anup Patel <anup@brainfault.org>	2026-03-06 11:20:30 +05:30
Fuad Tabba	e07fc9e2da	KVM: arm64: Fix page leak in user_mem_abort() on atomic fault When a guest performs an atomic/exclusive operation on memory lacking the required attributes, user_mem_abort() injects a data abort and returns early. However, it fails to release the reference to the host page acquired via __kvm_faultin_pfn(). A malicious guest could repeatedly trigger this fault, leaking host page references and eventually causing host memory exhaustion (OOM). Fix this by consolidating the early error returns to a new out_put_page label that correctly calls kvm_release_page_unused(). Fixes: `2937aeec9d` ("KVM: arm64: Handle DABT caused by LS64* instructions on unsupported memory") Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Yuan Yao <yaoyuan@linux.alibaba.com> Link: https://patch.msgid.link/20260304162222.836152-2-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-05 16:23:30 +00:00
Zenghui Yu (Huawei)	eb54fa1025	KVM: arm64: nv: Inject a SEA if failed to read the descriptor Failure to read the descriptor (because it is outside of a memslot) should result in a SEA being injected in the guest. Suggested-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/86ms1m9lp3.wl-maz@kernel.org Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev> Link: https://patch.msgid.link/20260225173515.20490-4-zenghui.yu@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-05 15:46:48 +00:00
Zenghui Yu (Huawei)	99a339377f	KVM: arm64: nv: Report addrsz fault at level 0 with a bad VTTBR.BADDR As per R_BFHQH, " When an Address size fault is generated, the reported fault code indicates one of the following: If the fault was generated due to the TTBR_ELx used in the translation having nonzero address bits above the OA size, then a fault at level 0. " Fix the reported Address size fault level as being 0 if the base address is wrongly programmed by L1. Fixes: `61e30b9eef` ("KVM: arm64: nv: Implement nested Stage-2 page table walk logic") Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev> Link: https://patch.msgid.link/20260225173515.20490-3-zenghui.yu@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-05 15:46:48 +00:00
Zenghui Yu (Huawei)	4c2264ecdf	KVM: arm64: nv: Check S2 limits based on implemented PA size check_base_s2_limits() checks the validity of SL0 and inputsize against ia_size (inputsize again!) but the pseudocode from DDI0487 G.a AArch64.TranslationTableWalk() says that we should check against the implemented PA size. We would otherwise fail to walk S2 with a valid configuration. E.g., granule size = 4KB, inputsize = 40 bits, initial lookup level = 0 (no concatenation) on a system with 48 bits PA range supported is allowed by architecture. Fix it by obtaining PA size by kvm_get_pa_bits(). Note that kvm_get_pa_bits() returns the fixed limit now and should eventually reflect the per VM PARange (one day!). Given that the configured PARange should not be greater that kvm_ipa_limit, it at least fixes the problem described above. While at it, inject a level 0 translation fault to guest if check_base_s2_limits() fails, as per the pseudocode. Fixes: `61e30b9eef` ("KVM: arm64: nv: Implement nested Stage-2 page table walk logic") Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev> Link: https://patch.msgid.link/20260225173515.20490-2-zenghui.yu@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-05 15:46:47 +00:00
Marc Zyngier	8531d5a83d	KVM: arm64: pkvm: Fallback to level-3 mapping on host stage-2 fault If, for any odd reason, we cannot converge to mapping size that is completely contained in a memblock region, we fail to install a S2 mapping and go back to the faulting instruction. Rince, repeat. This happens when faulting in regions that are smaller than a page or that do not have PAGE_SIZE-aligned boundaries (as witnessed on an O6 board that refuses to boot in protected mode). In this situation, fallback to using a PAGE_SIZE mapping anyway -- it isn't like we can go any lower. Fixes: `e728e70580` ("KVM: arm64: Adjust range correctly during host stage-2 faults") Link: https://lore.kernel.org/r/86wlzr77cn.wl-maz@kernel.org Cc: stable@vger.kernel.org Cc: Quentin Perret <qperret@google.com> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://patch.msgid.link/20260305132751.2928138-1-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-03-05 15:18:47 +00:00
Marc Zyngier	ac6769c8f9	KVM: arm64: Eagerly init vgic dist/redist on vgic creation If vgic_allocate_private_irqs_locked() fails for any odd reason, we exit kvm_vgic_create() early, leaving dist->rd_regions uninitialised. kvm_vgic_dist_destroy() then comes along and walks into the weeds trying to free the RDs. Got to love this stuff. Solve it by moving all the static initialisation early, and make sure that if we fail halfway, we're in a reasonable shape to perform the rest of the teardown. While at it, reset the vgic model on failure, just in case... Reported-by: syzbot+f6a46b038fc243ac0175@syzkaller.appspotmail.com Tested-by: syzbot+f6a46b038fc243ac0175@syzkaller.appspotmail.com Fixes: `b3aa9283c0` ("KVM: arm64: vgic: Hoist SGI/PPI alloc from vgic_init() to kvm_create_vgic()") Link: https://lore.kernel.org/r/69a2d58c.050a0220.3a55be.003b.GAE@google.com Link: https://patch.msgid.link/20260228164559.936268-1-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org	2026-03-05 15:18:38 +00:00
Sean Christopherson	f8211e95df	Documentation: KVM: Formalizing taking vcpu->mutex outside of kvm->slots_lock Explicitly document the ordering of vcpu->mutex being taken outside of kvm->slots_lock. While somewhat unintuitive since vCPUs conceptually have narrower scope than VMs, the scope of the owning object (vCPU versus VM) doesn't automatically carry over to the lock. In this case, vcpu->mutex has far broader scope than kvm->slots_lock. As Paolo put it, it's a "don't worry about multiple ioctls at the same time" mutex that's intended to be taken at the outer edges of KVM. More importantly, arm64 and x86 have gained flows that take kvm->slots_lock inside of vcpu->mutex. x86's kvm_inhibit_apic_access_page() is particularly nasty, as slots_lock is taken quite deep within KVM_RUN, i.e. simply swapping the ordering isn't an option. Commit to the vcpu->mutex => kvm->slots_lock ordering, as vcpu->mutex really is intended to be a "top-level" lock, whereas kvm->slots_lock is "just" a helper lock. Opportunistically document that vcpu->mutex is also taken outside of slots_arch_lock, e.g. when allocating shadow roots on x86 (which is the entire reason slots_arch_lock exists, as shadow roots must be allocated while holding kvm->srcu) kvm_mmu_new_pgd() \| -> kvm_mmu_reload() \| -> kvm_mmu_load() \| -> mmu_alloc_shadow_roots() \| -> mmu_first_shadow_root_alloc() but also when manipulating memslots in vCPU context, e.g. when inhibiting the APIC-access page via the aforementioned kvm_inhibit_apic_access_page() kvm_inhibit_apic_access_page() \| -> __x86_set_memory_region() \| -> kvm_set_internal_memslot() \| -> kvm_set_memory_region() \| -> kvm_set_memslot() Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Marc Zyngier <maz@kernel.org> Link: https://patch.msgid.link/20260302170239.596810-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-02 09:52:09 -08:00
Linus Torvalds	11439c4635	Linux 7.0-rc2 v7.0-rc2	2026-03-01 15:39:31 -08:00
Linus Torvalds	949d0a46ad	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "Arm: - Make sure we don't leak any S1POE state from guest to guest when the feature is supported on the HW, but not enabled on the host - Propagate the ID registers from the host into non-protected VMs managed by pKVM, ensuring that the guest sees the intended feature set - Drop double kern_hyp_va() from unpin_host_sve_state(), which could bite us if we were to change kern_hyp_va() to not being idempotent - Don't leak stage-2 mappings in protected mode - Correctly align the faulting address when dealing with single page stage-2 mappings for PAGE_SIZE > 4kB - Fix detection of virtualisation-capable GICv5 IRS, due to the maintainer being obviously fat fingered... [his words, not mine] - Remove duplication of code retrieving the ASID for the purpose of S1 PT handling - Fix slightly abusive const-ification in vgic_set_kvm_info() Generic: - Remove internal Kconfigs that are now set on all architectures - Remove per-architecture code to enable KVM_CAP_SYNC_MMU, all architectures finally enable it in Linux 7.0" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: always define KVM_CAP_SYNC_MMU KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIER KVM: arm64: Deduplicate ASID retrieval code irqchip/gic-v5: Fix inversion of IRS_IDR0.virt flag KVM: arm64: Revert accidental drop of kvm_uninit_stage2_mmu() for non-NV VMs KVM: arm64: Fix protected mode handling of pages larger than 4kB KVM: arm64: vgic: Handle const qualifier from gic_kvm_info allocation type KVM: arm64: Remove redundant kern_hyp_va() in unpin_host_sve_state() KVM: arm64: Fix ID register initialization for non-protected pKVM guests KVM: arm64: Optimise away S1POE handling when not supported by host KVM: arm64: Hide S1POE from guests when not supported by the host	2026-03-01 15:34:47 -08:00
Linus Torvalds	e2bd1b1369	Merge tag 'core-debugobjects-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull debugobjects fix from Thomas Gleixner: "A single fix for debugobjects. The deferred page initialization prevents debug objects from allocating slab pages until the initialization is complete. That causes depletion of the pool and disabling of debugobjects. The reason is that debugobjects uses __GFP_HIGH for allocations as it might be invoked from arbitrary contexts. When PREEMPT_COUNT is disabled there is no way to know whether the context is safe to set __GFP_KSWAPD_RECLAIM. This worked until v6.18. Since then allocations w/o a reclaim flag cause new_slab() to end up in alloc_frozen_pages_nolock_noprof(), which returns early when deferred page initialization has not yet completed. Work around that when PREEMPT_COUNT is enabled as the preempt counter allows debugobjects to add __GFP_KSWAPD_RECLAIM to the GFP flags when the context is preemtible. When PREEMPT_COUNT is disabled the context is unknown and the reclaim bit can't be set because the caller might hold locks which might deadlock in the allocator. That makes debugobjects depend on PREEMPT_COUNT \|\| !DEFERRED_STRUCT_PAGE_INIT, which limits the coverage slightly, but keeps it functional for most cases" * tag 'core-debugobjects-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: debugobject: Make it work with deferred page initialization - again	2026-03-01 13:32:32 -08:00
Linus Torvalds	5920da4455	Merge tag 'x86-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Fix speculative safety in fred_extint() - Fix __WARN_printf() trap in early_fixup_exception() - Fix clang-build boot bug for unusual alignments, triggered by CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y - Replace the final few __ASSEMBLY__ stragglers that snuck in lately into non-UAPI x86 headers and use __ASSEMBLER__ consistently (again) * tag 'x86-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/headers: Replace __ASSEMBLY__ stragglers with __ASSEMBLER__ x86/cfi: Fix CFI rewrite for odd alignments x86/bug: Handle __WARN_printf() trap in early_fixup_exception() x86/fred: Correct speculative safety in fred_extint()	2026-03-01 13:16:35 -08:00
Linus Torvalds	f6542af922	Merge tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Improve the inlining of jiffies_to_msecs() and jiffies_to_usecs(), for the common HZ=100, 250 or 1000 cases. Only use a function call for odd HZ values like HZ=300 that generate more code. The function call overhead showed up in performance tests of the TCP code" * tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time/jiffies: Inline jiffies_to_msecs() and jiffies_to_usecs()	2026-03-01 12:15:58 -08:00
Linus Torvalds	6170625149	Merge tag 'sched-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix zero_vruntime tracking when there's a single task running - Fix slice protection logic - Fix the ->vprot logic for reniced tasks - Fix lag clamping in mixed slice workloads - Fix objtool uaccess warning (and bug) in the !CONFIG_RSEQ_SLICE_EXTENSION case caused by unexpected un-inlining, which triggers with older compilers - Fix a comment in the rseq registration rseq_size bound check code - Fix a legacy RSEQ ABI quirk that handled 32-byte area sizes differently, which special size we now reached naturally and want to avoid. The visible ugliness of the new reserved field will be avoided the next time the RSEQ area is extended. * tag 'sched-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq: slice ext: Ensure rseq feature size differs from original rseq size rseq: Clarify rseq registration rseq_size bound check comment sched/core: Fix wakeup_preempt's next_class tracking rseq: Mark rseq_arm_slice_extension_timer() __always_inline sched/fair: Fix lag clamp sched/eevdf: Update se->vprot in reweight_entity() sched/fair: Only set slice protection at pick time sched/fair: Fix zero_vruntime tracking	2026-03-01 11:09:24 -08:00
Linus Torvalds	cb36eabcaf	Merge tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events fixes from Ingo Molnar: - Fix lock ordering bug found by lockdep in perf_event_wakeup() - Fix uncore counter enumeration on Granite Rapids and Sierra Forest - Fix perf_mmap() refcount bug found by Syzkaller - Fix __perf_event_overflow() vs perf_remove_from_context() race * tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix __perf_event_overflow() vs perf_remove_from_context() race perf/core: Fix refcount bug and potential UAF in perf_mmap perf/x86/intel/uncore: Add per-scheduler IMC CAS count events perf/core: Fix invalid wait context in ctx_sched_in()	2026-03-01 11:07:20 -08:00
Linus Torvalds	b410220870	Merge tag 'locking-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "Now that LLVM 22 has been released officially, require a release version to use the new CONFIG_WARN_CONTEXT_ANALYSIS feature. In particular this avoids the widely used Android clang 22.0.1 pre-release build which is known to be broken for this usecase" * tag 'locking-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: lib/Kconfig.debug: Require a release version of LLVM 22 for context analysis	2026-03-01 11:00:43 -08:00
Linus Torvalds	afa844360b	Merge tag 'irq-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irqchip driver fixes from Ingo Molnar: - Fix frozen interrupt bug in the sifive-plic driver - Limit per-device MSI interrupts on uncommon gic-v3-its hardware variants - Address Sparse warning by constifying a variable in the MMP driver - Revert broken commit and also fix an error check in the ls-extirq driver * tag 'irq-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/ls-extirq: Fix devm_of_iomap() error check Revert "irqchip/ls-extirq: Use for_each_of_imap_item iterator" irqchip/mmp: Make icu_irq_chip variable static const irqchip/gic-v3-its: Limit number of per-device MSIs to the range the ITS supports irqchip/sifive-plic: Fix frozen interrupt due to affinity setting	2026-03-01 10:58:16 -08:00
Linus Torvalds	39c6332614	Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "All changes in drivers (well technically SES is enclosure services, but its change is minor). The biggest is the write combining change in lpfc followed by the additional NULL checks in mpi3mr" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: ufs: core: Fix shift out of bounds when MAXQ=32 scsi: ufs: core: Move link recovery for hibern8 exit failure to wl_resume scsi: ufs: core: Fix possible NULL pointer dereference in ufshcd_add_command_trace() scsi: snic: MAINTAINERS: Update snic maintainers scsi: snic: Remove unused linkstatus scsi: pm8001: Fix use-after-free in pm8001_queue_command() scsi: mpi3mr: Add NULL checks when resetting request and reply queues scsi: ufs: core: Reset urgent_bkops_lvl to allow runtime PM power mode scsi: ses: Fix devices attaching to different hosts scsi: ufs: core: Fix RPMB region size detection for UFS 2.2 scsi: storvsc: Fix scheduling while atomic on PREEMPT_RT scsi: lpfc: Properly set WC for DPP mapping	2026-03-01 09:59:29 -08:00
Linus Torvalds	eb71ab2bf7	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix alignment of arm64 JIT buffer to prevent atomic tearing (Fuad Tabba) - Fix invariant violation for single value tnums in the verifier (Harishankar Vishwanathan, Paul Chaignon) - Fix a bunch of issues found by ASAN in selftests/bpf (Ihor Solodrai) - Fix race in devmpa and cpumap on PREEMPT_RT (Jiayuan Chen) - Fix show_fdinfo of kprobe_multi when cookies are not present (Jiri Olsa) - Fix race in freeing special fields in BPF maps to prevent memory leaks (Kumar Kartikeya Dwivedi) - Fix OOB read in dmabuf_collector (T.J. Mercier) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (36 commits) selftests/bpf: Avoid simplification of crafted bounds test selftests/bpf: Test refinement of single-value tnum bpf: Improve bounds when tnum has a single possible value bpf: Introduce tnum_step to step through tnum's members bpf: Fix race in devmap on PREEMPT_RT bpf: Fix race in cpumap on PREEMPT_RT selftests/bpf: Add tests for special fields races bpf: Retire rcu_trace_implies_rcu_gp() from local storage bpf: Delay freeing fields in local storage bpf: Lose const-ness of map in map_check_btf() bpf: Register dtor for freeing special fields selftests/bpf: Fix OOB read in dmabuf_collector selftests/bpf: Fix a memory leak in xdp_flowtable test bpf: Fix stack-out-of-bounds write in devmap bpf: Fix kprobe_multi cookies access in show_fdinfo callback bpf, arm64: Force 8-byte alignment for JIT buffer to prevent atomic tearing selftests/bpf: Don't override SIGSEGV handler with ASAN selftests/bpf: Check BPFTOOL env var in detect_bpftool_path() selftests/bpf: Fix out-of-bounds array access bugs reported by ASAN selftests/bpf: Fix array bounds warning in jit_disasm_helpers ...	2026-02-28 19:54:28 -08:00
Linus Torvalds	63a43faf6a	Merge tag 'driver-core-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core fixes from Danilo Krummrich: - Do not register imx_clk_scu_driver in imx8qxp_clk_probe(); besides fixing two other issues, this avoids a deadlock in combination with commit `dc23806a7c` ("driver core: enforce device_lock for driver_match_device()") - Move secondary node lookup from device_get_next_child_node() to fwnode_get_next_child_node(); this avoids issues when users switch from the device API to the fwnode API - Export io_define_{read,write}!() to avoid unused import warnings when CONFIG_PCI=n * tag 'driver-core-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: clk: scu/imx8qxp: do not register driver in probe() rust: io: macro_export io_define_read!() and io_define_write!() device property: Allow secondary lookup in fwnode_get_next_child_node()	2026-02-28 19:35:30 -08:00
Linus Torvalds	42eb017830	Merge tag 'v7.0rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - Two multichannel fixes - Locking fix for superblock flags - Fix to remove debug message that could log password - Cleanup fix for setting credentials * tag 'v7.0rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: client: Use snprintf in cifs_set_cifscreds smb: client: Don't log plaintext credentials in cifs_set_cifscreds smb: client: fix broken multichannel with krb5+signing smb: client: use atomic_t for mnt_cifs_flags smb: client: fix cifs_pick_channel when channels are equally loaded	2026-02-28 10:45:56 -08:00
Takashi Sakamoto	9197e5949a	firewire: ohci: initialize page array to use alloc_pages_bulk() correctly The call of alloc_pages_bulk() skips to fill entries of page array when the entries already have values. While, 1394 OHCI PCI driver passes the page array without initializing. It could cause invalid state at PFN validation in vmap(). Fixes: `f2ae92780a` ("firewire: ohci: split page allocation from dma mapping") Reported-by: John Ogness <john.ogness@linutronix.de> Reported-and-tested-by: Harald Arnesen <linux@skogtun.org> Reported-and-tested-by: David Gow <david@davidgow.net> Closes: https://lore.kernel.org/lkml/87tsv1vig5.fsf@jogness.linutronix.de/ Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-28 10:09:24 -08:00
Linus Torvalds	2f9339c052	Merge tag 'spi-fix-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi fixes from Mark Brown: "One fix for the stm32 driver which got broken for DMA chaining cases, plus a removal of some straggling bindings for the Bikal SoC which has been pulled out of the kernel" * tag 'spi-fix-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: spi: stm32: fix missing pointer assignment in case of dma chaining spi: dt-bindings: snps,dw-abp-ssi: Remove unused bindings	2026-02-28 09:21:18 -08:00
Linus Torvalds	463e133751	Merge tag 'regulator-fix-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fixes from Mark Brown: "A small pile of fixes, none of which are super major - the code fixes are improved error handling and fixing a leak of a device node. We also have a typo fix and an improvement to make the binding example for mt6359 more directly usable" * tag 'regulator-fix-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: regulator: Kconfig: fix a typo regulator: bq257xx: Fix device node reference leak in bq257xx_reg_dt_parse_gpio() regulator: fp9931: Fix PM runtime reference leak in fp9931_hwmon_read() regulator: tps65185: check devm_kzalloc() result in probe regulator: dt-bindings: mt6359: make regulator names unique	2026-02-28 09:18:02 -08:00
Linus Torvalds	201795a1b7	Merge tag 's390-7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Vasily Gorbik: - Fix guest pfault init to pass a physical address to DIAG 0x258, restoring pfault interrupts and avoiding vCPU stalls during host page-in - Fix kexec/kdump hangs with stack protector by marking s390_reset_system() __no_stack_protector; set_prefix(0) switches lowcore and the canary no longer matches - Fix idle/vtime cputime accounting (idle-exit ordering, vtimer double-forwarding) and small cleanups * tag 's390-7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/pfault: Fix virtual vs physical address confusion s390/kexec: Disable stack protector in s390_reset_system() s390/idle: Remove psw_idle() prototype s390/vtime: Use lockdep_assert_irqs_disabled() instead of BUG_ON() s390/vtime: Use __this_cpu_read() / get rid of READ_ONCE() s390/irq/idle: Remove psw bits early s390/idle: Inline update_timer_idle() s390/idle: Slightly optimize idle time accounting s390/idle: Add comment for non obvious code s390/vtime: Fix virtual timer forwarding s390/idle: Fix cpu idle exit cpu time accounting	2026-02-28 09:01:33 -08:00
Paolo Bonzini	55365ab85a	Merge tag 'kvmarm-fixes-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 7.0, take #1 - Make sure we don't leak any S1POE state from guest to guest when the feature is supported on the HW, but not enabled on the host - Propagate the ID registers from the host into non-protected VMs managed by pKVM, ensuring that the guest sees the intended feature set - Drop double kern_hyp_va() from unpin_host_sve_state(), which could bite us if we were to change kern_hyp_va() to not being idempotent - Don't leak stage-2 mappings in protected mode - Correctly align the faulting address when dealing with single page stage-2 mappings for PAGE_SIZE > 4kB - Fix detection of virtualisation-capable GICv5 IRS, due to the maintainer being obviously fat fingered... - Remove duplication of code retrieving the ASID for the purpose of S1 PT handling - Fix slightly abusive const-ification in vgic_set_kvm_info()	2026-02-28 15:33:34 +01:00
Paolo Bonzini	70295a479d	KVM: always define KVM_CAP_SYNC_MMU KVM_CAP_SYNC_MMU is provided by KVM's MMU notifiers, which are now always available. Move the definition from individual architectures to common code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-02-28 15:31:35 +01:00
Paolo Bonzini	407fd8b8d8	KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIER All architectures now use MMU notifier for KVM page table management. Remove the Kconfig symbol and the code that is used when it is disabled. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-02-28 15:31:35 +01:00
Alexei Starovoitov	b9c0a5c483	Merge branch 'fix-invariant-violation-for-single-value-tnums' Paul Chaignon says: ==================== Fix invariant violation for single-value tnums We're hitting an invariant violation in Cilium that sometimes leads to BPF programs being rejected and Cilium failing to start [1]. As far as I know this is the first case of invariant violation found in a real program (i.e., not by a fuzzer). The following extract from verifier logs shows what's happening: from 201 to 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0 ; if (magic == MARK_MAGIC_HOST \|\| magic == MARK_MAGIC_OVERLAY \|\| magic == MARK_MAGIC_ENCRYPT) @ bpf_host.c:1337 236: (16) if w9 == 0xe00 goto pc+45 ; R9=scalar(smin=umin=smin32=umin32=3585,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) 237: (16) if w9 == 0xf00 goto pc+1 verifier bug: REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0xe01, 0xe00] s64=[0xe01, 0xe00] u32=[0xe01, 0xe00] s32=[0xe01, 0xe00] var_off=(0xe00, 0x0) More details are given in the second patch, but in short, the verifier should be able to detect that the false branch of instruction 237 is never true. After instruction 236, the u64 range and the tnum overlap in a single value, 0xf00. The long-term solution to invariant violation is likely to rely on the refinement + invariant violation check to detect dead branches, as started by Eduard. To fix the current issue, we need something with less refactoring that we can backport to affected kernels. The solution implemented in the second patch is to improve the bounds refinement to avoid this case. It relies on a new tnum helper, tnum_step, first sent as an RFC in [2]. The last two patches extend and update the selftests. Link: https://github.com/cilium/cilium/issues/44216 [1] Link: https://lore.kernel.org/bpf/20251107192328.2190680-2-harishankar.vishwanathan@gmail.com/ [2] Changes in v3: - Fix commit description error spotted by AI bot. - Simplify constants in first two tests (Eduard). - Rework comment on third test (Eduard). - Add two new negative test cases (Eduard). - Rebased. Changes in v2: - Add guard suggested by Hari in tnum_step, to avoid undefined behavior spotted by AI code review. - Add explanation diagrams in code as suggested by Eduard. - Rework conditions for readability as suggested by Eduard. - Updated reference to SMT formula. - Rebased. ==================== Link: https://patch.msgid.link/cover.1772225741.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 16:11:50 -08:00
Paul Chaignon	024cea2d64	selftests/bpf: Avoid simplification of crafted bounds test The reg_bounds_crafted tests validate the verifier's range analysis logic. They focus on the actual ranges and thus ignore the tnum. As a consequence, they carry the assumption that the tested cases can be reproduced in userspace without using the tnum information. Unfortunately, the previous change the refinement logic breaks that assumption for one test case: (u64)2147483648 (u32)<op> [4294967294; 0x100000000] The tested bytecode is shown below. Without our previous improvement, on the false branch of the condition, R7 is only known to have u64 range [0xfffffffe; 0x100000000]. With our improvement, and using the tnum information, we can deduce that R7 equals 0x100000000. 19: (bc) w0 = w6 ; R6=0x80000000 20: (bc) w0 = w7 ; R7=scalar(smin=umin=0xfffffffe,smax=umax=0x100000000,smin32=-2,smax32=0,var_off=(0x0; 0x1ffffffff)) 21: (be) if w6 <= w7 goto pc+3 ; R6=0x80000000 R7=0x100000000 R7's tnum is (0; 0x1ffffffff). On the false branch, regs_refine_cond_op refines R7's u32 range to [0; 0x7fffffff]. Then, __reg32_deduce_bounds refines the s32 range to 0 using u32 and finally also sets u32=0. From this, __reg_bound_offset improves the tnum to (0; 0x100000000). Finally, our previous patch uses this new tnum to deduce that it only intersect with u64=[0xfffffffe; 0x100000000] in a single value: 0x100000000. Because the verifier uses the tnum to reach this constant value, the selftest is unable to reproduce it by only simulating ranges. The solution implemented in this patch is to change the test case such that there is more than one overlap value between u64 and the tnum. The max. u64 value is thus changed from 0x100000000 to 0x300000000. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/50641c6a7ef39520595dcafa605692427c1006ec.1772225741.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 16:11:50 -08:00
Paul Chaignon	e6ad477d1b	selftests/bpf: Test refinement of single-value tnum This patch introduces selftests to cover the new bounds refinement logic introduced in the previous patch. Without the previous patch, the first two tests fail because of the invariant violation they trigger. The last test fails because the R10 access is not detected as dead code. In addition, all three tests fail because of R0 having a non-constant value in the verifier logs. In addition, the last two cases are covering the negative cases: when we shouldn't refine the bounds because the u64 and tnum overlap in at least two values. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/90d880c8cf587b9f7dc715d8961cd1b8111d01a8.1772225741.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 16:11:50 -08:00
Paul Chaignon	efc11a6678	bpf: Improve bounds when tnum has a single possible value We're hitting an invariant violation in Cilium that sometimes leads to BPF programs being rejected and Cilium failing to start [1]. The following extract from verifier logs shows what's happening: from 201 to 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0 ; if (magic == MARK_MAGIC_HOST \|\| magic == MARK_MAGIC_OVERLAY \|\| magic == MARK_MAGIC_ENCRYPT) @ bpf_host.c:1337 236: (16) if w9 == 0xe00 goto pc+45 ; R9=scalar(smin=umin=smin32=umin32=3585,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) 237: (16) if w9 == 0xf00 goto pc+1 verifier bug: REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0xe01, 0xe00] s64=[0xe01, 0xe00] u32=[0xe01, 0xe00] s32=[0xe01, 0xe00] var_off=(0xe00, 0x0) We reach instruction 236 with two possible values for R9, 0xe00 and 0xf00. This is perfectly reflected in the tnum, but of course the ranges are less accurate and cover [0xe00; 0xf00]. Taking the fallthrough path at instruction 236 allows the verifier to reduce the range to [0xe01; 0xf00]. The tnum is however not updated. With these ranges, at instruction 237, the verifier is not able to deduce that R9 is always equal to 0xf00. Hence the fallthrough pass is explored first, the verifier refines the bounds using the assumption that R9 != 0xf00, and ends up with an invariant violation. This pattern of impossible branch + bounds refinement is common to all invariant violations seen so far. The long-term solution is likely to rely on the refinement + invariant violation check to detect dead branches, as started by Eduard. To fix the current issue, we need something with less refactoring that we can backport. This patch uses the tnum_step helper introduced in the previous patch to detect the above situation. In particular, three cases are now detected in the bounds refinement: 1. The u64 range and the tnum only overlap in umin. u64: ---[xxxxxx]----- tnum: --xx----------x- 2. The u64 range and the tnum only overlap in the maximum value represented by the tnum, called tmax. u64: ---[xxxxxx]----- tnum: xx-----x-------- 3. The u64 range and the tnum only overlap in between umin (excluded) and umax. u64: ---[xxxxxx]----- tnum: xx----x-------x- To detect these three cases, we call tnum_step(tnum, umin), which returns the smallest member of the tnum greater than umin, called tnum_next here. We're in case (1) if umin is part of the tnum and tnum_next is greater than umax. We're in case (2) if umin is not part of the tnum and tnum_next is equal to tmax. Finally, we're in case (3) if umin is not part of the tnum, tnum_next is inferior or equal to umax, and calling tnum_step a second time gives us a value past umax. This change implements these three cases. With it, the above bytecode looks as follows: 0: (85) call bpf_get_prandom_u32#7 ; R0=scalar() 1: (47) r0 \|= 3584 ; R0=scalar(smin=0x8000000000000e00,umin=umin32=3584,smin32=0x80000e00,var_off=(0xe00; 0xfffffffffffff1ff)) 2: (57) r0 &= 3840 ; R0=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) 3: (15) if r0 == 0xe00 goto pc+2 ; R0=3840 4: (15) if r0 == 0xf00 goto pc+1 4: R0=3840 6: (95) exit In addition to the new selftests, this change was also verified with Agni [3]. For the record, the raw SMT is available at [4]. The property it verifies is that: If a concrete value x is contained in all input abstract values, after __update_reg_bounds, it will continue to be contained in all output abstract values. Link: https://github.com/cilium/cilium/issues/44216 [1] Link: https://pchaigno.github.io/test-verifier-complexity.html [2] Link: https://github.com/bpfverif/agni [3] Link: https://pastebin.com/raw/naCfaqNx [4] Fixes: `0df1a55afa` ("bpf: Warn on internal verifier errors") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Tested-by: Marco Schirrmeister <mschirrmeister@gmail.com> Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/ef254c4f68be19bd393d450188946821c588565d.1772225741.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 16:11:50 -08:00

1 2 3 4 5 ...

1427358 Commits