linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-03-21 23:16:50 +08:00

Author	SHA1	Message	Date
Paolo Bonzini	94fe3e6515	Merge tag 'kvm-x86-generic-7.0-rc3' of https://github.com/kvm-x86/linux into HEAD KVM generic changes for 7.0 - Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from being unnecessary and confusing, triggered compiler warnings due to -Wflex-array-member-not-at-end. - Document that vcpu->mutex is take outside of kvm->slots_lock and kvm->slots_arch_lock, which is intentional and desirable despite being rather unintuitive.	2026-03-11 18:01:55 +01:00
Paolo Bonzini	70295a479d	KVM: always define KVM_CAP_SYNC_MMU KVM_CAP_SYNC_MMU is provided by KVM's MMU notifiers, which are now always available. Move the definition from individual architectures to common code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-02-28 15:31:35 +01:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/\(alloc_objs(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Paolo Bonzini	bf2c3138ae	Merge tag 'kvm-x86-pmu-6.20' of https://github.com/kvm-x86/linux into HEAD KVM mediated PMU support for 6.20 Add support for mediated PMUs, where KVM gives the guest full ownership of PMU hardware (contexted switched around the fastpath run loop) and allows direct access to data MSRs and PMCs (restricted by the vPMU model), but intercepts access to control registers, e.g. to enforce event filtering and to prevent the guest from profiling sensitive host state. To keep overall complexity reasonable, mediated PMU usage is all or nothing for a given instance of KVM (controlled via module param). The Mediated PMU is disabled default, partly to maintain backwards compatilibity for existing setup, partly because there are tradeoffs when running with a mediated PMU that may be non-starters for some use cases, e.g. the host loses the ability to profile guests with mediated PMUs, the fastpath run loop is also a blind spot, entry/exit transitions are more expensive, etc. Versus the emulated PMU, where KVM is "just another perf user", the mediated PMU delivers more accurate profiling and monitoring (no risk of contention and thus dropped events), with significantly less overhead (fewer exits and faster emulation/programming of event selectors) E.g. when running Specint-2017 on a single-socket Sapphire Rapids with 56 cores and no-SMT, and using perf from within the guest: Perf command: a. basic-sampling: perf record -F 1000 -e 6-instructions -a --overwrite b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite Guest performance overhead: --------------------------------------------------------------------------- \| Test case \| emulated vPMU \| all passthrough \| passthrough with \| \| \| \| \| event filters \| --------------------------------------------------------------------------- \| basic-sampling \| 33.62% \| 4.24% \| 6.21% \| --------------------------------------------------------------------------- \| multiplex-sampling \| 79.32% \| 7.34% \| 10.45% \| ---------------------------------------------------------------------------	2026-02-11 12:45:40 -05:00
Paolo Bonzini	1b13885edf	Merge tag 'kvm-x86-apic-6.20' of https://github.com/kvm-x86/linux into HEAD KVM x86 APIC-ish changes for 6.20 - Fix a benign bug where KVM could use the wrong memslots (ignored SMM) when creating a vCPU-specific mapping of guest memory. - Clean up KVM's handling of marking mapped vCPU pages dirty. - Drop a pile of ancient sanity checks hidden behind in KVM's unused ASSERT() macro, most of which could be trivially triggered by the guest and/or user, and all of which were useless. - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it more obvious what the weird parameter is used for, and to allow burying the RTC shenanigans behind CONFIG_KVM_IOAPIC=y. - Bury all of ioapic.h and KVM_IRQCHIP_KERNEL behind CONFIG_KVM_IOAPIC=y. - Add a regression test for recent APICv update fixes. - Rework KVM's handling of VMCS updates while L2 is active to temporarily switch to vmcs01 instead of deferring the update until the next nested VM-Exit. The deferred updates approach directly contributed to several bugs, was proving to be a maintenance burden due to the difficulty in auditing the correctness of deferred updates, and was polluting "struct nested_vmx" with a growing pile of booleans. - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv() to consolidate the updates, and to co-locate SVI updates with the updates for KVM's own cache of ISR information. - Drop a dead function declaration.	2026-02-11 12:45:32 -05:00
Paolo Bonzini	9e03b7caf4	Merge tag 'kvm-x86-misc-6.20' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.20 - Disallow changing the virtual CPU model if L2 is active, for all the same reasons KVM disallows change the model after the first KVM_RUN. - Fix a bug where KVM would incorrectly reject host accesses to PV MSRs that were advertised as supported to userspace when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled. - Fix a bug where KVM would attempt to read protect guest state (CR3) when configuring an async #PF entry. - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86 only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL. Explicitly allow the few exports that are intended for external usage. - Ignore -EBUSY when checking nested events after a vCPU exits blocking as the WARN is user-triggerable, and because exiting to userspace on -EBUSY does more harm than good in pretty much every situation. - Throw in the towel and drop the WARN on INIT/SIPI being blocked when vCPU is in Wait-For-SIPI, as playing whack-a-mole with syzkaller turned out to be an unwinnable game. - Add support for new Intel instructions that don't require anything beyond enumerating feature flags to userspace. - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2. - Add WARNs to guard against modifying KVM's CPU caps outside of the intended setup flow, as nested VMX in particular is sensitive to unexpected changes in KVM's golden configuration. - Add a quirk to allow userspace to opt-in to actually suppress EOI broadcasts when the suppression feature is enabled by the guest (currently limited to split IRQCHIP, i.e. userspace I/O APIC). Sadly, simply fixing KVM to honor Suppress EOI Broadcasts isn't an option as some userspaces have come to rely on KVM's buggy behavior (KVM advertises Supress EOI Broadcast irrespective of whether or not userspace I/O APIC supports Directed EOIs). - Minor cleanups.	2026-02-09 18:53:47 +01:00
Paolo Bonzini	4215ee0d7b	Merge tag 'kvm-x86-svm-6.20' of https://github.com/kvm-x86/linux into HEAD KVM SVM changes for 6.20 - Drop a user-triggerable WARN on nested_svm_load_cr3() failure. - Add support for virtualizing ERAPS. Note, correct virtualization of ERAPS relies on an upcoming, publicly announced change in the APM to reduce the set of conditions where hardware (i.e. KVM) must flush the RAP. - Ignore nSVM intercepts for instructions that are not supported according to L1's virtual CPU model. - Add support for expedited writes to the fast MMIO bus, a la VMX's fastpath for EPT Misconfig. - Don't set GIF when clearing EFER.SVME, as GIF exists independently of SVM, and allow userspace to restore nested state with GIF=0. - Treat exit_code as an unsigned 64-bit value through all of KVM. - Add support for fetching SNP certificates from userspace. - Fix a bug where KVM would use vmcb02 instead of vmcb01 when emulating VMLOAD or VMSAVE on behalf of L2. - Misc fixes and cleanups.	2026-02-09 18:51:37 +01:00
Khushit Shah	6517dfbcc9	KVM: x86: Add x2APIC "features" to control EOI broadcast suppression Add two flags for KVM_CAP_X2APIC_API to allow userspace to control support for Suppress EOI Broadcasts when using a split IRQCHIP (I/O APIC emulated by userspace), which KVM completely mishandles. When x2APIC support was first added, KVM incorrectly advertised and "enabled" Suppress EOI Broadcast, without fully supporting the I/O APIC side of the equation, i.e. without adding directed EOI to KVM's in-kernel I/O APIC. That flaw was carried over to split IRQCHIP support, i.e. KVM advertised support for Suppress EOI Broadcasts irrespective of whether or not the userspace I/O APIC implementation supported directed EOIs. Even worse, KVM didn't actually suppress EOI broadcasts, i.e. userspace VMMs without support for directed EOI came to rely on the "spurious" broadcasts. KVM "fixed" the in-kernel I/O APIC implementation by completely disabling support for Suppress EOI Broadcasts in commit `0bcc3fb95b` ("KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use"), but didn't do anything to remedy userspace I/O APIC implementations. KVM's bogus handling of Suppress EOI Broadcast is problematic when the guest relies on interrupts being masked in the I/O APIC until well after the initial local APIC EOI. E.g. Windows with Credential Guard enabled handles interrupts in the following order: 1. Interrupt for L2 arrives. 2. L1 APIC EOIs the interrupt. 3. L1 resumes L2 and injects the interrupt. 4. L2 EOIs after servicing. 5. L1 performs the I/O APIC EOI. Because KVM EOIs the I/O APIC at step #2, the guest can get an interrupt storm, e.g. if the IRQ line is still asserted and userspace reacts to the EOI by re-injecting the IRQ, because the guest doesn't de-assert the line until step #4, and doesn't expect the interrupt to be re-enabled until step #5. Unfortunately, simply "fixing" the bug isn't an option, as KVM has no way of knowing if the userspace I/O APIC supports directed EOIs, i.e. suppressing EOI broadcasts would result in interrupts being stuck masked in the userspace I/O APIC due to step #5 being ignored by userspace. And fully disabling support for Suppress EOI Broadcast is also undesirable, as picking up the fix would require a guest reboot, and more importantly would change the virtual CPU model exposed to the guest without any buy-in from userspace. Add KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST and KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST flags to allow userspace to explicitly enable or disable support for Suppress EOI Broadcasts. This gives userspace control over the virtual CPU model exposed to the guest, as KVM should never have enabled support for Suppress EOI Broadcast without userspace opt-in. Not setting either flag will result in legacy quirky behavior for backward compatibility. Disallow fully enabling SUPPRESS_EOI_BROADCAST when using an in-kernel I/O APIC, as KVM's history/support is just as tragic. E.g. it's not clear that commit `c806a6ad35` ("KVM: x86: call irq notifiers with directed EOI") was entirely correct, i.e. it may have simply papered over the lack of Directed EOI emulation in the I/O APIC. Note, Suppress EOI Broadcasts is defined only in Intel's SDM, not in AMD's APM. But the bit is writable on some AMD CPUs, e.g. Turin, and KVM's ABI is to support Directed EOI (KVM's name) irrespective of guest CPU vendor. Fixes: `7543a635aa` ("KVM: x86: Add KVM exit for IOAPIC EOIs") Closes: https://lore.kernel.org/kvm/7D497EF1-607D-4D37-98E7-DAF95F099342@nutanix.com Cc: stable@vger.kernel.org Suggested-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Khushit Shah <khushit.shah@nutanix.com> Link: https://patch.msgid.link/20260123125657.3384063-1-khushit.shah@nutanix.com [sean: clean up minor formatting goofs and fix a comment typo] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-30 13:28:35 -08:00
Sean Christopherson	f8ade833b7	KVM: x86: Explicitly configure supported XSS from {svm,vmx}_set_cpu_caps() Explicitly configure KVM's supported XSS as part of each vendor's setup flow to fix a bug where clearing SHSTK and IBT in kvm_cpu_caps, e.g. due to lack of CET XFEATURE support, makes kvm-intel.ko unloadable when nested VMX is enabled, i.e. when nested=1. The late clearing results in nested_vmx_setup_{entry,exit}_ctls() clearing VM_{ENTRY,EXIT}_LOAD_CET_STATE when nested_vmx_setup_ctls_msrs() runs during the CPU compatibility checks, ultimately leading to a mismatched VMCS config due to the reference config having the CET bits set, but every CPU's "local" config having the bits cleared. Note, kvm_caps.supported_{xcr0,xss} are unconditionally initialized by kvm_x86_vendor_init(), before calling into vendor code, and not referenced between ops->hardware_setup() and their current/old location. Fixes: `69cc3e8865` ("KVM: x86: Add XSS support for CET_KERNEL and CET_USER") Cc: stable@vger.kernel.org Cc: Mathias Krause <minipli@grsecurity.net> Cc: John Allen <john.allen@amd.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Chao Gao <chao.gao@intel.com> Cc: Binbin Wu <binbin.wu@linux.intel.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260128014310.3255561-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-30 13:27:33 -08:00
Vasiliy Kovalev	95d848dc7e	KVM: x86: Add SRCU protection for reading PDPTRs in __get_sregs2() Add SRCU read-side protection when reading PDPTR registers in __get_sregs2(). Reading PDPTRs may trigger access to guest memory: kvm_pdptr_read() -> svm_cache_reg() -> load_pdptrs() -> kvm_vcpu_read_guest_page() -> kvm_vcpu_gfn_to_memslot() kvm_vcpu_gfn_to_memslot() dereferences memslots via __kvm_memslots(), which uses srcu_dereference_check() and requires either kvm->srcu or kvm->slots_lock to be held. Currently only vcpu->mutex is held, triggering lockdep warning: ============================= WARNING: suspicious RCU usage in kvm_vcpu_gfn_to_memslot 6.12.59+ #3 Not tainted include/linux/kvm_host.h:1062 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by syz.5.1717/15100: #0: ff1100002f4b00b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x1d5/0x1590 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0xf0/0x120 lib/dump_stack.c:120 lockdep_rcu_suspicious+0x1e3/0x270 kernel/locking/lockdep.c:6824 __kvm_memslots include/linux/kvm_host.h:1062 [inline] __kvm_memslots include/linux/kvm_host.h:1059 [inline] kvm_vcpu_memslots include/linux/kvm_host.h:1076 [inline] kvm_vcpu_gfn_to_memslot+0x518/0x5e0 virt/kvm/kvm_main.c:2617 kvm_vcpu_read_guest_page+0x27/0x50 virt/kvm/kvm_main.c:3302 load_pdptrs+0xff/0x4b0 arch/x86/kvm/x86.c:1065 svm_cache_reg+0x1c9/0x230 arch/x86/kvm/svm/svm.c:1688 kvm_pdptr_read arch/x86/kvm/kvm_cache_regs.h:141 [inline] __get_sregs2 arch/x86/kvm/x86.c:11784 [inline] kvm_arch_vcpu_ioctl+0x3e20/0x4aa0 arch/x86/kvm/x86.c:6279 kvm_vcpu_ioctl+0x856/0x1590 virt/kvm/kvm_main.c:4663 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl fs/ioctl.c:893 [inline] __x64_sys_ioctl+0x18b/0x210 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xbd/0x1d0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Suggested-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Fixes: `6dba940352` ("KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2") Signed-off-by: Vasiliy Kovalev <kovalev@altlinux.org> Link: https://patch.msgid.link/20260123222801.646123-1-kovalev@altlinux.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-23 15:09:59 -08:00
Jun Miao	de0dc71188	KVM: x86: align the code with kvm_x86_call() The use of static_call_cond() is essentially the same as static_call() on x86 (e.g. static_call() now handles a NULL pointer as a NOP), and then the kvm_x86_call() is added to improve code readability and maintainability for keeping consistent code style. Fixes `8d032b683c` ("KVM: TDX: create/destroy VM structure") Link: https://lore.kernel.org/all/3916caa1dcd114301a49beafa5030eca396745c1.1679456900.git.jpoimboe@kernel.org/ Link: https://lore.kernel.org/r/20240507133103.15052-3-wei.w.wang@intel.com Signed-off-by: Jun Miao <jun.miao@intel.com> Link: https://patch.msgid.link/20260105065423.1870622-1-jun.miao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-13 17:36:44 -08:00
Sean Christopherson	ead63640d4	KVM: x86: Ignore -EBUSY when checking nested events from vcpu_block() Ignore -EBUSY when checking nested events after exiting a blocking state while L2 is active, as exiting to userspace will generate a spurious userspace exit, usually with KVM_EXIT_UNKNOWN, and likely lead to the VM's demise. Continuing with the wakeup isn't perfect either, as something has gone sideways if a vCPU is awakened in L2 with an injected event (or worse, a nested run pending), but continuing on gives the VM a decent chance of surviving without any major side effects. As explained in the Fixes commits, it _should_ be impossible for a vCPU to be put into a blocking state with an already-injected event (exception, IRQ, or NMI). Unfortunately, userspace can stuff MP_STATE and/or injected events, and thus put the vCPU into what should be an impossible state. Don't bother trying to preserve the WARN, e.g. with an anti-syzkaller Kconfig, as WARNs can (hopefully) be added in paths where _KVM_ would be violating x86 architecture, e.g. by WARNing if KVM attempts to inject an exception or interrupt while the vCPU isn't running. Cc: Alessandro Ratti <alessandro@0x65c.net> Cc: stable@vger.kernel.org Fixes: `26844fee6a` ("KVM: x86: never write to memory from kvm_vcpu_check_block()") Fixes: `45405155d8` ("KVM: x86: WARN if a vCPU gets a valid wakeup that KVM can't yet inject") Link: https://syzkaller.appspot.com/text?tag=ReproC&x=10d4261a580000 Reported-by: syzbot+1522459a74d26b0ac33a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/671bc7a7.050a0220.455e8.022a.GAE@google.com Link: https://patch.msgid.link/20260109030657.994759-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-13 17:36:09 -08:00
Sean Christopherson	000d75b0b1	KVM: x86: Update APICv ISR (a.k.a. SVI) as part of kvm_apic_update_apicv() Fold the calls to .hwapic_isr_update() in kvm_apic_set_state(), kvm_lapic_reset(), and __kvm_vcpu_update_apicv() into kvm_apic_update_apicv(), as updating SVI is directly related to updating KVM's own cache of ISR information, e.g. SVI is more or less the APICv equivalent of highest_isr_cache. Note, calling .hwapic_isr_update() during kvm_apic_update_apicv() has benign side effects, as doing so changes the orders of the calls in kvm_lapic_reset() and kvm_apic_set_state(), specifically with respect to to the order between .hwapic_isr_update() and .apicv_post_state_restore(). However, the changes in ordering are glorified nops as the former hook is VMX-only and the latter is SVM-only. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://patch.msgid.link/20260109034532.1012993-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-13 17:35:32 -08:00
Sean Christopherson	1a5d7f9540	KVM: x86: Add a wrapper to handle common case of IRQ delivery without dest_map Turn kvm_irq_delivery_to_apic() into a wrapper that passes NULL for the @dest_map param, as only the ugly I/O APIC RTC hackery needs to know which vCPUs received the IRQ. No functional change intended. Link: https://patch.msgid.link/20251206004311.479939-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-12 09:31:40 -08:00
Sean Christopherson	4d846f1838	KVM: x86: Drop MAX_NR_RESERVED_IOAPIC_PINS, use KVM_MAX_IRQ_ROUTES directly Directly use KVM_MAX_IRQ_ROUTES when checking the number of routes being defined by userspace when creating a split IRQCHIP. The restriction has nothing to do with the I/O APIC, e.g. most modern userspace usage is for routing MSIs. Breaking the unnecessary dependency on the I/O APIC will allow burying all of ioapic.h behind CONFIG_KVM_IOAPIC=y. No functional change intended. Link: https://patch.msgid.link/20251206004311.479939-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-12 09:31:39 -08:00
Sean Christopherson	b45f721775	x86/fpu: Clear XSTATE_BV[i] in guest XSAVE state whenever XFD[i]=1 When loading guest XSAVE state via KVM_SET_XSAVE, and when updating XFD in response to a guest WRMSR, clear XFD-disabled features in the saved (or to be restored) XSTATE_BV to ensure KVM doesn't attempt to load state for features that are disabled via the guest's XFD. Because the kernel executes XRSTOR with the guest's XFD, saving XSTATE_BV[i]=1 with XFD[i]=1 will cause XRSTOR to #NM and panic the kernel. E.g. if fpu_update_guest_xfd() sets XFD without clearing XSTATE_BV: ------------[ cut here ]------------ WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#29: amx_test/848 Modules linked in: kvm_intel kvm irqbypass CPU: 29 UID: 1000 PID: 848 Comm: amx_test Not tainted 6.19.0-rc2-ffa07f7fd437-x86_amx_nm_xfd_non_init-vm #171 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:exc_device_not_available+0x101/0x110 Call Trace: <TASK> asm_exc_device_not_available+0x1a/0x20 RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90 switch_fpu_return+0x4a/0xb0 kvm_arch_vcpu_ioctl_run+0x1245/0x1e40 [kvm] kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm] __x64_sys_ioctl+0x8f/0xd0 do_syscall_64+0x62/0x940 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> ---[ end trace 0000000000000000 ]--- This can happen if the guest executes WRMSR(MSR_IA32_XFD) to set XFD[18] = 1, and a host IRQ triggers kernel_fpu_begin() prior to the vmexit handler's call to fpu_update_guest_xfd(). and if userspace stuffs XSTATE_BV[i]=1 via KVM_SET_XSAVE: ------------[ cut here ]------------ WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#14: amx_test/867 Modules linked in: kvm_intel kvm irqbypass CPU: 14 UID: 1000 PID: 867 Comm: amx_test Not tainted 6.19.0-rc2-2dace9faccd6-x86_amx_nm_xfd_non_init-vm #168 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:exc_device_not_available+0x101/0x110 Call Trace: <TASK> asm_exc_device_not_available+0x1a/0x20 RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90 fpu_swap_kvm_fpstate+0x6b/0x120 kvm_load_guest_fpu+0x30/0x80 [kvm] kvm_arch_vcpu_ioctl_run+0x85/0x1e40 [kvm] kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm] __x64_sys_ioctl+0x8f/0xd0 do_syscall_64+0x62/0x940 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> ---[ end trace 0000000000000000 ]--- The new behavior is consistent with the AMX architecture. Per Intel's SDM, XSAVE saves XSTATE_BV as '0' for components that are disabled via XFD (and non-compacted XSAVE saves the initial configuration of the state component): If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i, the instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1; instead, it operates as if XINUSE[i] = 0 (and the state component was in its initial state): it saves bit i of XSTATE_BV field of the XSAVE header as 0; in addition, XSAVE saves the initial configuration of the state component (the other instructions do not save state component i). Alternatively, KVM could always do XRSTOR with XFD=0, e.g. by using a constant XFD based on the set of enabled features when XSAVEing for a struct fpu_guest. However, having XSTATE_BV[i]=1 for XFD-disabled features can only happen in the above interrupt case, or in similar scenarios involving preemption on preemptible kernels, because fpu_swap_kvm_fpstate()'s call to save_fpregs_to_fpstate() saves the outgoing FPU state with the current XFD; and that is (on all but the first WRMSR to XFD) the guest XFD. Therefore, XFD can only go out of sync with XSTATE_BV in the above interrupt case, or in similar scenarios involving preemption on preemptible kernels, and it we can consider it (de facto) part of KVM ABI that KVM_GET_XSAVE returns XSTATE_BV[i]=0 for XFD-disabled features. Reported-by: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Fixes: `820a6ee944` ("kvm: x86: Add emulation for IA32_XFD", 2022-01-14) Signed-off-by: Sean Christopherson <seanjc@google.com> [Move clearing of XSTATE_BV from fpu_copy_uabi_to_guest_fpstate to kvm_vcpu_ioctl_x86_set_xsave. - Paolo] Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-01-10 07:17:16 +01:00
Amit Shah	db5e824964	KVM: SVM: Virtualize and advertise support for ERAPS AMD CPUs with the Enhanced Return Address Predictor Security (ERAPS) feature (available on Zen5+) obviate the need for FILL_RETURN_BUFFER sequences right after VMEXITs. ERAPS adds guest/host tags to entries in the RSB (a.k.a. RAP). This helps with speculation protection across the VM boundary, and it also preserves host and guest entries in the RSB that can improve software performance (which would otherwise be flushed due to the FILL_RETURN_BUFFER sequences). Importantly, ERAPS also improves cross-domain security by clearing the RAP in certain situations. Specifically, the RAP is cleared in response to actions that are typically tied to software context switching between tasks. Per the APM: The ERAPS feature eliminates the need to execute CALL instructions to clear the return address predictor in most cases. On processors that support ERAPS, return addresses from CALL instructions executed in host mode are not used in guest mode, and vice versa. Additionally, the return address predictor is cleared in all cases when the TLB is implicitly invalidated and in the following cases: • MOV CR3 instruction • INVPCID other than single address invalidation (operation type 0) ERAPS also allows CPUs to extends the size of the RSB/RAP from the older standard (of 32 entries) to a new size, enumerated in CPUID leaf 0x80000021:EBX bits 23:16 (64 entries in Zen5 CPUs). In hardware, ERAPS is always-on, when running in host context, the CPU uses the full RSB/RAP size without any software changes necessary. However, when running in guest context, the CPU utilizes the full size of the RSB/RAP if and only if the new ALLOW_LARGER_RAP flag is set in the VMCB; if the flag is not set, the CPU limits itself to the historical size of 32 entires. Requiring software to opt-in for guest usage of RAPs larger than 32 entries allows hypervisors, i.e. KVM, to emulate the aforementioned conditions in which the RAP is cleared as well as the guest/host split. E.g. if the CPU unconditionally used the full RAP for guests, failure to clear the RAP on transitions between L1 or L2, or on emulated guest TLB flushes, would expose the guest to RAP-based attacks as a guest without support for ERAPS wouldn't know that its FILL_RETURN_BUFFER sequence is insufficient. Address the ~two broad categories of ERAPS emulation, and advertise ERAPS support to userspace, along with the RAP size enumerated in CPUID. 1. Architectural RAP clearing: as above, CPUs with ERAPS clear RAP entries on several conditions, including CR3 updates. To handle scenarios where a relevant operation is handled in common code (emulation of INVPCID and to a lesser extent MOV CR3), piggyback VCPU_EXREG_CR3 and create an alias, VCPU_EXREG_ERAPS. SVM doesn't utilize CR3 dirty tracking, and so for all intents and purposes VCPU_EXREG_CR3 is unused. Aliasing VCPU_EXREG_ERAPS ensures that any flow that writes CR3 will also clear the guest's RAP, and allows common x86 to mark ERAPS vCPUs as needing a RAP clear without having to add a new request (or other mechanism). 2. Nested guests: the ERAPS feature adds host/guest tagging to entries in the RSB, but does not distinguish between the guest ASIDs. To prevent the case of an L2 guest poisoning the RSB to attack the L1 guest, the CPU exposes a new VMCB bit (CLEAR_RAP). The next VMRUN with a VMCB that has this bit set causes the CPU to flush the RSB before entering the guest context. Set the bit in VMCB01 after a nested #VMEXIT to ensure the next time the L1 guest runs, its RSB contents aren't polluted by the L2's contents. Similarly, before entry into a nested guest, set the bit for VMCB02, so that the L1 guest's RSB contents are not leaked/used in the L2 context. Enable ALLOW_LARGER_RAP (and emulate RAP clears) if and only if ERAPS is exposed to the guest. Enabling ALLOW_LARGER_RAP unconditionally wouldn't cause any functional issues, but ignoring userspace's (and L1's) desires would put KVM into a grey area, which is especially undesirable due to the potential security implications. E.g. if a use case wants to have L1 do manual RAP clearing even when ERAPS is present in hardware, enabling ALLOW_LARGER_RAP could result in L1 leaving stale entries in the RAP. ERAPS is documented in AMD APM Vol 2 (Pub 24593), in revisions 3.43 and later. Signed-off-by: Amit Shah <amit.shah@amd.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Amit Shah <amit.shah@amd.com> Link: https://patch.msgid.link/aR913X8EqO6meCqa@google.com	2026-01-08 12:12:12 -08:00
Sean Christopherson	5bb9ac1865	KVM: x86: Return "unsupported" instead of "invalid" on access to unsupported PV MSR Return KVM_MSR_RET_UNSUPPORTED instead of '1' (which for all intents and purposes means "invalid") when rejecting accesses to KVM PV MSRs to adhere to KVM's ABI of allowing host reads and writes of '0' to MSRs that are advertised to userspace via KVM_GET_MSR_INDEX_LIST, even if the vCPU model doesn't support the MSR. E.g. running a QEMU VM with -cpu host,-kvmclock,kvm-pv-enforce-cpuid yields: qemu: error: failed to set MSR 0x12 to 0x0 qemu: target/i386/kvm/kvm.c:3301: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed. Fixes: `66570e966d` ("kvm: x86: only provide PV features if enabled in guest's CPUID") Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20251230205948.4094097-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:59:53 -08:00
Sean Christopherson	f7a65e58d6	KVM: x86/pmu: Disallow emulation in the fastpath if mediated PMCs are active Don't handle exits in the fastpath if emulation is required, i.e. if an instruction needs to be skipped, the mediated PMU is enabled, and one or more PMCs is counting instructions. With the mediated PMU, KVM's cache of PMU state is inconsistent with respect to hardware until KVM exits the inner run loop (when the mediated PMU is "put"). Reviewed-by: Sandipan Das <sandipan.das@amd.com> Tested-by: Xudong Hao <xudong.hao@intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-29-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:12 -08:00
Dapeng Mi	56bb273697	KVM: x86/pmu: Load/put mediated PMU context when entering/exiting guest Implement the PMU "world switch" between host perf and guest mediated PMU. When loading guest state, call into perf to switch from host to guest, and then load guest state into hardware, and then reverse those actions when putting guest state. On the KVM side, when loading guest state, zero PERF_GLOBAL_CTRL to ensure all counters are disabled, then load selectors and counters, and finally call into vendor code to load control/status information. While VMX and SVM use different mechanisms to avoid counting host activity while guest controls are loaded, both implementations require PERF_GLOBAL_CTRL to be zeroed when the event selectors are in flux. When putting guest state, reverse the order, and save and zero controls and status prior to saving+zeroing selectors and counters. Defer clearing PERF_GLOBAL_CTRL to vendor code, as only SVM needs to manually clear the MSR; VMX configures PERF_GLOBAL_CTRL to be atomically cleared by the CPU on VM-Exit. Handle the difference in MSR layouts between Intel and AMD by communicating the bases and stride via kvm_pmu_ops. Because KVM requires Intel v4 (and full-width writes) and AMD v2, the MSRs to load/save are constant for a given vendor, i.e. do not vary based on the guest PMU, and do not vary based on host PMU (because KVM will simply disable mediated PMU support if the necessary MSRs are unsupported). Except for retrieving the guest's PERF_GLOBAL_CTRL, which needs to be read before invoking any fastpath handler (spoiler alert), perform the context switch around KVM's inner run loop. State only needs to be synchronized from hardware before KVM can access the software "caches". Note, VMX already grabs the guest's PERF_GLOBAL_CTRL immediately after VM-Exit, as hardware saves value into the VMCS. Co-developed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Co-developed-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Xudong Hao <xudong.hao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-28-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:11 -08:00
Dapeng Mi	8062427212	KVM: x86/pmu: Disable RDPMC interception for compatible mediated vPMU Disable RDPMC interception for vCPUs with a mediated vPMU that is compatible with the host PMU, i.e. that doesn't require KVM emulation of RDPMC to honor the guest's vCPU model. With a mediated vPMU, all guest state accessible via RDPMC is loaded into hardware while the guest is running. Adust RDPMC interception only for non-TDX guests, as the TDX module is responsible for managing RDPMC intercepts based on the TD configuration. Co-developed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Co-developed-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Xudong Hao <xudong.hao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:06 -08:00
Xiong Zhang	1c4ba7286a	KVM: x86/pmu: Register PMI handler for mediated vPMU Register a dedicated PMI handler with perf's callback when mediated PMU support is enabled. Perf routes PMIs that arrive while guest context is loaded to the provided callback, by modifying the CPU's LVTPC to point at a dedicated mediated PMI IRQ vector. WARN upon receipt of a mediated PMI if there is no active vCPU, or if the vCPU doesn't have a mediated PMU. Even if a PMI manages to skid past VM-Exit, it should never be delayed all the way beyond unloading the vCPU. And while running vCPUs without a mediated PMU, the LVTPC should never be wired up to the mediated PMI IRQ vector, i.e. should always be routed through perf's NMI handler. Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Tested-by: Xudong Hao <xudong.hao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:06 -08:00
Dapeng Mi	3e51822b2f	KVM: x86/pmu: Start stubbing in mediated PMU support Introduce enable_mediated_pmu as a global variable, with the intent of exposing it to userspace a vendor module parameter, to control and reflect mediated vPMU support. Wire up the perf plumbing to create+release a mediated PMU, but defer exposing the parameter to userspace until KVM support for a mediated PMUs is fully landed. To (a) minimize compatibility issues, (b) to give userspace a chance to opt out of the restrictive side-effects of perf_create_mediated_pmu(), and (c) to avoid adding new dependencies between enabling an in-kernel irqchip and a mediated vPMU, defer "creating" a mediated PMU in perf until the first vCPU is created. Regarding userspace compatibility, an alternative solution would be to make the mediated PMU fully opt-in, e.g. to avoid unexpected failure due to perf_create_mediated_pmu() failing. Ironically, that approach creates an even bigger compatibility issue, as turning on enable_mediated_pmu would silently break VMMs that don't utilize KVM_CAP_PMU_CAPABILITY (well, silently until the guest tried to access PMU assets). Regarding an in-kernel irqchip, create a mediated PMU if and only if the VM has an in-kernel local APIC, as the mediated PMU will take a hard dependency on forwarding PMIs to the guest without bouncing through host userspace. Silently "drop" the PMU instead of rejecting KVM_CREATE_VCPU, as KVM's existing vPMU support doesn't function correctly if the local APIC is emulated by userspace, e.g. PMIs will never be delivered. I.e. it's far, far more likely that rejecting KVM_CREATE_VCPU would cause problems, e.g. for tests or userspace daemons that just want to probe basic KVM functionality. Note! Deliberately make mediated PMU creation "sticky", i.e. don't unwind it on failure to create a vCPU. Practically speaking, there's no harm to having a VM with a mediated PMU and no vCPUs. To avoid an "impossible" VM setup, reject KVM_CAP_PMU_CAPABILITY if a mediated PMU has been created, i.e. don't let userspace disable PMU support after failed vCPU creation (with PMU support enabled). Defer vendor specific requirements and constraints to the future. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Co-developed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Tested-by: Xudong Hao <xudong.hao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:04 -08:00
Sean Christopherson	4b24910c05	KVM: Add a simplified wrapper for registering perf callbacks Add a parameter-less API for registering perf callbacks in anticipation of introducing another x86-only parameter for handling mediated PMU PMIs. No functional change intended. Acked-by: Anup Patel <anup@brainfault.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:51:39 -08:00
Sean Christopherson	da142f3d37	KVM: Remove subtle "struct kvm_stats_desc" pseudo-overlay Remove KVM's internal pseudo-overlay of kvm_stats_desc, which subtly aliases the flexible name[] in the uAPI definition with a fixed-size array of the same name. The unusual embedded structure results in compiler warnings due to -Wflex-array-member-not-at-end, and also necessitates an extra level of dereferencing in KVM. To avoid the "overlay", define the uAPI structure to have a fixed-size name when building for the kernel. Opportunistically clean up the indentation for the stats macros, and replace spaces with tabs. No functional change intended. Reported-by: Gustavo A. R. Silva <gustavoars@kernel.org> Closes: https://lore.kernel.org/all/aPfNKRpLfhmhYqfP@kspp Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> [..] Acked-by: Anup Patel <anup@brainfault.org> Reviewed-by: Bibo Mao <maobibo@loongson.cn> Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://patch.msgid.link/20251205232655.445294-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 10:40:48 -08:00
Sean Christopherson	b47b93c15b	KVM: x86: Disallow setting CPUID and/or feature MSRs if L2 is active Extend KVM's restriction on CPUID and feature MSR changes to disallow updates while L2 is active in addition to rejecting updates after the vCPU has run at least once. Like post-run vCPU model updates, attempting to react to model changes while L2 is active is practically infeasible, e.g. KVM would need to do _something_ in response to impossible situations where userspace has a removed a feature that was consumed as parted of nested VM-Enter. In practice, disallowing vCPU model changes while L2 is active is largely uninteresting, as the only way for L2 to be active without the vCPU having run at least once is if userspace stuffed state via KVM_SET_NESTED_STATE. And because KVM_SET_NESTED_STATE can't put the vCPU into L2 without userspace first defining the vCPU model, e.g. to enable SVM/VMX, modifying the vCPU model while L2 is active would require deliberately setting the vCPU model, then loading nested state, and then changing the model. I.e. no sane VMM should run afoul of the new restriction, and any VMM that does encounter problems has likely been running a broken setup for a long time. Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Kevin Cheng <chengkev@google.com> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230205641.4092235-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-07 12:44:42 -08:00
Paolo Bonzini	0499add8ef	Merge tag 'kvm-x86-fixes-6.19-rc1' of https://github.com/kvm-x86/linux into HEAD KVM fixes for 6.19-rc1 - Add a missing "break" to fix param parsing in the rseq selftest. - Apply runtime updates to the _current_ CPUID when userspace is setting CPUID, e.g. as part of vCPU hotplug, to fix a false positive and to avoid dropping the pending update. - Disallow toggling KVM_MEM_GUEST_MEMFD on an existing memslot, as it's not supported by KVM and leads to a use-after-free due to KVM failing to unbind the memslot from the previously-associated guest_memfd instance. - Harden against similar KVM_MEM_GUEST_MEMFD goofs, and prepare for supporting flags-only changes on KVM_MEM_GUEST_MEMFD memlslots, e.g. for dirty logging. - Set exit_code[63:32] to -1 (all 0xffs) when synthesizing a nested SVM_EXIT_ERR (a.k.a. VMEXIT_INVALID) #VMEXIT, as VMEXIT_INVALID is defined as -1ull (a 64-bit value). - Update SVI when activating APICv to fix a bug where a post-activation EOI for an in-service IRQ would effective be lost due to SVI being stale. - Immediately refresh APICv controls (if necessary) on a nested VM-Exit instead of deferring the update via KVM_REQ_APICV_UPDATE, as the request is effectively ignored because KVM thinks the vCPU already has the correct APICv settings.	2025-12-18 18:38:45 +01:00
Dongli Zhang	b2849bec93	KVM: VMX: Update SVI during runtime APICv activation The APICv (apic->apicv_active) can be activated or deactivated at runtime, for instance, because of APICv inhibit reasons. Intel VMX employs different mechanisms to virtualize LAPIC based on whether APICv is active. When APICv is activated at runtime, GUEST_INTR_STATUS is used to configure and report the current pending IRR and ISR states. Unless a specific vector is explicitly included in EOI_EXIT_BITMAP, its EOI will not be trapped to KVM. Intel VMX automatically clears the corresponding ISR bit based on the GUEST_INTR_STATUS.SVI field. When APICv is deactivated at runtime, the VM_ENTRY_INTR_INFO_FIELD is used to specify the next interrupt vector to invoke upon VM-entry. The VMX IDT_VECTORING_INFO_FIELD is used to report un-invoked vectors on VM-exit. EOIs are always trapped to KVM, so the software can manually clear pending ISR bits. There are scenarios where, with APICv activated at runtime, a guest-issued EOI may not be able to clear the pending ISR bit. Taking vector 236 as an example, here is one scenario. 1. Suppose APICv is inactive. Vector 236 is pending in the IRR. 2. To handle KVM_REQ_EVENT, KVM moves vector 236 from the IRR to the ISR, and configures the VM_ENTRY_INTR_INFO_FIELD via vmx_inject_irq(). 3. After VM-entry, vector 236 is invoked through the guest IDT. At this point, the data in VM_ENTRY_INTR_INFO_FIELD is no longer valid. The guest interrupt handler for vector 236 is invoked. 4. Suppose a VM exit occurs very early in the guest interrupt handler, before the EOI is issued. 5. Nothing is reported through the IDT_VECTORING_INFO_FIELD because vector 236 has already been invoked in the guest. 6. Now, suppose APICv is activated. Before the next VM-entry, KVM calls kvm_vcpu_update_apicv() to activate APICv. 7. Unfortunately, GUEST_INTR_STATUS.SVI is not configured, although vector 236 is still pending in the ISR. 8. After VM-entry, the guest finally issues the EOI for vector 236. However, because SVI is not configured, vector 236 is not cleared. 9. ISR is stalled forever on vector 236. Here is another scenario. 1. Suppose APICv is inactive. Vector 236 is pending in the IRR. 2. To handle KVM_REQ_EVENT, KVM moves vector 236 from the IRR to the ISR, and configures the VM_ENTRY_INTR_INFO_FIELD via vmx_inject_irq(). 3. VM-exit occurs immediately after the next VM-entry. The vector 236 is not invoked through the guest IDT. Instead, it is saved to the IDT_VECTORING_INFO_FIELD during the VM-exit. 4. KVM calls kvm_queue_interrupt() to re-queue the un-invoked vector 236 into vcpu->arch.interrupt. A KVM_REQ_EVENT is requested. 5. Now, suppose APICv is activated. Before the next VM-entry, KVM calls kvm_vcpu_update_apicv() to activate APICv. 6. Although APICv is now active, KVM still uses the legacy VM_ENTRY_INTR_INFO_FIELD to re-inject vector 236. GUEST_INTR_STATUS.SVI is not configured. 7. After the next VM-entry, vector 236 is invoked through the guest IDT. Finally, an EOI occurs. However, due to the lack of GUEST_INTR_STATUS.SVI configuration, vector 236 is not cleared from the ISR. 8. ISR is stalled forever on vector 236. Using QEMU as an example, vector 236 is stuck in ISR forever. (qemu) info lapic 1 dumping local APIC state for CPU 1 LVT0 0x00010700 active-hi edge masked ExtINT (vec 0) LVT1 0x00010400 active-hi edge masked NMI LVTPC 0x00000400 active-hi edge NMI LVTERR 0x000000fe active-hi edge Fixed (vec 254) LVTTHMR 0x00010000 active-hi edge masked Fixed (vec 0) LVTT 0x000400ec active-hi edge tsc-deadline Fixed (vec 236) Timer DCR=0x0 (divide by 2) initial_count = 0 current_count = 0 SPIV 0x000001ff APIC enabled, focus=off, spurious vec 255 ICR 0x000000fd physical edge de-assert no-shorthand ICR2 0x00000000 cpu 0 (X2APIC ID) ESR 0x00000000 ISR 236 IRR 37(level) 236 The issue isn't applicable to AMD SVM as KVM simply writes vmcb01 directly irrespective of whether L1 (vmcs01) or L2 (vmcb02) is active (unlike VMX, there is no need/cost to switch between VMCBs). In addition, APICV_INHIBIT_REASON_IRQWIN ensures AMD SVM AVIC is not activated until the last interrupt is EOI'd. Fix the bug by configuring Intel VMX GUEST_INTR_STATUS.SVI if APICv is activated at runtime. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://patch.msgid.link/20251110063212.34902-1-dongli.zhang@oracle.com [sean: call out that SVM writes vmcb01 directly, tweak comment] Link: https://patch.msgid.link/20251205231913.441872-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-12-08 06:56:29 -08:00
Paolo Bonzini	679fcce002	Merge tag 'kvm-x86-svm-6.19' of https://github.com/kvm-x86/linux into HEAD KVM SVM changes for 6.19: - Fix a few missing "VMCB dirty" bugs. - Fix the worst of KVM's lack of EFER.LMSLE emulation. - Add AVIC support for addressing 4k vCPUs in x2AVIC mode. - Fix incorrect handling of selective CR0 writes when checking intercepts during emulation of L2 instructions. - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on VMRUN and #VMEXIT. - Fix a bug where KVM corrupt the guest code stream when re-injecting a soft interrupt if the guest patched the underlying code after the VM-Exit, e.g. when Linux patches code with a temporary INT3. - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to userspace, and extend KVM "support" to all policy bits that don't require any actual support from KVM.	2025-11-26 09:48:39 +01:00
Paolo Bonzini	de8e8ebb1a	Merge tag 'kvm-x86-tdx-6.19' of https://github.com/kvm-x86/linux into HEAD KVM TDX changes for 6.19: - Overhaul the TDX code to address systemic races where KVM (acting on behalf of userspace) could inadvertantly trigger lock contention in the TDX-Module, which KVM was either working around in weird, ugly ways, or was simply oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever selftests). - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if creating said vCPU failed partway through. - Fix a few sparse warnings (bad annotation, 0 != NULL). - Use struct_size() to simplify copying capabilities to userspace.	2025-11-26 09:36:37 +01:00
Paolo Bonzini	e64dcfab57	Merge tag 'kvm-x86-misc-6.19' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.19: - Fix an async #PF bug where KVM would clear the completion queue when the guest transitioned in and out of paging mode, e.g. when handling an SMI and then returning to paged mode via RSM. - Fix a bug where TDX would effectively corrupt user-return MSR values if the TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected. - Leave the user-return notifier used to restore MSRs registered when disabling virtualization, and instead pin kvm.ko. Restoring host MSRs via IPI callback is either pointless (clean reboot) or dangerous (forced reboot) since KVM has no idea what code it's interrupting. - Use the checked version of {get,put}_user(), as Linus wants to kill them off, and they're measurably faster on modern CPUs due to the unchecked versions containing an LFENCE. - Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC timers can result in a hard lockup in the host. - Revert the periodic kvmclock sync logic now that KVM doesn't use a clocksource that's subject to NPT corrections. - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter behind CONFIG_CPU_MITIGATIONS. - Context switch XCR0, XSS, and PKRU outside of the entry/exit fastpath as the only reason they were handled in the faspath was to paper of a bug in the core #MC code that has long since been fixed. - Add emulator support for AVX MOV instructions to play nice with emulated devices whose PCI BARs guest drivers like to access with large multi-byte instructions.	2025-11-26 09:34:21 +01:00
Paolo Bonzini	f106797f81	KVM: x86: Add x86_emulate_ops.get_xcr() callback This will be necessary in order to check whether AVX is enabled. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com> Link: https://patch.msgid.link/20251114003633.60689-7-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-19 14:29:55 -08:00
Binbin Wu	0b28f21ad4	KVM: x86: Add a helper to dedup loading guest/host XCR0 and XSS Add and use a helper, kvm_load_xfeatures(), to dedup the code that loads guest/host xfeatures. Opportunistically return early if X86_CR4_OSXSAVE is not set to reduce indentations. No functional change intended. Suggested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://patch.msgid.link/20251110050539.3398759-1-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-19 05:41:12 -08:00
Sean Christopherson	7649412af3	KVM: x86: Load guest/host PKRU outside of the fastpath run loop Move KVM's swapping of PKRU outside of the fastpath loop, as there is no KVM code anywhere in the fastpath that accesses guest/userspace memory, i.e. that can consume protection keys. As documented by commit `1be0e61c1f` ("KVM, pkeys: save/restore PKRU when guest/host switches"), KVM just needs to ensure the host's PKRU is loaded when KVM (or the kernel at-large) may access userspace memory. And at the time of commit `1be0e61c1f`, KVM didn't have a fastpath, and PKU was strictly contained to VMX, i.e. there was no reason to swap PKRU outside of vmx_vcpu_run(). Over time, the "need" to swap PKRU close to VM-Enter was likely falsely solidified by the association with XFEATUREs in commit `37486135d3` ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c"), and XFEATURE swapping was in turn moved close to VM-Enter/VM-Exit as a KVM hack-a-fix ution for an #MC handler bug by commit `1811d979c7` ("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context"). Deferring the PKRU loads shaves ~40 cycles off the fastpath for Intel, and ~60 cycles for AMD. E.g. using INVD in KVM-Unit-Test's vmexit.c, with extra hacks to enable CR4.PKE and PKRU=(-1u & ~0x3), latency numbers for AMD Turin go from ~1560 => ~1500, and for Intel Emerald Rapids, go from ~810 => ~770. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Jon Kohler <jon@nutanix.com> Link: https://patch.msgid.link/20251118222328.2265758-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-19 05:41:11 -08:00
Sean Christopherson	75c69c82f2	KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run loop Move KVM's swapping of XFEATURE masks, i.e. XCR0 and XSS, out of the fastpath loop now that the guts of the #MC handler runs in task context, i.e. won't invoke schedule() with preemption disabled and clobber state (or crash the kernel) due to trying to context switch XSTATE with a mix of host and guest state. For all intents and purposes, this reverts commit `1811d979c7` ("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context"), which papered over an egregious bug/flaw in the #MC handler where it would do schedule() even though IRQs are disabled. E.g. the call stack from the commit: kvm_load_guest_xcr0 ... kvm_x86_ops->run(vcpu) vmx_vcpu_run vmx_complete_atomic_exit kvm_machine_check do_machine_check do_memory_failure memory_failure lock_page Commit `1811d979c7` "fixed" the immediate issue of XRSTORS exploding, but completely ignored that scheduling out a vCPU task while IRQs and preemption is wildly broken. Thankfully, commit `5567d11c21` ("x86/mce: Send #MC singal from task work") (somewhat incidentally?) fixed that flaw by pushing the meat of the work to the user-return path, i.e. to task context. KVM has also hardened itself against #MC goofs by moving #MC forwarding to kvm_x86_ops.handle_exit_irqoff(), i.e. out of the fastpath. While that's by no means a robust fix, restoring as much state as possible before handling the #MC will hopefully provide some measure of protection in the event that #MC handling goes off the rails again. Note, KVM always intercepts XCR0 writes for vCPUs without protected state, e.g. there's no risk of consuming a stale XCR0 when determining if a PKRU update is needed; kvm_load_host_xfeatures() only reads, and never writes, vcpu->arch.xcr0. Deferring the XCR0 and XSS loads shaves ~300 cycles off the fastpath for Intel, and ~500 cycles for AMD. E.g. using INVD in KVM-Unit-Test's vmexit.c, which an extra hack to enable CR4.OXSAVE, latency numbers for AMD Turin go from ~2000 => 1500, and for Intel Emerald Rapids, go from ~1300 => ~1000. Cc: Jon Kohler <jon@nutanix.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Jon Kohler <jon@nutanix.com> Link: https://patch.msgid.link/20251118222328.2265758-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-19 05:41:10 -08:00
Brendan Jackman	38ee66cb18	KVM: x86: Unify L1TF flushing under per-CPU variable Currently the tracking of the need to flush L1D for L1TF is tracked by two bits: one per-CPU and one per-vCPU. The per-vCPU bit is always set when the vCPU shows up on a core, so there is no interesting state that's truly per-vCPU. Indeed, this is a requirement, since L1D is a part of the physical CPU. So simplify this by combining the two bits. The vCPU bit was being written from preemption-enabled regions. To play nice with those cases, wrap all calls from KVM and use a raw write so that request a flush with preemption enabled doesn't trigger what would effectively be DEBUG_PREEMPT false positives. Preemption doesn't need to be disabled, as kvm_arch_vcpu_load() will mark the new CPU as needing a flush if the vCPU task is migrated, or if userspace runs the vCPU on a different task. Signed-off-by: Brendan Jackman <jackmanb@google.com> [sean: put raw write in KVM instead of in a hardirq.h variant] Link: https://patch.msgid.link/20251113233746.1703361-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-18 16:22:45 -08:00
Chao Gao	11d984633f	KVM: x86: Allocate/free user_return_msrs at kvm.ko (un)loading time Move user_return_msrs allocation/free from vendor modules (kvm-intel.ko and kvm-amd.ko) (un)loading time to kvm.ko's to make it less risky to access user_return_msrs in kvm.ko. Tying the lifetime of user_return_msrs to vendor modules makes every access to user_return_msrs prone to use-after-free issues as vendor modules may be unloaded at any time. Opportunistically turn the per-CPU variable into full structs, as there's no practical difference between statically allocating the memory and allocating it unconditionally during module_init(). Zero out kvm_nr_uret_msrs on vendor module exit to further minimize the chances of consuming stale data, and WARN on vendor module load if KVM thinks there are existing user-return MSRs. Note! The user-return MSRs also need to be "destroyed" if ops->hardware_setup() fails, as both SVM and VMX expect common KVM to clean up (because common code, not vendor code, is responsible for kvm_nr_uret_msrs). Signed-off-by: Chao Gao <chao.gao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20251108013601.902918-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-18 15:53:54 -08:00
Lei Chen	e78fb96b41	KVM: x86: remove comment about ntp correction sync for Since vcpu local clock is no longer affected by ntp, remove comment about ntp correction sync for function kvm_gen_kvmclock_update. Signed-off-by: Lei Chen <lei.chen@smartx.com> Link: https://patch.msgid.link/20250819152027.1687487-4-lei.chen@smartx.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-17 07:50:25 -08:00
Lei Chen	446fcce2a5	Revert "x86: kvm: rate-limit global clock updates" This reverts commit `7e44e4495a`. Commit `7e44e4495a` ("x86: kvm: rate-limit global clock updates") intends to use a kvmclock_update_work to sync ntp corretion across all vcpus kvmclock, which is based on commit `0061d53daf` ("KVM: x86: limit difference between kvmclock updates") Since kvmclock has been switched to mono raw, this commit can be reverted. Signed-off-by: Lei Chen <lei.chen@smartx.com> Link: https://patch.msgid.link/20250819152027.1687487-3-lei.chen@smartx.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-17 07:50:24 -08:00
Lei Chen	43ddbf16ed	Revert "x86: kvm: introduce periodic global clock updates" This reverts commit `332967a3ea`. Commit `332967a3ea` ("x86: kvm: introduce periodic global clock updates") introduced a 300s interval work to sync ntp corrections across all vcpus. Since commit `53fafdbb8b` ("KVM: x86: switch KVMCLOCK base to monotonic raw clock"), kvmclock switched to mono raw clock, we can no longer take ntp into consideration. Signed-off-by: Lei Chen <lei.chen@smartx.com> Link: https://patch.msgid.link/20250819152027.1687487-2-lei.chen@smartx.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-17 07:50:23 -08:00
Omar Sandoval	4da3768e18	KVM: SVM: Don't skip unrelated instruction if INT3/INTO is replaced When re-injecting a soft interrupt from an INT3, INT0, or (select) INTn instruction, discard the exception and retry the instruction if the code stream is changed (e.g. by a different vCPU) between when the CPU executes the instruction and when KVM decodes the instruction to get the next RIP. As effectively predicted by commit `6ef88d6e36` ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction"), failure to verify that the correct INTn instruction was decoded can effectively clobber guest state due to decoding the wrong instruction and thus specifying the wrong next RIP. The bug most often manifests as "Oops: int3" panics on static branch checks in Linux guests. Enabling or disabling a static branch in Linux uses the kernel's "text poke" code patching mechanism. To modify code while other CPUs may be executing that code, Linux (temporarily) replaces the first byte of the original instruction with an int3 (opcode 0xcc), then patches in the new code stream except for the first byte, and finally replaces the int3 with the first byte of the new code stream. If a CPU hits the int3, i.e. executes the code while it's being modified, then the guest kernel must look up the RIP to determine how to handle the #BP, e.g. by emulating the new instruction. If the RIP is incorrect, then this lookup fails and the guest kernel panics. The bug reproduces almost instantly by hacking the guest kernel to repeatedly check a static branch[1] while running a drgn script[2] on the host to constantly swap out the memory containing the guest's TSS. [1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a [2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b Fixes: `6ef88d6e36` ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction") Cc: stable@vger.kernel.org Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Link: https://patch.msgid.link/1cc6dcdf36e3add7ee7c8d90ad58414eeb6c3d34.1762278762.git.osandov@fb.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-13 13:03:19 -08:00
Hou Wenlong	995d504100	KVM: x86: Don't disable IRQs when unregistering user-return notifier Remove the code to disable IRQs when unregistering KVM's user-return notifier now that KVM doesn't invoke kvm_on_user_return() when disabling virtualization via IPI function call, i.e. now that there's no need to guard against re-entrancy via IPI callback. Note, disabling IRQs has largely been unnecessary since commit `a377ac1cd9` ("x86/entry: Move user return notifier out of loop") moved fire_user_return_notifiers() into the section with IRQs disabled. In doing so, the commit somewhat inadvertently fixed the underlying issue that was papered over by commit `1650b4ebc9` ("KVM: Disable irq while unregistering user notifier"). I.e. in practice, the code and comment has been stale since commit `a377ac1cd9`. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> [sean: rewrite changelog after rebasing, drop lockdep assert] Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030191528.3380553-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-07 10:59:47 -08:00
Sean Christopherson	2baa33a8dd	KVM: x86: Leave user-return notifier registered on reboot/shutdown Leave KVM's user-return notifier registered in the unlikely case that the notifier is registered when disabling virtualization via IPI callback in response to reboot/shutdown. On reboot/shutdown, keeping the notifier registered is ok as far as MSR state is concerned (arguably better then restoring MSRs at an unknown point in time), as the callback will run cleanly and restore host MSRs if the CPU manages to return to userspace before the system goes down. The only wrinkle is that if kvm.ko module unload manages to race with reboot/shutdown, then leaving the notifier registered could lead to use-after-free due to calling into unloaded kvm.ko module code. But such a race is only possible on --forced reboot/shutdown, because otherwise userspace tasks would be frozen before kvm_shutdown() is called, i.e. on a "normal" reboot/shutdown, it should be impossible for the CPU to return to userspace after kvm_shutdown(). Furthermore, on a --forced reboot/shutdown, unregistering the user-return hook from IRQ context doesn't fully guard against use-after-free, because KVM could immediately re-register the hook, e.g. if the IRQ arrives before kvm_user_return_register_notifier() is called. Rather than trying to guard against the IPI in the "normal" user-return code, which is difficult and noisy, simply leave the user-return notifier registered on a reboot, and bump the kvm.ko module refcount to defend against a use-after-free due to kvm.ko unload racing against reboot. Alternatively, KVM could allow kvm.ko and try to drop the notifiers during kvm_x86_exit(), but that's also a can of worms as registration is per-CPU, and so KVM would need to blast an IPI, and doing so while a reboot/shutdown is in-progress is far risky than preventing userspace from unloading KVM. Link: https://patch.msgid.link/20251030191528.3380553-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-07 10:59:46 -08:00
Sean Christopherson	b371174d2f	KVM: x86: WARN if user-return MSR notifier is registered on exit When freeing the per-CPU user-return MSRs structures, WARN if any CPU has a registered notifier to help detect and/or debug potential use-after-free issues. The lifecycle of the notifiers is rather convoluted, and has several non-obvious paths where notifiers are unregistered, i.e. isn't exactly the most robust code possible. The notifiers they are registered on-demand in KVM, on the first WRMSR to a tracked register. _Usually_ the notifier is unregistered whenever the CPU returns to userspace. But because any given CPU isn't guaranteed to return to userspace, e.g. the CPU could be offlined before doing so, KVM also "drops", a.k.a. unregisters, the notifiers when virtualization is disabled on the CPU. Further complicating the unregister path is the fact that the calls to disable virtualization come from common KVM, and the per-CPU calls are guarded by a per-CPU flag (to harden _that_ code against bugs, e.g. due to mishandling reboot). Reboot/shutdown in particular is problematic, as KVM disables virtualization via IPI function call, i.e. from IRQ context, instead of using the cpuhp framework, which runs in task context. I.e. on reboot/shutdown, drop_user_return_notifiers() is called asynchronously. Forced reboot/shutdown is the most problematic scenario, as userspace tasks are not frozen before kvm_shutdown() is invoked, i.e. KVM could be actively manipulating the user-return MSR lists and/or notifiers when the IPI arrives. To a certain extent, all bets are off when userspace forces a reboot/shutdown, but KVM should at least avoid a use-after-free, e.g. to avoid crashing the kernel when trying to reboot. Link: https://patch.msgid.link/20251030191528.3380553-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-07 10:59:46 -08:00
Sean Christopherson	c0711f8c61	KVM: TDX: Explicitly set user-return MSRs that may be clobbered by the TDX-Module Set all user-return MSRs to their post-TD-exit value when preparing to run a TDX vCPU to ensure the value that KVM expects to be loaded after running the vCPU is indeed the value that's loaded in hardware. If the TDX-Module doesn't actually enter the guest, i.e. doesn't do VM-Enter, then it won't "restore" VMM state, i.e. won't clobber user-return MSRs to their expected post-run values, in which case simply updating KVM's "cached" value will effectively corrupt the cache due to hardware still holding the original value. In theory, KVM could conditionally update the current user-return value if and only if tdh_vp_enter() succeeds, but in practice "success" doesn't guarantee the TDX-Module actually entered the guest, e.g. if the TDX-Module synthesizes an EPT Violation because it suspects a zero-step attack. Force-load the expected values instead of trying to decipher whether or not the TDX-Module restored/clobbered MSRs, as the risk doesn't justify the benefits. Effectively avoiding four WRMSRs once per run loop (even if the vCPU is scheduled out, user-return MSRs only need to be reloaded if the CPU exits to userspace or runs a non-TDX vCPU) is likely in the noise when amortized over all entries, given the cost of running a TDX vCPU. E.g. the cost of the WRMSRs is somewhere between ~300 and ~500 cycles, whereas the cost of a _single_ roundtrip to/from a TDX guest is thousands of cycles. Fixes: `e0b4f31a3c` ("KVM: TDX: restore user ret MSRs") Cc: stable@vger.kernel.org Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://patch.msgid.link/20251030191528.3380553-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-07 10:59:45 -08:00
Maxim Levitsky	ab4e41eb9f	KVM: x86: Don't clear async #PF queue when CR0.PG is disabled (e.g. on #SMI) Fix an interaction between SMM and PV asynchronous #PFs where an #SMI can cause KVM to drop an async #PF ready event, and thus result in guest tasks becoming permanently stuck due to the task that encountered the #PF never being resumed. Specifically, don't clear the completion queue when paging is disabled, and re-check for completed async #PFs if/when paging is enabled. Prior to commit `2635b5c4a0` ("KVM: x86: interrupt based APF 'page ready' event delivery"), flushing the APF queue without notifying the guest of completed APF requests when paging is disabled was "necessary", in that delivering a #PF to the guest when paging is disabled would likely confuse and/or crash the guest. And presumably the original async #PF development assumed that a guest would only disable paging when there was no intent to ever re-enable paging. That assumption fails in several scenarios, most visibly on an emulated SMI, as entering SMM always disables CR0.PG (i.e. initially runs with paging disabled). When the SMM handler eventually executes RSM, the interrupted paging-enabled is restored, and the async #PF event is lost. Similarly, invoking firmware, e.g. via EFI runtime calls, might require a transition through paging modes and thus also disable paging with valid entries in the competion queue. To avoid dropping completion events, drop the "clear" entirely, and handle paging-enable transitions in the same way KVM already handles APIC enable/disable events: if a vCPU's APIC is disabled, APF completion events are not kept pending and not injected while APIC is disabled. Once a vCPU's APIC is re-enabled, KVM raises KVM_REQ_APF_READY so that the vCPU recognizes any pending pending #APF ready events. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251015033258.50974-4-mlevitsk@redhat.com [sean: rework changelog to call out #PF injection, drop "real mode" references, expand the code comment] Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-07 10:59:44 -08:00
Maxim Levitsky	68c35f89d0	KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued() Fix a semi theoretical race condition related to a lack of memory barriers when dealing with vcpu->arch.apf.pageready_pending. In theory, the "ready" side could see a stale pageready_pending and neglect to kick the vCPU, and thus allow the vCPU to enter the guest with a pending KVM_REQ_APF_READY and no kick/IPI on the way, in which case the KVM would fail to deliver a completed async #PF event to the guest in a timely manner as the request would be recognized only on the next (coincidental) VM-Exit. kvm_arch_async_page_present_queued() running in workqueue context: kvm_make_request(KVM_REQ_APF_READY, vcpu); /* memory barrier is missing here/ if (!vcpu->arch.apf.pageready_pending) kvm_vcpu_kick(vcpu); kvm_set_msr_common() running in task context: vcpu->arch.apf.pageready_pending = false; / memory barrier is missing here*/ And later, vcpu_enter_guest() running in task context: if (kvm_check_request(KVM_REQ_APF_READY, vcpu)) kvm_check_async_pf_completion(vcpu) Add missing full memory barriers in both cases to avoid theoretical case of not kicking the vCPU thread. Note that the bug is mostly theoretical because kvm_make_request() uses an atomic operation, which is always serializing on x86, requiring only for documentation purposes the smp_mb__after_atomic() after it (smp_mb__after_atomic() is a NOP on x86). The second missing barrier, between kvm_set_msr_common() and vcpu_enter_guest(), isn't strictly needed because KVM executes several barriers in between calling these functions, however it still makes sense to have an explicit barrier to be on the safe side and to document the ordering dependencies. Finally, also use READ_ONCE/WRITE_ONCE. Thanks a lot to Paolo for the help with this patch. Link: https://lore.kernel.org/all/7c7a5a75-a786-4a05-a836-4368582ca4c2@redhat.com Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://patch.msgid.link/20251015033258.50974-3-mlevitsk@redhat.com [sean: explain the race and its impact in more detail] Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-07 10:59:43 -08:00
Sean Christopherson	94428e3ba3	KVM: TDX: Convert INIT_MEM_REGION and INIT_VCPU to "unlocked" vCPU ioctl Handle the KVM_TDX_INIT_MEM_REGION and KVM_TDX_INIT_VCPU vCPU sub-ioctls in the unlocked variant, i.e. outside of vcpu->mutex, in anticipation of taking kvm->lock along with all other vCPU mutexes, at which point the sub-ioctls _must_ start without vcpu->mutex held. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-05 11:17:30 -08:00
Sean Christopherson	50efc2340a	KVM: Rename kvm_arch_vcpu_async_ioctl() to kvm_arch_vcpu_unlocked_ioctl() Rename the "async" ioctl API to "unlocked" so that upcoming usage in x86's TDX code doesn't result in a massive misnomer. To avoid having to retry SEAMCALLs, TDX needs to acquire kvm->lock and all vcpu->mutex locks, and acquiring all of those locks after/inside the current vCPU's mutex is a non-starter. However, TDX also needs to acquire the vCPU's mutex and load the vCPU, i.e. the handling is very much not async to the vCPU. No functional change intended. Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-05 11:03:11 -08:00

1 2 3 4 5 ...

3243 Commits