mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-09-04 20:19:47 +08:00
ef901b38d3
2866 Commits
Author | SHA1 | Message | Date
---|---|---|---
6231c9e1a9 |
KVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpu
kvm_vcpu_ioctl_x86_set_vcpu_events() routine makes 'KVM_REQ_NMI'
request for a vcpu even when its 'events->nmi.pending' is zero.
Ex:
qemu_thread_start
kvm_vcpu_thread_fn
qemu_wait_io_event
qemu_wait_io_event_common
process_queued_cpu_work
do_kvm_cpu_synchronize_post_init/_reset
kvm_arch_put_registers
kvm_put_vcpu_events (cpu, level=[2|3])
This leads vCPU threads in QEMU to constantly acquire & release the
global mutex lock, delaying the guest boot due to lock contention.
Add a check to make the KVM_REQ_NMI request only if the vCPU actually has an NMI pending.
Fixes:
|
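A minimal sketch of the shape of the fix in kvm_vcpu_ioctl_x86_set_vcpu_events(), using the existing uAPI field names; the surrounding context is elided and details may differ from the final patch:

```c
/*
 * Sketch: only kick the vCPU with KVM_REQ_NMI when userspace actually queued
 * an NMI, instead of unconditionally on every KVM_SET_VCPU_EVENTS call.
 */
if (events->flags & KVM_VCPUEVENT_VALID_NMI_PENDING) {
	vcpu->arch.nmi_pending = 0;
	atomic_set(&vcpu->arch.nmi_queued, events->nmi.pending);
	if (events->nmi.pending)
		kvm_make_request(KVM_REQ_NMI, vcpu);
}
```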
||
7bb7fce136 |
KVM: x86/pmu: Prioritize VMX interception over #GP on RDPMC due to bad index
Apply the pre-intercepts RDPMC validity check only to AMD, and rename all
relevant functions to make it as clear as possible that the check is not a
standard PMC index check. On Intel, the basic rule is that only invalid
opcodes and privilege/permission/mode checks have priority over VM-Exit,
i.e. RDPMC with an invalid index should VM-Exit, not #GP. While the SDM
doesn't explicitly call out RDPMC, it _does_ explicitly use RDMSR of a
non-existent MSR as an example where VM-Exit has priority over #GP, and
RDPMC is effectively just a variation of RDMSR.
Manually testing on various Intel CPUs confirms this behavior, and the
inverted priority was introduced for SVM compatibility, i.e. was not an
intentional change for Intel PMUs. On AMD, *all* exceptions on RDPMC have
priority over VM-Exit.
Check for a NULL kvm_pmu_ops.check_rdpmc_early instead of using a RET0
static call so as to provide a convenient location to document the
difference between Intel and AMD, and to again try to make it as obvious
as possible that the early check is a one-off thing, not a generic "is
this PMC valid?" helper.
Fixes:
|
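The message above names the kvm_pmu_ops.check_rdpmc_early hook; a hedged sketch of what such a NULL-check wrapper could look like (the wrapper name and static_call key are assumptions following KVM's usual KVM_X86_PMU_OP conventions):

```c
/*
 * Hypothetical wrapper: only AMD fills in .check_rdpmc_early, so a NULL hook
 * means "no pre-intercept check", i.e. on Intel the RDPMC VM-Exit takes
 * priority over #GP for a bad index.
 */
static int kvm_pmu_check_rdpmc_early(struct kvm_vcpu *vcpu, unsigned int idx)
{
	if (!kvm_pmu_ops.check_rdpmc_early)
		return 0;

	return static_call(kvm_x86_pmu_check_rdpmc_early)(vcpu, idx);
}
```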
||
955997e880 |
KVM: x86: Use mutex guards to eliminate __kvm_x86_vendor_init()
Use the recently introduced guard(mutex) infrastructure to acquire and automatically release vendor_module_lock when the guard goes out of scope. Drop the inner __kvm_x86_vendor_init(); its sole purpose was to simplify releasing vendor_module_lock in error paths. No functional change intended.
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20231030141728.1406118-1-nik.borisov@suse.com
[sean: rewrite changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com> |
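For readers unfamiliar with the <linux/cleanup.h> helpers, a minimal sketch of the locking pattern being adopted, with the function body elided:

```c
/*
 * Sketch of the guard(mutex) pattern: the mutex is unlocked automatically
 * when the guard variable goes out of scope, so every early "return -Exxx"
 * in an error path also releases vendor_module_lock, removing the need for
 * a separate __kvm_x86_vendor_init() inner helper.
 */
int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
{
	guard(mutex)(&vendor_module_lock);

	/* ... validation and setup; any early return below drops the lock ... */

	return 0;
}
```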
||
09d1c6a80f |
Generic:
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
- Clean up Kconfigs that all KVM architectures were selecting.
- New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory.
- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential VMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM).
x86:
- Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB.
- Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE.
- Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer.
- Let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set.
- Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV support.
- On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs.
- Support for virtualizing Linear Address Masking (LAM).
- Fix a variety of vPMU bugs where KVM fails to stop/reset counters and other state prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow.
- Turn off KVM_WERROR by default for all configs so that it's not inadvertently enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds.
- Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features".
- Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds.
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time.
ARM64:
- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree.
- Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree.
- Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargeting the NV support to that version of the architecture.
- A small set of vgic fixes and associated cleanups.
Loongarch:
- Optimization for memslot hugepage checking.
- Cleanup and fix some HW/SW timer issues.
- Add LSX/LASX (128bit/256bit SIMD) support.
RISC-V:
- KVM_GET_REG_LIST improvement for vector registers.
- Generate ISA extension reg_list using macros in get-reg-list selftest.
- Support for reporting steal time along with selftest.
s390:
- Bugfixes.
Selftests:
- Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test.
- Fix build errors when a header is deleted/moved due to a missing flag in the Makefile.
- Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed.
- Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation.
There are two non-KVM patches buried in the middle of guest_memfd support:
- fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
- mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
The first is small and mostly suggested-by Christian Brauner; the second a bit less so, but it was written by an mm person (Vlastimil Babka).
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini.
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
  x86/kvm: Do not try to disable kvmclock if it was not enabled
  KVM: x86: add missing "depends on KVM"
  KVM: fix direction of dependency on MMU notifiers
  KVM: introduce CONFIG_KVM_COMMON
  KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
  KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
  RISC-V: KVM: selftests: Add get-reg-list test for STA registers
  RISC-V: KVM: selftests: Add steal_time test support
  RISC-V: KVM: selftests: Add guest_sbi_probe_extension
  RISC-V: KVM: selftests: Move sbi_ecall to processor.c
  RISC-V: KVM: Implement SBI STA extension
  RISC-V: KVM: Add support for SBI STA registers
  RISC-V: KVM: Add support for SBI extension registers
  RISC-V: KVM: Add SBI STA info to vcpu_arch
  RISC-V: KVM: Add steal-update vcpu request
  RISC-V: KVM: Add SBI STA extension skeleton
  RISC-V: paravirt: Implement steal-time support
  RISC-V: Add SBI STA extension definitions
  RISC-V: paravirt: Add skeleton for pv-time support
  RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
  ... |
||
b51cc5d028 |
x86/cleanups changes for v6.8:
- A micro-optimization got misplaced as a cleanup:
  - Micro-optimize the asm code in secondary_startup_64_no_verify()
- Change global variables to local
- Add missing kernel-doc function parameter descriptions
- Remove unused parameter from a macro
- Remove obsolete Kconfig entry
- Fix comments
- Fix typos, mostly scripted, manually reviewed
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar.
* tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  arch/x86: Fix typos
  x86/head_64: Use TESTB instead of TESTL in secondary_startup_64_no_verify()
  x86/docs: Remove reference to syscall trampoline in PTI
  x86/Kconfig: Remove obsolete config X86_32_SMP
  x86/io: Remove the unused 'bw' parameter from the BUILDIO() macro
  x86/mtrr: Document missing function parameters in kernel-doc
  x86/setup: Make relocated_ramdisk a local variable of relocate_initrd() |
||
3115d2de39 |
KVM Xen change for 6.8:
To work around Xen guests that don't expect Xen PV clocks to be marked as being based on a stable TSC, add a Xen config knob to allow userspace to opt out of KVM setting the "TSC stable" bit in Xen PV clocks. Note, the "TSC stable" bit was added to the PVCLOCK ABI by KVM without an ack from Xen, i.e. KVM isn't entirely blameless for the buggy guest behavior.
Merge tag 'kvm-x86-xen-6.8' of https://github.com/kvm-x86/linux into HEAD |
||
8ecb10bcbf |
KVM x86 support for virtualizing Linear Address Masking (LAM)
Add KVM support for Linear Address Masking (LAM). LAM tweaks the canonicality checks for most virtual address usage in 64-bit mode, such that only the most significant bit of the untranslated address bits must match the polarity of the last translated address bit. This allows software to use ignored, untranslated address bits for metadata, e.g. to efficiently tag pointers for address sanitization.
LAM can be enabled separately for user pointers and supervisor pointers, and for userspace LAM can select between 48-bit and 57-bit masking:
- 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15.
- 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6.
For user pointers, LAM enabling utilizes two previously-reserved high bits from CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and 61 respectively. Note, if LAM_U57 is set, LAM_U48 is ignored, i.e.:
- CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers
- CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers
- CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers
For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.:
- CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers
The modified LAM canonicality checks:
- LAM_S48                : [ 1 ][ metadata ][ 1 ]      (bits 63 .. 47)
- LAM_U48                : [ 0 ][ metadata ][ 0 ]      (bits 63 .. 47)
- LAM_S57                : [ 1 ][ metadata ][ 1 ]      (bits 63 .. 56)
- LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]      (bits 63 .. 56)
- LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]  (bits 63 .. 56..47)
The bulk of KVM support for LAM is to emulate LAM's modified canonicality checks. The approach taken by KVM is to "fill" the metadata bits using the highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended to bits 62:48. The most significant bit, 63, is *not* modified, i.e. its value from the raw, untagged virtual address is kept for the canonicality check.
Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26]), enabling via CR3 and CR4 bits, etc.
Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEAD |
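To make the sign-extension rule above concrete, here is a small standalone model of the untagging math (illustration only, not KVM's implementation; the function and variable names are invented):

```c
/* lam_untag.c - standalone model of LAM untagging: fill the metadata bits
 * with copies of the last translated bit, leaving bit 63 untouched.
 * Build: cc -o lam_untag lam_untag.c
 */
#include <stdint.h>
#include <stdio.h>

/* lam_width: 15 for LAM-48 (metadata bits 62:48), 6 for LAM-57 (62:57). */
static uint64_t lam_untag(uint64_t va, int lam_width)
{
	int sign_bit = 63 - lam_width - 1;              /* 47 or 56 */
	uint64_t meta_mask = ((1ULL << lam_width) - 1) << (sign_bit + 1);

	/* Replace the metadata bits with the sign bit; bit 63 is deliberately
	 * left alone so a mismatched bit 63 still fails canonicality. */
	va &= ~meta_mask;
	if ((va >> sign_bit) & 1)
		va |= meta_mask;
	return va;
}

int main(void)
{
	/* Canonical user VA with some LAM-48 metadata stuffed into bits 62:48. */
	uint64_t tagged = (0x00ABull << 48) | 0x00007f1234567000ull;

	printf("tagged   = %#018llx\n", (unsigned long long)tagged);
	printf("untagged = %#018llx\n", (unsigned long long)lam_untag(tagged, 15));
	return 0;
}
```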
||
01edb1cfbd |
KVM x86 PMU changes for 6.8:
- Fix a variety of bugs where KVM fails to stop/reset counters and other state prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow.
Merge tag 'kvm-x86-pmu-6.8' of https://github.com/kvm-x86/linux into HEAD |
||
33d0403fda |
KVM x86 misc changes for 6.8:
- Turn off KVM_WERROR by default for all configs so that it's not inadvertently enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds.
- Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features".
- Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds.
Merge tag 'kvm-x86-misc-6.8' of https://github.com/kvm-x86/linux into HEAD |
||
0afdfd85e3 |
KVM x86 Hyper-V changes for 6.8:
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time.
Merge tag 'kvm-x86-hyperv-6.8' of https://github.com/kvm-x86/linux into HEAD |
||
54aa699e80 |
arch/x86: Fix typos
Fix typos, most reported by "codespell arch/x86". Only touches comments, no code changes. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org |
||
136292522e |
LoongArch KVM changes for v6.8
1. Optimization for memslot hugepage checking.
2. Cleanup and fix some HW/SW timer issues.
3. Add LSX/LASX (128bit/256bit SIMD) support.
Merge tag 'loongarch-kvm-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD |
||
6254eebad4 |
KVM fixes for 6.7-rcN:
- When checking if a _running_ vCPU is "in-kernel", i.e. running at CPL0, get the CPL directly instead of relying on preempted_in_kernel, which is valid if and only if the vCPU was preempted, i.e. NOT running.
- Set .owner for various KVM file_operations so that files refcount the KVM module until KVM is done executing _all_ code, including the last few instructions of kvm_put_kvm(). And then revert the misguided attempt to rely on "struct kvm" refcounts to pin KVM-the-module.
- Fix a benign "return void" that was recently introduced.
Merge tag 'kvm-x86-fixes-6.7-rcN' of https://github.com/kvm-x86/linux into kvm-master |
||
6d72283526 |
KVM x86/xen: add an override for PVCLOCK_TSC_STABLE_BIT
Unless explicitly told to do so (by passing 'clocksource=tsc' and 'tsc=stable:socket', and then jumping through some hoops concerning potential CPU hotplug) Xen will never use TSC as its clocksource. Hence, by default, a Xen guest will not see PVCLOCK_TSC_STABLE_BIT set in either the primary or secondary pvclock memory areas. This has led to bugs in some guest kernels which only become evident if PVCLOCK_TSC_STABLE_BIT *is* set in the pvclocks. Hence, to support such guests, give the VMM a new Xen HVM config flag to tell KVM to forcibly clear the bit in the Xen pvclocks. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20231102162128.2353459-1-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com> |
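A hedged sketch of how a VMM might use the new knob from userspace; the flag name follows this commit's description, the uAPI details may differ, and error handling and capability checks are omitted:

```c
/* xen_tsc_unstable.c - sketch only: ask KVM to clear the "TSC stable" bit in
 * the Xen pvclock areas so the guest sees what it would see on real Xen.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	int vm = ioctl(kvm, KVM_CREATE_VM, 0);

	struct kvm_xen_hvm_config cfg = {
		/* Assumed flag name from this series' uAPI. */
		.flags = KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE,
	};

	if (ioctl(vm, KVM_XEN_HVM_CONFIG, &cfg))
		perror("KVM_XEN_HVM_CONFIG");
	return 0;
}
```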
||
b4f69df0f6 |
KVM: x86: Make Hyper-V emulation optional
Hyper-V emulation in KVM is a fairly big chunk and in some cases it may be desirable to not compile it in to reduce module sizes as well as the attack surface. Introduce CONFIG_KVM_HYPERV option to make it possible. Note, there's room for further nVMX/nSVM code optimizations when !CONFIG_KVM_HYPERV, this will be done in follow-up patches. Reorganize Makefile a bit so all CONFIG_HYPERV and CONFIG_KVM_HYPERV files are grouped together. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Link: https://lore.kernel.org/r/20231205103630.1391318-13-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
cfef5af3cb |
KVM: x86: Move Hyper-V partition assist page out of Hyper-V emulation context
The Hyper-V partition assist page is used when KVM runs on top of Hyper-V and is not used for Windows/Hyper-V guests on KVM; this means that the 'hv_pa_pg' placement in 'struct kvm_hv' is unfortunate. As a preparation to making Hyper-V emulation optional, move 'hv_pa_pg' to 'struct kvm_arch' and put it under CONFIG_HYPERV. While at it, introduce the hv_get_partition_assist_page() helper to allocate the partition assist page. Move the comment explaining why we use a single page for all vCPUs from VMX and expand it a bit. No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-3-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
ef8d89033c |
KVM: x86: Remove 'return void' expression for 'void function'
The requested info will be stored in 'guest_xsave->region' referenced by the incoming pointer "struct kvm_xsave *guest_xsave", thus there is no need to explicitly use a 'return void' expression for the void function "static void kvm_vcpu_ioctl_x86_get_xsave(...)". The issue is caught with [-Wpedantic].
Fixes: 2d287ec65e79 ("x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer")
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231007064019.17472-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
f2f63f7ec6 |
KVM: x86/pmu: Stop calling kvm_pmu_reset() at RESET (it's redundant)
Drop kvm_vcpu_reset()'s call to kvm_pmu_reset(), the call is performed only for RESET, which is really just the same thing as vCPU creation, and kvm_arch_vcpu_create() *just* called kvm_pmu_init(), i.e. there can't possibly be any work to do. Unlike Intel, AMD's amd_pmu_refresh() does fill all_valid_pmc_idx even if guest CPUID is empty, but everything that is at all dynamic is guaranteed to be '0'/NULL, e.g. it should be impossible for KVM to have already created a perf event. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20231103230541.352265-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
c52ffadc65 |
KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, e.g. when userspace hotplugs a pre-created vCPU into the VM. Unnecessarily updating the masterclock is undesirable as it can cause kvmclock's time to jump, which is particularly painful on systems with a stable TSC as kvmclock _should_ be fully reliable on such systems. The unexpected time jumps are due to differences in the TSC=>nanoseconds conversion algorithms between kvmclock and the host's CLOCK_MONOTONIC_RAW (the pvclock algorithm is inherently lossy). When updating the masterclock, KVM refreshes the "base", i.e. moves the elapsed time since the last update from the kvmclock/pvclock algorithm to the CLOCK_MONOTONIC_RAW algorithm. Synchronizing kvmclock with CLOCK_MONOTONIC_RAW is the lesser of evils when the TSC is unstable, but adds no real value when the TSC is stable. Prior to commit |
||
547c91929f |
KVM: x86: Get CPL directly when checking if loaded vCPU is in kernel mode
When querying whether or not a vCPU "is" running in kernel mode, directly
get the CPL if the vCPU is the currently loaded vCPU. In scenarios where
a guest is profiled via perf-kvm, querying vcpu->arch.preempted_in_kernel
from kvm_guest_state() is wrong if vCPU is actively running, i.e. isn't
scheduled out due to being preempted and so preempted_in_kernel is stale.
This affects perf/core's ability to accurately tag guest RIP with
PERF_RECORD_MISC_GUEST_{KERNEL|USER} and record it in the sample. This
causes perf/tool to fail to connect the vCPU RIPs to the guest kernel
space symbols when parsing these samples due to incorrect PERF_RECORD_MISC
flags:
Before (perf-report of a cpu-cycles sample):
1.23% :58945 [unknown] [u] 0xffffffff818012e0
After:
1.35% :60703 [kernel.vmlinux] [g] asm_exc_page_fault
Note, checking preempted_in_kernel in kvm_arch_vcpu_in_kernel() is awful
as nothing in the API suggests that it's safe to use if and only if the
vCPU was preempted. That can be cleaned up in the future, for now just
fix the glaring correctness bug.
Note #2, checking vcpu->preempted is NOT safe, as getting the CPL on VMX
requires VMREAD, i.e. is correct if and only if the vCPU is loaded. If
the target vCPU *was* preempted, then it can be scheduled back in after
the check on vcpu->preempted in kvm_vcpu_on_spin(), i.e. KVM could end up
trying to do VMREAD on a VMCS that isn't loaded on the current pCPU.
Signed-off-by: Like Xu <likexu@tencent.com>
Fixes:
|
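Roughly the shape of the resulting kvm_arch_vcpu_in_kernel() logic, assuming the helpers named in the message; exact guards (e.g. for protected guest state) may differ:

```c
/*
 * Sketch: read the CPL directly only when this vCPU is the currently loaded
 * vCPU (VMREAD is only valid then); otherwise fall back to the snapshot taken
 * at preemption time.
 */
bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
{
	if (vcpu != kvm_get_running_vcpu())
		return vcpu->arch.preempted_in_kernel;

	return static_call(kvm_x86_get_cpl)(vcpu) == 0;
}
```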
||
b39bd520a6 |
KVM: x86: Untag addresses for LAM emulation where applicable
Stub in vmx_get_untagged_addr() and wire up calls from the emulator (via get_untagged_addr()) and "direct" calls from various VM-Exit handlers in VMX where LAM untagging is supposed to be applied. Defer implementing the guts of vmx_get_untagged_addr() to future patches purely to make the changes easier to consume. LAM is active only for 64-bit linear addresses, and several types of accesses are exempted.
- Cases that need to untag the address (handled in get_vmx_mem_address()):
  Operand(s) of VMX instructions and INVPCID.
  Operand(s) of SGX ENCLS.
- Cases LAM doesn't apply to (no change needed):
  Operand of INVLPG.
  Linear address in INVPCID descriptor.
  Linear address in INVVPID descriptor.
  BASEADDR specified in SECS of ECREATE.
Note:
- LAM doesn't apply to writes to control registers or MSRs.
- LAM masking is applied before walking page tables, i.e. the faulting linear address in CR2 doesn't contain the metadata.
- The guest linear address saved in the VMCS doesn't contain metadata.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-10-binbin.wu@linux.intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
37a41847b7 |
KVM: x86: Introduce get_untagged_addr() in kvm_x86_ops and call it in emulator
Introduce a new interface, get_untagged_addr(), to kvm_x86_ops to untag the metadata from a linear address. Call the interface in the linearization step of the instruction emulator for 64-bit mode. When a feature like Intel Linear Address Masking (LAM) or AMD Upper Address Ignore (UAI) is enabled, linear addresses may be tagged with metadata that needs to be dropped prior to canonicality checks, i.e. the metadata is ignored. Introduce get_untagged_addr() to kvm_x86_ops to hide the vendor-specific code, as sadly LAM and UAI have different semantics. Pass the emulator flags to allow the vendor-specific implementation to precisely identify the access type (LAM doesn't untag certain accesses).
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-9-binbin.wu@linux.intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
2c49db455e |
KVM: x86: Add & use kvm_vcpu_is_legal_cr3() to check CR3's legality
Add and use kvm_vcpu_is_legal_cr3() to check CR3's legality to provide a clear distinction between CR3 and GPA checks. This will allow exempting bits from kvm_vcpu_is_legal_cr3() without affecting general GPA checks, e.g. for upcoming features that will use high bits in CR3 for feature enabling. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-7-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
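A sketch of the new helper as described: at this point in the series it is a thin wrapper over the existing GPA check, with feature bits (e.g. LAM's CR3.LAM_U48/LAM_U57) exempted by later patches; the exact header placement is elided:

```c
/*
 * Sketch: CR3 legality is currently just a GPA-legality check, but having a
 * dedicated helper gives later patches a single place to strip feature bits
 * that live in the high bits of CR3 before the GPA check.
 */
static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
{
	return kvm_vcpu_is_legal_gpa(vcpu, cr3);
}
```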
||
89ea60c2c7 |
KVM: x86: Add support for "protected VMs" that can utilize private memory
Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development and testing vehicle for Confidential (CoCo) VMs, and potentially to even become a "real" product in the distant future, e.g. a la pKVM. The private memory support in KVM x86 is aimed at AMD's SEV-SNP and Intel's TDX, but those technologies are extremely complex (understatement), difficult to debug, don't support running as nested guests, and require hardware that isn't universally accessible. I.e. relying on SEV-SNP or TDX for maintaining guest private memory isn't a realistic option. At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of selftests for guest_memfd and private memory support without requiring unique hardware.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20231027182217.3615211-24-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
eed52e434b |
KVM: Allow arch code to track number of memslot address spaces per VM
Let x86 track the number of address spaces on a per-VM basis so that KVM can disallow SMM memslots for confidential VMs. Confidential VMs are fundamentally incompatible with emulating SMM, which as the name suggests requires being able to read and write guest memory and register state. Disallowing SMM will simplify support for guest private memory, as KVM will not need to worry about tracking memory attributes for multiple address spaces (SMM is the only "non-default" address space across all architectures).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-23-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
90b4fe1798 |
KVM: x86: Disallow hugepages when memory attributes are mixed
Disallow creating hugepages with mixed memory attributes, e.g. shared versus private, as mapping a hugepage in this case would allow the guest to access memory with the wrong attributes, e.g. overlaying private memory with a shared hugepage. Track whether or not attributes are mixed via the existing disallow_lpage field, but use the most significant bit in 'disallow_lpage' to indicate a hugepage has mixed attributes instead of using the normal refcounting. Whether or not attributes are mixed is binary; either they are or they aren't. Attempting to squeeze that info into the refcount is unnecessarily complex as it would require knowing the previous state of the mixed count when updating attributes. Using a flag means KVM just needs to ensure the current status is reflected in the memslots.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
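A standalone illustration (not KVM code) of why a flag in the most significant bit composes cleanly with the existing count in disallow_lpage; the constant and function names here are invented:

```c
/* mixed_attr_flag.c - model of packing a boolean "mixed attributes" flag
 * into the MSB of a field whose low bits are a normal disallow count.
 * Build: cc -o mixed_attr_flag mixed_attr_flag.c
 */
#include <stdint.h>
#include <stdio.h>

#define MIXED_FLAG (1u << 31)   /* analogous to using disallow_lpage's MSB */

static void set_mixed(uint32_t *disallow_lpage, int mixed)
{
	if (mixed)
		*disallow_lpage |= MIXED_FLAG;
	else
		*disallow_lpage &= ~MIXED_FLAG;
}

static int hugepage_disallowed(uint32_t disallow_lpage)
{
	/* Any reason at all, a nonzero count or the mixed flag, disallows. */
	return disallow_lpage != 0;
}

int main(void)
{
	uint32_t disallow_lpage = 0;

	disallow_lpage += 1;            /* normal "disallow" reference taken   */
	set_mixed(&disallow_lpage, 1);  /* attributes for the range are mixed  */
	disallow_lpage -= 1;            /* normal reference dropped...         */

	/* ...yet the hugepage stays disallowed because the flag is still set. */
	printf("disallowed = %d\n", hugepage_disallowed(disallow_lpage));
	return 0;
}
```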
||
ee605e3156 |
KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce the probability of exiting to userspace with a stale run->exit_reason that *appears* to be valid. To support fd-based guest memory (guest memory without a corresponding userspace virtual address), KVM will exit to userspace for various memory related errors, which userspace *may* be able to resolve, instead of using e.g. BUS_MCEERR_AR. And in the more distant future, KVM will also likely utilize the same functionality to let userspace "intercept" and handle memory faults when the userspace mapping is missing, i.e. when fast gup() fails. Because many of KVM's internal APIs related to guest memory use '0' to indicate "success, continue on" and not "exit to userspace", reporting memory faults/errors to userspace will set run->exit_reason and corresponding fields in the run structure in conjunction with a non-zero, negative return code, e.g. -EFAULT or -EHWPOISON. And because KVM already returns -EFAULT in many paths, there's a relatively high probability that KVM could return -EFAULT without setting run->exit_reason, in which case reporting KVM_EXIT_UNKNOWN is much better than reporting whatever exit reason happened to be in the run structure. Note, KVM must wait until after run->immediate_exit is serviced to sanitize run->exit_reason, as KVM's ABI is that run->exit_reason is preserved across KVM_RUN when run->immediate_exit is true.
Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoorthy@google.com
Link: https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-19-seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
16f95f3b95 |
KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
Add a new KVM exit type to allow userspace to handle memory faults that KVM cannot resolve, but that userspace *may* be able to handle (without terminating the guest). KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit conversions between private and shared memory. With guest private memory, there will be two kinds of memory conversions:
- explicit conversion: happens when the guest explicitly calls into KVM to map a range (as private or shared)
- implicit conversion: happens when the guest attempts to access a gfn that is configured in the "wrong" state (private vs. shared)
On x86 (the first architecture to support guest private memory), explicit conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE, but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable as there is (obviously) no hypercall, and there is no guarantee that the guest actually intends to convert between private and shared, i.e. what KVM thinks is an implicit conversion "request" could actually be the result of a guest code bug. KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to be implicit conversions.
Note! To allow for future possibilities where KVM reports KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's perspective), not '0'! Due to historical baggage within KVM, exiting to userspace with '0' from deep callstacks, e.g. in emulation paths, is infeasible as doing so would require a near-complete overhaul of KVM, whereas KVM already propagates -errno return codes to userspace even when the -errno originated in a low level helper.
Report the gpa+size instead of a single gfn even though the initial usage is expected to always report single pages. It's entirely possible, likely even, that KVM will someday support sub-page granularity faults, e.g. Intel's sub-page protection feature allows for additional protections at 128-byte granularity.
Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
Link: https://lore.kernel.org/all/ZQ3AmLO2SYv3DszH@google.com
Cc: Anish Moorthy <amoorthy@google.com>
Cc: David Matlack <dmatlack@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20231027182217.3615211-10-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
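A hedged sketch of how a VMM run loop might consume the new exit; the field and flag names follow the uAPI described above, error handling is minimal, and convert_range() is a hypothetical VMM helper that would flip the range via KVM_SET_MEMORY_ATTRIBUTES:

```c
/* Fragment, not a full program: one iteration of a vCPU run loop. */
#include <errno.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int handle_run(int vcpu_fd, struct kvm_run *run)
{
	int ret = ioctl(vcpu_fd, KVM_RUN, 0);

	/* Note: the exit is reported with -1/EFAULT, not with a 0 return. */
	if (ret < 0 && errno == EFAULT &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		int to_private = !!(run->memory_fault.flags &
				    KVM_MEMORY_EXIT_FLAG_PRIVATE);

		printf("implicit conversion: gpa=%#llx size=%#llx -> %s\n",
		       (unsigned long long)run->memory_fault.gpa,
		       (unsigned long long)run->memory_fault.size,
		       to_private ? "private" : "shared");
		/* convert_range(run->memory_fault.gpa,
		 *               run->memory_fault.size, to_private); */
		return 0;	/* retry KVM_RUN */
	}
	return ret;
}
```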
||
bb58b90b1a |
KVM: Introduce KVM_SET_USER_MEMORY_REGION2
Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional information can be supplied without setting userspace up to fail. The padding in the new kvm_userspace_memory_region2 structure will be used to pass a file descriptor in addition to the userspace_addr, i.e. allow userspace to point at a file descriptor and map memory into a guest that is NOT mapped into host userspace. Alternatively, KVM could simply add "struct kvm_userspace_memory_region2" without a new ioctl(), but as Paolo pointed out, adding a new ioctl() makes detection of bad flags a bit more robust, e.g. if the new fd field is guarded only by a flag and not a new ioctl(), then a userspace bug (setting a "bad" flag) would generate out-of-bounds access instead of an -EINVAL error. Cc: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-9-seanjc@google.com> Acked-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
be47941980 |
KVM SVM changes for 6.7:
- Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while running an SEV-ES guest.
- Clean up handling "failures" when KVM detects it can't emulate the "skip" action for an instruction that has already been partially emulated. Drop a hack in the SVM code that was fudging around the emulator code not giving SVM enough information to do the right thing.
Merge tag 'kvm-x86-svm-6.7' of https://github.com/kvm-x86/linux into HEAD |
||
d5cde2e0b3 |
KVM PMU change for 6.7:
- Handle NMI/SMI requests after PMU/PMI requests so that a PMI=>NMI doesn't require redoing the entire run loop due to the NMI not being detected until the final kvm_vcpu_exit_request() check before entering the guest.
Merge tag 'kvm-x86-pmu-6.7' of https://github.com/kvm-x86/linux into HEAD |
||
e122d7a100 |
KVM x86 Xen changes for 6.7:
- Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n. - Use the fast path directly from the timer callback when delivering Xen timer events. Avoid the problematic races with using the fast path by ensuring the hrtimer isn't running when (re)starting the timer or saving the timer information (for userspace). - Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future flag. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8He8SHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5KyQP+wUH3n6hhJGScsSCpWXK6r8q+Y2ZBftY ecXuoTfeBJmsoTbnExF7K600DtbxHY5jjxt3ROmoUCertCFRCoq6pi5v4rbRDDQ1 fmGkht43A6zAuHQ0Ntvkq4rNEmISAbzLP4EXOxZJ/Hxld91T8IutMFo7NN/YfOSx nb+qgb7B25T7ODGvzahRjxnoevCHBN/TdKeDrvsoWeMpVw+CDYqquQOcLfHMaBAN DqGwZzpdVqRQqg3TOuBGCiv5IcvskjkFUh0y6cEYkCR/MruLoT6CygoLImEV2naW RU0ZU9Y4cjf+BV/faQEdP6mDQwwCUHWLxDpXUVn03KQYQHlA7q6UgRKxy35ixZ5w Euxvg4m2ZGgJjsVLqTTMUlbLSNxD6wWZAVxGH7w8XghKrNmoj1IoajPZS+1rwyO2 5rUynMKf3HMT6oeqqZH95aChlUMiAvaPYPc+ogku8Bt1zJQVv/xnk/6T95Vw6C/t KfYsV80rmJd/EL/fUXYX3mCMcZGHyv80QlOEc0uR4f25HGszCG8qHiSaUtnvQUjQ xaguSuO1Cf7sdhHPWj4p/US+Jerrgd8nzoQGvKUOkdLsQzU71xwjvTZNlmmBYKKO zgGIXZfaXa4JibAqnRrC+V8UdDPOwKvOEzmH0joLEzkTISnIG2LycvZ6tG7sTcMU 0sIg2dvhJx/G =Z2eM -----END PGP SIGNATURE----- Merge tag 'kvm-x86-xen-6.7' of https://github.com/kvm-x86/linux into HEAD KVM x86 Xen changes for 6.7: - Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n. - Use the fast path directly from the timer callback when delivering Xen timer events. Avoid the problematic races with using the fast path by ensuring the hrtimer isn't running when (re)starting the timer or saving the timer information (for userspace). - Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future flag. |
||
![]() |
f0f59d069e |
KVM x86 MMU changes for 6.7:
- Clean up code that deals with honoring guest MTRRs when the VM has non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled. - Zap EPT entries when non-coherent DMA assignment stops/start to prevent using stale entries with the wrong memtype. - Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y, as there's zero reason to ignore guest PAT if the effective MTRR memtype is WB. This will also allow for future optimizations of handling guest MTRR updates for VMs with non-coherent DMA and the quirk enabled. - Harden the fast page fault path to guard against encountering an invalid root when walking SPTEs. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8FG8SHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5tYMP+gJd3raPnpmai4NyaFaZNP6/5YsXuUMj XBvHH7hBGHmjd1sV+O62fhUvNk4+M/1f1rERutP4s7yXEXxQfC9G/MQFgLBfyiW8 xR+RQkNrz8HsG8mHFBZ0Ei6OofhP+BRTYDRU7kbctKDh/4Hp5AOZAxYHs/ZhOho1 Lw6upZbQLCkdt72eEKbfocg6Tf400hWEyarBRXFe4KJzWq7KMjAPgqA/3Vx0lF6u zX73Zr6tV0mcf3QXd58Q4CUwOuwMo1aTangmOhEeC09JplF2okLV36h6WrCF8qqO gvmDrMA450Yc215peOJGBJzoZJrNjMIHZ2m+4Ifag6Z/jJoam4vjzUZmmrzx+Gbj Ot5lmXCVRXCdHmUNdYQ6yR27WaVP3C3ItkxwNZGMPoh2G08NGyLLY1kwzRyITEH4 M9jYTRBZaeue57ad5Ms9FaneBLWwPxajTX90rWZbl2kzfd8PG5cF1VroESBLoa0f I2kDcd7988xLTOMl1sfO8ci21Ve7rQc0hA6WlOXrDxb26OvYrftYXeXOCowN6kqP czXIu5ZPmLI1btimZQXGMdxKkw5wwe3wDC3y5gKrm+rTfORUXoOUDoITIpmPCnAp Dzfr5la3RI1GjHhzR80x4vXQC9BgJ9WrEwJub/RqVfE3T3ohw+NZl+AeM1xB9eT1 2mJWm6GFEm9Y =Zfbr -----END PGP SIGNATURE----- Merge tag 'kvm-x86-mmu-6.7' of https://github.com/kvm-x86/linux into HEAD KVM x86 MMU changes for 6.7: - Clean up code that deals with honoring guest MTRRs when the VM has non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled. - Zap EPT entries when non-coherent DMA assignment stops/start to prevent using stale entries with the wrong memtype. - Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y, as there's zero reason to ignore guest PAT if the effective MTRR memtype is WB. This will also allow for future optimizations of handling guest MTRR updates for VMs with non-coherent DMA and the quirk enabled. - Harden the fast page fault path to guard against encountering an invalid root when walking SPTEs. |
||
![]() |
f292dc8aad |
KVM x86 misc changes for 6.7:
- Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without forcing more common use cases to eat the extra memory overhead. - Add IBPB and SBPB virtualization support. - Fix a bug where restoring a vCPU snapshot that was taken within 1 second of creating the original vCPU would cause KVM to try to synchronize the vCPU's TSC and thus clobber the correct TSC being set by userspace. - Compute guest wall clock using a single TSC read to avoid generating an inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads. - "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain about a "Firmware Bug" if the bit isn't set for select F/M/S combos. - Don't apply side effects to Hyper-V's synthetic timer on writes from userspace to fix an issue where the auto-enable behavior can trigger spurious interrupts, i.e. do auto-enabling only for guest writes. - Remove an unnecessary kick of all vCPUs when synchronizing the dirty log without PML enabled. - Advertise "support" for non-serializing FS/GS base MSR writes as appropriate. - Use octal notation for file permissions through KVM x86. - Fix a handful of typo fixes and warts. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8EugSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5xS0P+gPTDO81CUZO70LrO2W4E7toRBf/F9x1 /v5D/76p9hG32Z6+BJs/xxDxJFagw75MtoR5oKivtXiip3TxbfOyDOlaQkIRo85E /d95il/LRidL3Mv3TXRj1lykXnxSSz9tigAGEZti1Y9Fn9fXEIwurJH7dU5cBI1E fin5bsDaTNRjG4jjTiEUbnKPRTlD/S7CQJn4CaYvZhMv/eJkYDLyBBVy4VLoLzvD ctL6VJQLGPVxbxr9mEmulaqMrSuDIQQLkRVQJAViKyerBInTEc5d/GPCHuE8O3zi 0r/QSJbMS9titWLz07NhJ1UH4VJNyaEhRlyJPSFhBW4h6dzUb3EXdUe0Hwa+JH/S H2cVqsANItTCIhvDtuEGIRDahu0eD+63h90InJ0gEVL1kSJS+UWZHB71PkUEQgAV 2OsuT1D26fuxrv+0b9ioBZURycqKw++zGsrwyVhe77eBgqBJ12tbL4TAD+QNjaQ5 HZTCe6YV83gZoOMeVkoTGSf96s9lGORgxsaAIXmFuLB9RVCVXhVh0ph2HZsnV8Hw ZXEXpBEFo7GUhb0NIvsk2W73QL87A3fLv15yITWc8KuC7/dXP9z6KpSKjFySS69X uWD1MVx6shhvbg97UzoJlXc3/z0aVzmdZJudE5d0gcFvAjIItqp6ICPOoKxfj8pT tqRZu3kVHd61 =sfp8 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-misc-6.7' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.7: - Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without forcing more common use cases to eat the extra memory overhead. - Add IBPB and SBPB virtualization support. - Fix a bug where restoring a vCPU snapshot that was taken within 1 second of creating the original vCPU would cause KVM to try to synchronize the vCPU's TSC and thus clobber the correct TSC being set by userspace. - Compute guest wall clock using a single TSC read to avoid generating an inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads. - "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain about a "Firmware Bug" if the bit isn't set for select F/M/S combos. - Don't apply side effects to Hyper-V's synthetic timer on writes from userspace to fix an issue where the auto-enable behavior can trigger spurious interrupts, i.e. do auto-enabling only for guest writes. - Remove an unnecessary kick of all vCPUs when synchronizing the dirty log without PML enabled. - Advertise "support" for non-serializing FS/GS base MSR writes as appropriate. - Use octal notation for file permissions through KVM x86. - Fix a handful of typo fixes and warts. |
||
![]() |
fad505b2cb |
KVM: x86: Service NMI requests after PMI requests in VM-Enter path
Service NMI and SMI requests after PMI requests in vcpu_enter_guest() so that KVM does not need to cancel and redo the VM-Enter if the guest configures its PMIs to be delivered as NMIs (likely) or SMIs (unlikely). Because APIC emulation "injects" NMIs via KVM_REQ_NMI, handling PMI requests after NMI requests (the likely case) means KVM won't detect the pending NMI request until the final check for outstanding requests. Detecting requests at the final stage is costly as KVM has already loaded guest state, potentially queued events for injection, disabled IRQs, dropped SRCU, etc., most of which needs to be unwound. Note that changing the order of request processing doesn't change the end result, as KVM's final check for outstanding requests prevents entering the guest until all requests are serviced. I.e. KVM will ultimately coalesce events (or not) regardless of the ordering. Using SPEC2017 benchmark programs running along with Intel vtune in a VM demonstrates that the following code change reduces 800~1500 canceled VM-Enters per second. Some glory details: Probe the invocation to vmx_cancel_injection(): $ perf probe -a vmx_cancel_injection $ perf stat -a -e probe:vmx_cancel_injection -I 10000 # per 10 seconds Partial results when SPEC2017 with Intel vtune are running in the VM: On kernel without the change: 10.010018010 14254 probe:vmx_cancel_injection 20.037646388 15207 probe:vmx_cancel_injection 30.078739816 15261 probe:vmx_cancel_injection 40.114033258 15085 probe:vmx_cancel_injection 50.149297460 15112 probe:vmx_cancel_injection 60.185103088 15104 probe:vmx_cancel_injection On kernel with the change: 10.003595390 40 probe:vmx_cancel_injection 20.017855682 31 probe:vmx_cancel_injection 30.028355883 34 probe:vmx_cancel_injection 40.038686298 31 probe:vmx_cancel_injection 50.048795162 20 probe:vmx_cancel_injection 60.069057747 19 probe:vmx_cancel_injection Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20231002040839.2630027-1-mizhang@google.com [sean: hoist PMU/PMI above SMI too, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
2770d47220 |
KVM: x86: Ignore MSR_AMD64_TW_CFG access
Hyper-V enabled Windows Server 2022 KVM VM cannot be started on Zen1 Ryzen
since it crashes at boot with SYSTEM_THREAD_EXCEPTION_NOT_HANDLED +
STATUS_PRIVILEGED_INSTRUCTION (in other words, because of an unexpected #GP
in the guest kernel).
This is because Windows tries to set bit 8 in MSR_AMD64_TW_CFG and can't
handle receiving a #GP when doing so.
Give this MSR the same treatment that commit
|
||
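A minimal userspace sketch of the behavior described in the entry above, assuming a simplified WRMSR-emulation switch; the MSR index value and the function shape are illustrative, not the exact KVM code:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MSR_AMD64_TW_CFG 0xc0011023u	/* assumed index, for illustration only */

/* Returns true if the write is accepted, false if a #GP would be injected. */
static bool toy_emulate_wrmsr(uint32_t index, uint64_t data)
{
	(void)data;	/* the written value is simply dropped */

	switch (index) {
	case MSR_AMD64_TW_CFG:
		/* Ignore the access instead of injecting #GP, so a guest
		 * that sets bit 8 (as Windows does) no longer crashes. */
		return true;
	default:
		return false;	/* unknown MSR => #GP in this toy model */
	}
}

int main(void)
{
	printf("TW_CFG write accepted: %d\n",
	       toy_emulate_wrmsr(MSR_AMD64_TW_CFG, 1u << 8));
	return 0;
}
```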
![]() |
122ae01c51 |
KVM: x86: remove the unused assigned_dev_head from kvm_arch
Legacy device assignment was dropped years ago. This field is not used anymore. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Link: https://lore.kernel.org/r/20231019043336.8998-1-liangchen.linux@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
2081a8450e |
KVM: x86: remove always-false condition in kvmclock_sync_fn
The 'kvmclock_periodic_sync' is a read-only param that cannot change after bootup. kvm_arch_vcpu_postcreate() does not schedule kvmclock_sync_work if kvmclock_periodic_sync == false. As a result, the "if (!kvmclock_periodic_sync)" check in kvmclock_sync_fn() can never be true once kvmclock_sync_work has been scheduled. Link: https://lore.kernel.org/kvm/a461bf3f-c17e-9c3f-56aa-726225e8391d@oracle.com Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lore.kernel.org/r/20231001213637.76686-1-dongli.zhang@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
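A small sketch of the reasoning in the entry above, with invented function bodies; it only models why the removed guard was dead code:

```c
#include <stdbool.h>
#include <stdio.h>

/* Read-only module param: cannot change after boot in this model. */
static const bool kvmclock_periodic_sync = true;

static void kvmclock_sync_fn(void)
{
	/*
	 * No "if (!kvmclock_periodic_sync) return;" here: the work is only
	 * ever scheduled (below) when the param is true, and the param is
	 * read-only, so the guard could never observe a false value.
	 */
	printf("periodic kvmclock sync\n");
}

static void toy_vcpu_postcreate(void)
{
	if (kvmclock_periodic_sync)	/* the only place the work is queued */
		kvmclock_sync_fn();
}

int main(void)
{
	toy_vcpu_postcreate();
	return 0;
}
```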
![]() |
3d30bfcbdc |
KVM: x86/mmu: Stop kicking vCPUs to sync the dirty log when PML is disabled
Stop kicking vCPUs in kvm_arch_sync_dirty_log() when PML is disabled. Kicking vCPUs when PML is disabled serves no purpose and could negatively impact guest performance. This restores KVM's behavior to prior to 5.12 commit |
||
![]() |
26951ec862 |
KVM: x86: Use octal for file permission
Convert all module params to octal permissions to improve code readability and to make checkpatch happy: WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Link: https://lore.kernel.org/r/20231013113020.77523-1-flyingpeng@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
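For illustration of the conversion in the entry above (in kernel code it is e.g. `module_param(name, bool, S_IRUGO)` becoming `module_param(name, bool, 0444)`; the parameter name is hypothetical), a tiny userspace check that the symbolic and octal forms are the same value:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	/* Userspace spells the kernel's S_IRUGO as S_IRUSR | S_IRGRP | S_IROTH. */
	unsigned int symbolic = S_IRUSR | S_IRGRP | S_IROTH;

	/* The symbolic form checkpatch warns about is exactly octal 0444. */
	assert(symbolic == 0444);
	printf("S_IRUGO == %04o\n", symbolic);
	return 0;
}
```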
![]() |
88e4cd893f |
KVM x86/pmu fixes for 6.6:
- Truncate writes to PMU counters to the counter's width to avoid spurious overflows when emulating counter events in software. - Set the LVTPC entry mask bit when handling a PMI (to match Intel-defined architectural behavior). - Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work to kick the guest out of emulated halt. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmUp1FESHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5IRsQAIsk+UwTP+q+ZzkpkSOJ+ocmKU97/GbW snB+F5FwNXnWEPzHIV+Ldv+WUpmHilTrylk2t5jLyew783TPxTnLmNAa+D3iSSBP jSGzCIqR2uRHOxhuJgkKvdOkfuS7vob1KcKrfOwKCSss78VhKGkMGIi66/81RTxo zxpzva+F2YtbCwKWXewOvR4CsWhjVqOGRTCmjF6t8PpFDGqwZdu0ornBHC2gvkUI iDHWVBg5Rz/akqxjEVL94SP5qdFSaVG+F3Z8xpnn+tfPncEK/xPFdGHGKwOy5Jvt 4dQLc6TGmS2+NGPU3eAJOr+GZKryQth1CI+5RDlnoKQXjQ3laJwjmgyCRbUYLoZh /R7f5YJrhGheUvCCmagY1g2x41qp/CTG1RnX1SVTIGH9h+5LSVcCukCL9Tx2/B4v eU8nrzhUuijSqG6TiyAV5hvFqMQf3LWWcjSSW58kIWmXLpqdb/Xp6wiFHjOM7wZM c1br+6AwKZwKNdqn3/cnlBnLc+1jq/PWFnuF9svjKn5JTOyg8kddmyWUkDqiLOeZ /jqqwRJQUZppy4DxFHdkuQxnTsrztNzs/vhQtF6MIgFRULrs4FaiTUxuAs72skqm Fv/IIuyHWjST9HY8dgTx8PLqUevEc7zekmhN1Cj5KwhlHxKYWSZfew80CO7h2qhJ IvAC70QC+BsW =g8g3 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-pmu-6.6-fixes' of https://github.com/kvm-x86/linux into HEAD KVM x86/pmu fixes for 6.6: - Truncate writes to PMU counters to the counter's width to avoid spurious overflows when emulating counter events in software. - Set the LVTPC entry mask bit when handling a PMI (to match Intel-defined architectural behavior). - Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work to kick the guest out of emulated halt. |
||
![]() |
8647c52e95 |
KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2}
Mask off xfeatures that aren't exposed to the guest only when saving guest state via KVM_GET_XSAVE{2} instead of modifying user_xfeatures directly. Preserving the maximal set of xfeatures in user_xfeatures restores KVM's ABI for KVM_SET_XSAVE, which prior to commit |
||
![]() |
18164f66e6 |
x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer
Plumb an xfeatures mask into __copy_xstate_to_uabi_buf() so that KVM can constrain which xfeatures are saved into the userspace buffer without having to modify the user_xfeatures field in KVM's guest_fpu state. KVM's ABI for KVM_GET_XSAVE{2} is that features that are not exposed to the guest must not show up in the effective xstate_bv field of the buffer. Saving only the guest-supported xfeatures allows userspace to load the saved state on a different host with fewer xfeatures, so long as the target host supports the xfeatures that are exposed to the guest. KVM currently sets user_xfeatures directly to restrict KVM_GET_XSAVE{2} to the set of guest-supported xfeatures, but doing so broke KVM's historical ABI for KVM_SET_XSAVE, which allows userspace to load any xfeatures that are supported by the *host*. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230928001956.924301-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
![]() |
362ff6dca5 |
KVM: x86/mmu: Zap KVM TDP when noncoherent DMA assignment starts/stops
Zap KVM TDP when noncoherent DMA assignment starts (noncoherent dma count transitions from 0 to 1) or stops (noncoherent dma count transitions from 1 to 0). Before the zap, test if guest MTRR is to be honored after the assignment starts or was honored before the assignment stops. When there's no noncoherent DMA device, EPT memory type is ((MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT) When there're noncoherent DMA devices, EPT memory type needs to honor guest CR0.CD and MTRR settings. So, if noncoherent DMA count transitions between 0 and 1, EPT leaf entries need to be zapped to clear stale memory type. This issue might be hidden when the device is statically assigned with VFIO adding/removing MMIO regions of the noncoherent DMA devices for several times during guest boot, and current KVM MMU will call kvm_mmu_zap_all_fast() on the memslot removal. But if the device is hot-plugged, or if the guest has mmio_always_on for the device, the MMIO regions of it may only be added for once, then there's no path to do the EPT entries zapping to clear stale memory type. Therefore do the EPT zapping when noncoherent assignment starts/stops to ensure stale entries cleaned away. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230714065223.20432-1-yan.y.zhao@intel.com [sean: fix misspelled words in comment and changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
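A toy model of the transition logic described in the entry above, with invented helper names; only the "zap on the 0<->1 count transitions" idea is taken from the changelog:

```c
#include <stdio.h>

static int noncoherent_dma_count;

static void toy_zap_ept_leaves(void)
{
	printf("zap EPT leaf entries to drop stale memtype\n");
}

static void toy_noncoherent_dma_start(void)
{
	if (++noncoherent_dma_count == 1)	/* 0 -> 1: memtype rules change */
		toy_zap_ept_leaves();
}

static void toy_noncoherent_dma_stop(void)
{
	if (--noncoherent_dma_count == 0)	/* 1 -> 0: memtype rules change */
		toy_zap_ept_leaves();
}

int main(void)
{
	toy_noncoherent_dma_start();	/* zap */
	toy_noncoherent_dma_start();	/* no zap, count 1 -> 2 */
	toy_noncoherent_dma_stop();	/* no zap, count 2 -> 1 */
	toy_noncoherent_dma_stop();	/* zap */
	return 0;
}
```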
![]() |
bf328e22e4 |
KVM: x86: Don't sync user-written TSC against startup values
The legacy API for setting the TSC is fundamentally broken, and only allows userspace to set a TSC "now", without any way to account for time lost between the calculation of the value, and the kernel eventually handling the ioctl. To work around this, KVM has a hack which, if a TSC is set with a value which is within a second's worth of the last TSC "written" to any vCPU in the VM, assumes that userspace actually intended the two TSC values to be in sync and adjusts the newly-written TSC value accordingly. Thus, when a VMM restores a guest after suspend or migration using the legacy API, the TSCs aren't necessarily *right*, but at least they're in sync. This trick falls down when restoring a guest which genuinely has been running for less time than the 1 second of imprecision KVM allows for in in the legacy API. On *creation*, the first vCPU starts its TSC counting from zero, and the subsequent vCPUs synchronize to that. But then when the VMM tries to restore a vCPU's intended TSC, because the VM has been alive for less than 1 second and KVM's default TSC value for new vCPU's is '0', the intended TSC is within a second of the last "written" TSC and KVM incorrectly adjusts the intended TSC in an attempt to synchronize. But further hacks can be piled onto KVM's existing hackish ABI, and declare that the *first* value written by *userspace* (on any vCPU) should not be subject to this "correction", i.e. KVM can assume that the first write from userspace is not an attempt to sync up with TSC values that only come from the kernel's default vCPU creation. To that end: Add a flag, kvm->arch.user_set_tsc, protected by kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in the VM *has* been set by userspace, and make the 1-second slop hack only trigger if user_set_tsc is already set. Note that userspace can explicitly request a *synchronization* of the TSC by writing zero. For the purpose of user_set_tsc, an explicit synchronization counts as "setting" the TSC, i.e. if userspace then subsequently writes an explicit non-zero value which happens to be within 1 second of the previous value, the new value will be "corrected". This behavior is deliberate, as treating explicit synchronization as "setting" the TSC preserves KVM's existing behaviour inasmuch as possible (KVM always applied the 1-second "correction" regardless of whether the write came from userspace vs. the kernel). Reported-by: Yong He <alexyonghe@tencent.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423 Suggested-by: Oliver Upton <oliver.upton@linux.dev> Original-by: Oliver Upton <oliver.upton@linux.dev> Original-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Tested-by: Yong He <alexyonghe@tencent.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20231008025335.7419-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
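A self-contained sketch of the heuristic described in the entry above; the structure, field names, and the simplified "sync to the previous write" step are illustrative, while the real code works on TSC offsets and generations:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy_vm {
	bool     user_set_tsc;     /* has userspace ever written a TSC? */
	uint64_t last_written_tsc; /* last value "written" to any vCPU */
	uint64_t tsc_hz;
};

static uint64_t apply_tsc_write(struct toy_vm *vm, uint64_t value)
{
	uint64_t delta = value > vm->last_written_tsc ?
			 value - vm->last_written_tsc :
			 vm->last_written_tsc - value;
	bool within_1s = delta < vm->tsc_hz;

	/* Only "correct" the value (sync to the previous write) if userspace
	 * has already set a TSC at least once; a first write must not be
	 * matched against the kernel's default of 0 from vCPU creation. */
	uint64_t effective = (vm->user_set_tsc && within_1s) ?
			     vm->last_written_tsc : value;

	vm->user_set_tsc = true;
	vm->last_written_tsc = value;
	return effective;
}

int main(void)
{
	struct toy_vm vm = { .tsc_hz = 1000000000ull };

	/* Restoring a guest < 1s old: the first user write is taken verbatim. */
	printf("first write  -> %llu\n",
	       (unsigned long long)apply_tsc_write(&vm, 500000000ull));
	/* A second write within 1s of the first is synced to it. */
	printf("second write -> %llu\n",
	       (unsigned long long)apply_tsc_write(&vm, 500000100ull));
	return 0;
}
```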
![]() |
7a18c7c2b6 |
KVM: x86/mmu: Zap SPTEs when CR0.CD is toggled iff guest MTRRs are honored
Zap SPTEs when CR0.CD is toggled if and only if KVM's MMU is honoring guest MTRRs, which is the only time that KVM incorporates the guest's CR0.CD into the final memtype. Suggested-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230714065122.20315-1-yan.y.zhao@intel.com [sean: rephrase shortlog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
8b0e00fba9 |
KVM: x86: Virtualize HWCR.TscFreqSel[bit 24]
On certain CPUs, Linux guests expect HWCR.TscFreqSel[bit 24] to be set. If it isn't set, they complain: [Firmware Bug]: TSC doesn't count with P0 frequency! Allow userspace (and the guest) to set this bit in the virtual HWCR to eliminate the above complaint. Allow the guest to write the bit even though it is R/O on *some* CPUs. Like many bits in HWCR, TscFreqSel is not architectural at all. On Family 10h[1], it was R/W and powered on as 0. In Family 15h, one of the "changes relative to Family 10H Revision D processors"[2] was: • MSRC001_0015 [Hardware Configuration (HWCR)]: • Dropped TscFreqSel; TSC can no longer be selected to run at NB P0-state. Despite the "Dropped" above, that same document later describes HWCR[bit 24] as follows: TscFreqSel: TSC frequency select. Read-only. Reset: 1. 1=The TSC increments at the P0 frequency. If the guest clears the bit, the worst case scenario is the guest will be no worse off than it is today, e.g. the whining may return after a guest clears the bit and kexec()'s into a new kernel. [1] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/programmer-references/31116.pdf [2] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/programmer-references/42301_15h_Mod_00h-0Fh_BKDG.pdf Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20230929230246.1954854-3-jmattson@google.com [sean: elaborate on why the bit is writable by the guest] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
598a790fc2 |
KVM: x86: Allow HWCR.McStatusWrEn to be cleared once set
When HWCR is set to 0, store 0 in vcpu->arch.msr_hwcr.
Fixes:
|
||
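A toy model of the virtual HWCR handling covered by the two entries above, assuming McStatusWrEn is bit 18 and TscFreqSel is bit 24; silently ignoring other set bits (rather than faulting) is a simplification of the real handling:

```c
#include <stdint.h>
#include <stdio.h>

#define HWCR_MC_STATUS_WR_EN	(1ull << 18)
#define HWCR_TSC_FREQ_SEL	(1ull << 24)
#define HWCR_GUEST_WRITABLE	(HWCR_MC_STATUS_WR_EN | HWCR_TSC_FREQ_SEL)

static uint64_t msr_hwcr;

static void toy_set_hwcr(uint64_t data)
{
	/* Storing only the writable bits means a "write 0" really clears
	 * previously set bits (the McStatusWrEn fix), and a guest that sets
	 * TscFreqSel stops warning that the TSC doesn't count at P0. */
	msr_hwcr = data & HWCR_GUEST_WRITABLE;
}

int main(void)
{
	toy_set_hwcr(HWCR_TSC_FREQ_SEL | HWCR_MC_STATUS_WR_EN);
	printf("hwcr=%#llx\n", (unsigned long long)msr_hwcr);

	toy_set_hwcr(0);	/* clearing must stick */
	printf("hwcr=%#llx\n", (unsigned long long)msr_hwcr);
	return 0;
}
```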
![]() |
5d6d6a7d7e |
KVM: x86: Refine calculation of guest wall clock to use a single TSC read
When populating the guest's PV wall clock information, KVM currently does a simple 'kvm_get_real_ns() - get_kvmclock_ns(kvm)'. This is an antipattern which should be avoided; when working with the relationship between two clocks, it's never correct to obtain one of them "now" and then the other at a slightly different "now" after an unspecified period of preemption (which might not even be under the control of the kernel, if this is an L1 hosting an L2 guest under nested virtualization). Add a kvm_get_wall_clock_epoch() function to return the guest wall clock epoch in nanoseconds using the same method as __get_kvmclock() — by using kvm_get_walltime_and_clockread() to calculate both the wall clock and KVM clock time from a *single* TSC reading. The condition using get_cpu_tsc_khz() is equivalent to the version in __get_kvmclock() which separately checks for the CONSTANT_TSC feature or the per-CPU cpu_tsc_khz. Which is what get_cpu_tsc_khz() does anyway. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/bfc6d3d7cfb88c47481eabbf5a30a264c58c7789.camel@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
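An illustrative model of the "single reading" point in the entry above: when both the wall clock and the guest clock are derived from the same TSC sample, the computed epoch cannot be skewed by preemption between two separate "now" reads. All field names and numbers below are invented:

```c
#include <stdint.h>
#include <stdio.h>

struct clock_params {
	uint64_t tsc_hz;
	uint64_t base_tsc;	/* TSC value at which the bases below were captured */
	uint64_t wall_base_ns;	/* wall-clock time at base_tsc */
	uint64_t kvm_base_ns;	/* guest clock at base_tsc */
};

static uint64_t tsc_to_ns(const struct clock_params *p, uint64_t tsc)
{
	return (tsc - p->base_tsc) * 1000000000ull / p->tsc_hz;
}

static uint64_t wall_clock_epoch_ns(const struct clock_params *p, uint64_t tsc)
{
	/* Both clocks are computed from the same tsc sample, so the
	 * difference is identical no matter when the sample was taken. */
	uint64_t wall_ns = p->wall_base_ns + tsc_to_ns(p, tsc);
	uint64_t kvm_ns  = p->kvm_base_ns  + tsc_to_ns(p, tsc);

	return wall_ns - kvm_ns;
}

int main(void)
{
	struct clock_params p = {
		.tsc_hz = 3000000000ull,
		.base_tsc = 1000,
		.wall_base_ns = 1700000000000000000ull,
		.kvm_base_ns = 5000000000ull,
	};

	printf("epoch=%llu\n",
	       (unsigned long long)wall_clock_epoch_ns(&p, 4000));
	return 0;
}
```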
![]() |
e47d86083c |
KVM: x86: Add SBPB support
Add support for the AMD Selective Branch Predictor Barrier (SBPB) by advertising the CPUID bit and handling PRED_CMD writes accordingly. Note, like SRSO_NO and IBPB_BRTYPE before it, advertise support for SBPB even if it's not enumerated in the raw CPUID. Some CPUs that gained support via a uCode patch don't report SBPB via CPUID (the kernel forces the flag). Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/a4ab1e7fe50096d50fde33e739ed2da40b41ea6a.1692919072.git.jpoimboe@kernel.org Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
0068299540 |
KVM: SVM: Treat all "skip" emulation for SEV guests as outright failures
Treat EMULTYPE_SKIP failures on SEV guests as unhandleable emulation instead of simply resuming the guest, and drop the hack-a-fix which effects that behavior for the INT3/INTO injection path. If KVM can't skip an instruction for which KVM has already done partial emulation, resuming the guest is undesirable as doing so may corrupt guest state. Link: https://lore.kernel.org/r/20230825013621.2845700-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
aeb904f6b9 |
KVM: x86: Refactor can_emulate_instruction() return to be more expressive
Refactor and rename can_emulate_instruction() to allow vendor code to return more than true/false, e.g. to explicitly differentiate between "retry", "fault", and "unhandleable". For now, just do the plumbing, a future patch will expand SVM's implementation to signal outright failure if KVM attempts EMULTYPE_SKIP on an SEV guest. No functional change intended (or rather, none that are visible to the guest or userspace). Link: https://lore.kernel.org/r/20230825013621.2845700-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
ee11ab6bb0 |
KVM: X86: Reduce size of kvm_vcpu_arch structure when CONFIG_KVM_XEN=n
When CONFIG_KVM_XEN=n, the size of kvm_vcpu_arch can be reduced from 5100+ to 4400+ by adding macro control. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Link: https://lore.kernel.org/all/CAPm50aKwbZGeXPK5uig18Br8CF1hOS71CE2j_dLX+ub7oJdpGg@mail.gmail.com [sean: fix whitespace damage] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
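A sketch of the shape of the change in the entry above, with an invented structure standing in for kvm_vcpu_arch; only the #ifdef CONFIG_KVM_XEN idea comes from the changelog:

```c
#include <stdio.h>

struct toy_vcpu_xen {
	unsigned long long vcpu_info_cache[32];
	unsigned long long timer_expires;
};

struct toy_vcpu_arch {
	unsigned long long regs[16];
#ifdef CONFIG_KVM_XEN
	struct toy_vcpu_xen xen;	/* only present when Xen support is built in */
#endif
};

int main(void)
{
	/* Without CONFIG_KVM_XEN defined, the Xen state costs nothing. */
	printf("sizeof(toy_vcpu_arch) = %zu\n", sizeof(struct toy_vcpu_arch));
	return 0;
}
```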
![]() |
4346db6e6e |
KVM: x86: Force TLB flush on userspace changes to special registers
Userspace can directly modify the content of vCPU's CR0, CR3, and CR4 via KVM_SYNC_X86_SREGS and KVM_SET_SREGS{,2}. Make sure that KVM flushes guest TLB entries and paging-structure caches if a (partial) guest TLB flush is architecturally required based on the CRn changes. To keep things simple, flush whenever KVM resets the MMU context, i.e. if any bits in CR0, CR3, CR4, or EFER are modified. This is extreme overkill, but stuffing state from userspace is not such a hot path that preserving guest TLB state is a priority. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230814222358.707877-3-mhal@rbox.co [sean: call out that the flushing on MMU context resets is for simplicity] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
9dbb029b9c |
KVM: x86: Remove redundant vcpu->arch.cr0 assignments
Drop the vcpu->arch.cr0 assignment after static_call(kvm_x86_set_cr0). CR0 was already set by {vmx,svm}_set_cr0(). Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230814222358.707877-2-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
73554b29bd |
KVM: x86/pmu: Synthesize at most one PMI per VM-exit
When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
VM-exit that also invokes __kvm_perf_overflow() as a result of
instruction emulation, kvm_pmu_deliver_pmi() will be called twice
before the next VM-entry.
Calling kvm_pmu_deliver_pmi() twice is unlikely to be problematic now that
KVM sets the LVTPC mask bit when delivering a PMI. But using IRQ work to
trigger the PMI is still broken, albeit very theoretically.
E.g. if the self-IPI to trigger IRQ work is delayed long enough for the
vCPU to be migrated to a different pCPU, then it's possible for
kvm_pmi_trigger_fn() to race with the kvm_pmu_deliver_pmi() from
KVM_REQ_PMI and still generate two PMIs.
KVM could set the mask bit using an atomic operation, but that'd just be
piling on unnecessary code to workaround what is effectively a hack. The
*only* reason KVM uses IRQ work is to ensure the PMI is treated as a wake
event, e.g. if the vCPU just executed HLT.
Remove the irq_work callback for synthesizing a PMI, and all of the
logic for invoking it. Instead, to prevent a vcpu from leaving C0 with
a PMI pending, add a check for KVM_REQ_PMI to kvm_vcpu_has_events().
Fixes:
|
||
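A toy model of the replacement described in the entry above (a pending PMI request counts as a wake event), with invented request numbering and structures:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_REQ_NMI	0
#define TOY_REQ_PMI	1

struct toy_vcpu {
	uint64_t requests;
	bool halted;
};

static void toy_make_request(struct toy_vcpu *vcpu, int req)
{
	vcpu->requests |= 1ull << req;
}

static bool toy_vcpu_has_events(const struct toy_vcpu *vcpu)
{
	/* A pending PMI now counts as a wake event... */
	if (vcpu->requests & (1ull << TOY_REQ_PMI))
		return true;
	/* ...alongside the existing checks (NMI shown as a stand-in). */
	return vcpu->requests & (1ull << TOY_REQ_NMI);
}

int main(void)
{
	struct toy_vcpu vcpu = { .halted = true };

	toy_make_request(&vcpu, TOY_REQ_PMI);
	if (toy_vcpu_has_events(&vcpu))
		vcpu.halted = false;	/* the emulated HLT is woken up */

	printf("halted=%d\n", vcpu.halted);
	return 0;
}
```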
![]() |
0df9dab891 |
KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously
Stop zapping invalidate TDP MMU roots via work queue now that KVM preserves TDP MMU roots until they are explicitly invalidated. Zapping roots asynchronously was effectively a workaround to avoid stalling a vCPU for an extended during if a vCPU unloaded a root, which at the time happened whenever the guest toggled CR0.WP (a frequent operation for some guest kernels). While a clever hack, zapping roots via an unbound worker had subtle, unintended consequences on host scheduling, especially when zapping multiple roots, e.g. as part of a memslot. Because the work of zapping a root is no longer bound to the task that initiated the zap, things like the CPU affinity and priority of the original task get lost. Losing the affinity and priority can be especially problematic if unbound workqueues aren't affined to a small number of CPUs, as zapping multiple roots can cause KVM to heavily utilize the majority of CPUs in the system, *beyond* the CPUs KVM is already using to run vCPUs. When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root zap can result in KVM occupying all logical CPUs for ~8ms, and result in high priority tasks not being scheduled in in a timely manner. In v5.15, which doesn't preserve unloaded roots, the issues were even more noticeable as KVM would zap roots more frequently and could occupy all CPUs for 50ms+. Consuming all CPUs for an extended duration can lead to significant jitter throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots is a semi-frequent operation as memslots are deleted and recreated with different host virtual addresses to react to host GPU drivers allocating and freeing GPU blobs. On ChromeOS, the jitter manifests as audio blips during games due to the audio server's tasks not getting scheduled in promptly, despite the tasks having a high realtime priority. Deleting memslots isn't exactly a fast path and should be avoided when possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the memslot shenanigans, but KVM is squarely in the wrong. Not to mention that removing the async zapping eliminates a non-trivial amount of complexity. Note, one of the subtle behaviors hidden behind the async zapping is that KVM would zap invalidated roots only once (ignoring partial zaps from things like mmu_notifier events). Preserve this behavior by adding a flag to identify roots that are scheduled to be zapped versus roots that have already been zapped but not yet freed. Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can encounter invalid roots, as it's not at all obvious why zapping invalidated roots shouldn't simply zap all invalid roots. Reported-by: Pattara Teerapong <pteerapong@google.com> Cc: David Stevens <stevensd@google.com> Cc: Yiwei Zhang<zzyiwei@google.com> Cc: Paul Hsia <paulhsia@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230916003916.2545000-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
![]() |
58ea7cf700 |
KVM: x86/mmu: Move KVM-only page-track declarations to internal header
Bury the declaration of the page-track helpers that are intended only for internal KVM use in a "private" header. In addition to guarding against unwanted usage of the internal-only helpers, dropping their definitions avoids exposing other structures that should be KVM-internal, e.g. for memslots. This is a baby step toward making kvm_host.h a KVM-internal header in the very distant future. Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
![]() |
b83ab124de |
KVM: x86: Add a new page-track hook to handle memslot deletion
Add a new page-track hook, track_remove_region(), that is called when a memslot DELETE operation is about to be committed. The "remove" hook will be used by KVMGT and will effectively replace the existing track_flush_slot() altogether now that KVM itself doesn't rely on the "flush" hook either. The "flush" hook is flawed as it's invoked before the memslot operation is guaranteed to succeed, i.e. KVM might ultimately keep the existing memslot without notifying external page track users, a.k.a. KVMGT. In practice, this can't currently happen on x86, but there are no guarantees that won't change in the future, not to mention that "flush" does a very poor job of describing what is happening. Pass in the gfn+nr_pages instead of the slot itself so external users, i.e. KVMGT, don't need to exposed to KVM internals (memslots). This will help set the stage for additional cleanups to the page-track APIs. Opportunistically align the existing srcu_read_lock_held() usage so that the new case doesn't stand out like a sore thumb (and not aligning the new code makes bots unhappy). Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230729013535.1070024-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
![]() |
c70934e0ab |
KVM: x86: Reject memslot MOVE operations if KVMGT is attached
Disallow moving memslots if the VM has external page-track users, i.e. if
KVMGT is being used to expose a virtual GPU to the guest, as KVMGT doesn't
correctly handle moving memory regions.
Note, this is potential ABI breakage! E.g. userspace could move regions
that aren't shadowed by KVMGT without harming the guest. However, the
only known user of KVMGT is QEMU, and QEMU doesn't move generic memory
regions. KVM's own support for moving memory regions was also broken for
multiple years (albeit for an edge case, but arguably moving RAM is
itself an edge case), e.g. see commit
|
||
![]() |
db0d70e610 |
KVM: x86/mmu: Move kvm_arch_flush_shadow_{all,memslot}() to mmu.c
Move x86's implementation of kvm_arch_flush_shadow_{all,memslot}() into mmu.c, and make kvm_mmu_zap_all() static as it was globally visible only for kvm_arch_flush_shadow_all(). This will allow refactoring kvm_arch_flush_shadow_memslot() to call kvm_mmu_zap_all() directly without having to expose kvm_mmu_zap_all_fast() outside of mmu.c. Keeping everything in mmu.c will also likely simplify supporting TDX, which intends to do zap only relevant SPTEs on memslot updates. No functional change intended. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
![]() |
6d5e3c318a |
KVM x86 changes for 6.6:
- Misc cleanups - Retry APIC optimized recalculation if a vCPU is added/enabled - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the "emergency disabling" behavior to KVM actually being loaded, and move all of the logic within KVM - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC ratio MSR can diverge from the default iff TSC scaling is enabled, and clean up related code - Add a framework to allow "caching" feature flags so that KVM can check if the guest can use a feature without needing to search guest CPUID -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmTueMwSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5hp4P/i/UmIJEJupryUrD/ZXcSjqmupCtv4JS Z2o1KIAPbM5GUX4iyF1cnZrI4Ac5zMtULN8Tp3ATOp3AqKy72AqB1Z82e+v6SKis KfSXlDFCPFisrwv3Ys7JEu9vIS8oqITHmSBk8OAmElwujdQ5jYLZjwGbCXbM9qas yCFGLqD4fjX8XqkZLmXggjT99MPSgiTPoKL592Wq4JR8mY4hyQqJzBepDjb94sT7 wrsAv1B+BchGDguk0+nOdmHM4emGrZU7fVqi3OFPofSlwAAdkqZObleb422KB058 5bcpNow+9VH5pzgq8XSAU7DLNgH9aXH0PcVU8ASU6P0D9fceKoOFuL47nnFbwz0t vKafcXNWFs8xHE4iyzvAAsZK/X8GR0ngNByPnamATMsjt2tTmsa5BOyAPkIN+GpT DzZCIk27SbdGC3lGYlSV+5ob/+sOr6m384DkvSZnU6JiiFLlZiTxURj1/9Zvfka8 2co2wnf8cJxnKFUThFfuxs9XpKgvhkOE8LauwCSo4MAQM95Pen+NAK960RBWj0xl wof5kIGmKbwmMXyg2Sr+EKqe5KRPba22Yi3x24tURAXafKK/AW7T8dgEEXOll7dp pKmTPAevwUk9wYIGultjhEBXKYgMOeD2BVoTa5je5h1Da28onrSJ7aLQUixHHs0J gLdtzs8M9K9t =yGM1 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-misc-6.6' of https://github.com/kvm-x86/linux into HEAD KVM x86 changes for 6.6: - Misc cleanups - Retry APIC optimized recalculation if a vCPU is added/enabled - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the "emergency disabling" behavior to KVM actually being loaded, and move all of the logic within KVM - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC ratio MSR can diverge from the default iff TSC scaling is enabled, and clean up related code - Add a framework to allow "caching" feature flags so that KVM can check if the guest can use a feature without needing to search guest CPUID |
||
![]() |
1814db83c0 |
KVM: x86: Selftests changes for 6.6:
- Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs - Add support for printf() in guest code and covert all guest asserts to use printf-based reporting - Clean up the PMU event filter test and add new testcases - Include x86 selftests in the KVM x86 MAINTAINERS entry -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmTueu4SHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5wvIQAK8jWhb1Y4CzrJmcZyYYIR6apgtXl4vB KbhFIFHi5ZeZXlpXA2o/FW8Q9LNmcRLtxoapb09t/eyb0+ODllDPt/aSG7p6Y4p9 rNb1g6Hj77LTaG5gMy7/lbk9ERzf61+MKUuucU7WzjlY8oyd+lm+y2cx2O3+S/89 C5cp2CGnqK2NMbUnzYN8izMrdvtwDvgQvm3H7Ah8yrGXJkcemVggXibuh+2coTfo p2RKrY+A4Syw/edNe0GVZYoSVJdwPEif8o0gAz5PwC2LTjpf9Iobt89KEx08BkVw ms0MFbwLS66MoSYIVoZkBdy/Tri5aCKxHGqu7taEWhogjbzrPvktA6PNYihO4zGa OSjA/oyAPvFJ4cLuBlrVh/xPWVoGX/6Sx3dBP5TI3zyR0FAqZkoAPDivWhflOpTt q3aoHr6THGRzqHOCYuX7nwzhqBFSSHUF1zy/P7rThSzieSzUiJiANUwBjTeB9Wsr 5Cn+KQ8XOZw1LVcoeI9y97xcHh9HeP3seO+MFie8OH9QK4nUqgqEbF8sp7WF0rB6 6rZ1lht9a2Qx4xdtqSMBkQdgnnaiCZ7jBtEFMK6kSQ67zvorlCwkOue3TrtorJ4H 1XI/DGAzltEfCLMAq+4FkHkkEr84S3gRjaLlI9aHWlVrSk1wxM87R16jgVfJp74R gTNAzCys2KwM =dHTQ -----END PGP SIGNATURE----- Merge tag 'kvm-x86-selftests-6.6' of https://github.com/kvm-x86/linux into HEAD KVM: x86: Selftests changes for 6.6: - Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs - Add support for printf() in guest code and covert all guest asserts to use printf-based reporting - Clean up the PMU event filter test and add new testcases - Include x86 selftests in the KVM x86 MAINTAINERS entry |
||
![]() |
e0fb12c673 |
KVM/arm64 updates for Linux 6.6
- Add support for TLB range invalidation of Stage-2 page tables, avoiding unnecessary invalidations. Systems that do not implement range invalidation still rely on a full invalidation when dealing with large ranges. - Add infrastructure for forwarding traps taken from a L2 guest to the L1 guest, with L0 acting as the dispatcher, another baby step towards the full nested support. - Simplify the way we deal with the (long deprecated) 'CPU target', resulting in a much needed cleanup. - Fix another set of PMU bugs, both on the guest and host sides, as we seem to never have any shortage of those... - Relax the alignment requirements of EL2 VA allocations for non-stack allocations, as we were otherwise wasting a lot of that precious VA space. - The usual set of non-functional cleanups, although I note the lack of spelling fixes... -----BEGIN PGP SIGNATURE----- iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmTsXrUPHG1hekBrZXJu ZWwub3JnAAoJECPQ0LrRPXpDZpIQAJUM1rNEOJ8ExYRfoG1LaTfcOm5TD6D1IWlO uCUx4xLMBudw/55HusmUSdiomQ3Xg5UdRaU7vX5OYwPbdoWebjEUfgdP3jCA/TiW mZTMv3x9hOvp+EOS/UnS469cERvg1/KfwcdOQsWL0HsCFZnu2XmQHWPD++vovLNp F1892ij875mC6C6mOR60H2nyjIiCuqWh/8eKBkp65CARCbFDYxWhqBnmcmTvoquh E87pQDPdtgXc0KlOWCABh5bYOu1WGVEXE5f3ixtdY9cQakkSI3NkFKw27/mIWS4q TCsagByNnPFDXTglb1dJopNdluLMFi1iXhRJX78R/PYaHxf4uFafWcQk1U7eDdLg 1kPANggwYe4KNAQZUvRhH7lIPWHCH0r4c1qHV+FsiOZVoDOSKHo4RW1ZFtirJSNW LNJMdk+8xyae0S7z164EpZB/tpFttX4gl3YvUT/T+4gH8+CRFAaoAlK39CoGDPpk f+P2GE1Z5YupF16YjpZtBnan55KkU1b6eORl5zpnAtoaz5WGXqj1t4qo0Q6e9WB9 X4rdDVhH7vRUmhjmSP6PuEygb84hnITLdGpkH2BmWj/4uYuCN+p+U2B2o/QdMJoo cPxdflLOU/+1gfAFYPtHVjVKCqzhwbw3iLXQpO12gzRYqE13rUnAr7RuGDf5fBVC LW7Pv81o =DKhx -----END PGP SIGNATURE----- Merge tag 'kvmarm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for Linux 6.6 - Add support for TLB range invalidation of Stage-2 page tables, avoiding unnecessary invalidations. Systems that do not implement range invalidation still rely on a full invalidation when dealing with large ranges. - Add infrastructure for forwarding traps taken from a L2 guest to the L1 guest, with L0 acting as the dispatcher, another baby step towards the full nested support. - Simplify the way we deal with the (long deprecated) 'CPU target', resulting in a much needed cleanup. - Fix another set of PMU bugs, both on the guest and host sides, as we seem to never have any shortage of those... - Relax the alignment requirements of EL2 VA allocations for non-stack allocations, as we were otherwise wasting a lot of that precious VA space. - The usual set of non-functional cleanups, although I note the lack of spelling fixes... |
||
![]() |
fe60e8f65f |
KVM: x86: Use KVM-governed feature framework to track "XSAVES enabled"
Use the governed feature framework to track if XSAVES is "enabled", i.e. if XSAVES can be used by the guest. Add a comment in the SVM code to explain the very unintuitive logic of deliberately NOT checking if XSAVES is enumerated in the guest CPUID model. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20230815203653.519297-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
392a532462 |
x86: kvm: x86: Remove unnecessary initial values of variables
bitmap and khz are assigned before they are read, so they do not need initial values. Signed-off-by: Li zeming <zeming@nfschina.com> Link: https://lore.kernel.org/r/20230817002631.2885-1-zeming@nfschina.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
7b0151caf7 |
KVM: x86: Remove WARN sanity check on hypervisor timer vs. UNINITIALIZED vCPU
Drop the WARN in KVM_RUN that asserts that KVM isn't using the hypervisor timer, a.k.a. the VMX preemption timer, for a vCPU that is in the UNINITIALIZED activity state. The intent of the WARN is to sanity check that KVM won't drop a timer interrupt due to an unexpected transition to UNINITIALIZED, but unfortunately userspace can use various ioctl()s to force the unexpected state. Drop the sanity check instead of switching from the hypervisor timer to a software-based timer, as the only reason to switch to a software timer when a vCPU is blocking is to ensure the timer interrupt wakes the vCPU, but said interrupt isn't a valid wake event for vCPUs in UNINITIALIZED state *and* the interrupt will be dropped in the end. Reported-by: Yikebaer Aizezi <yikebaer61@gmail.com> Closes: https://lore.kernel.org/all/CALcu4rbFrU4go8sBHk3FreP+qjgtZCGcYNpSiEXOLm==qFv7iQ@mail.gmail.com Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230808232057.2498287-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
765da7fe0e |
KVM: x86: Remove break statements that will never be executed
Fix compiler warnings when compiling KVM with [-Wunreachable-code-break]. No functional change intended. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230807094243.32516-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
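The warning pattern in question, shown on an invented helper; a break that directly follows a return can never execute:

```c
#include <stdio.h>

static const char *toy_reg_name(int reg)
{
	switch (reg) {
	case 0:
		return "rax";
		/* break;  <- unreachable, trips -Wunreachable-code-break, deleted */
	case 1:
		return "rcx";
	default:
		return "unknown";
	}
}

int main(void)
{
	printf("%s\n", toy_reg_name(0));
	return 0;
}
```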
![]() |
619b507244 |
KVM: Move kvm_arch_flush_remote_tlbs_memslot() to common code
Move kvm_arch_flush_remote_tlbs_memslot() to common code and drop "arch_" from the name. kvm_arch_flush_remote_tlbs_memslot() is just a range-based TLB invalidation where the range is defined by the memslot. Now that kvm_flush_remote_tlbs_range() can be called from common code we can just use that and drop a bunch of duplicate code from the arch directories. Note this adds a lockdep assertion for slots_lock being held when calling kvm_flush_remote_tlbs_memslot(), which was previously only asserted on x86. MIPS has calls to kvm_flush_remote_tlbs_memslot(), but they all hold the slots_lock, so the lockdep assertion continues to hold true. Also drop the CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT ifdef gating kvm_flush_remote_tlbs_memslot(), since it is no longer necessary. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Acked-by: Anup Patel <anup@brainfault.org> Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-7-rananta@google.com |
||
![]() |
eb3515dc99 |
x86: Move gds_ucode_mitigated() declaration to header
The declaration got placed in the .c file of the caller, but that
causes a warning for the definition:
arch/x86/kernel/cpu/bugs.c:682:6: error: no previous prototype for 'gds_ucode_mitigated' [-Werror=missing-prototypes]
Move it to a header where both sides can observe it instead.
Fixes:
|
||
![]() |
64094e7e31 |
Mitigate Gather Data Sampling issue
* Add Base GDS mitigation * Support GDS_NO under KVM * Fix a documentation typo -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTJh5YACgkQaDWVMHDJ krAzAw/8DzjhAYEa7a1AodCBMNg8uNOPnLNoRPPNhaN5Iw6W3zXYDBDKT9PyjAIx RoIM0aHx/oY9nCpK441o25oCWAAyzk6E5/+q9hMa7B4aHUGKqiDUC6L9dC8UiiSN yvoBv4g7F81QnmyazwYI64S6vnbr4Cqe7K/mvVqQ/vbJiugD25zY8mflRV9YAuMk Oe7Ff/mCA+I/kqyKhJE3cf3qNhZ61FsFI886fOSvIE7g4THKqo5eGPpIQxR4mXiU Ri2JWffTaeHr2m0sAfFeLH4VTZxfAgBkNQUEWeG6f2kDGTEKibXFRsU4+zxjn3gl xug+9jfnKN1ceKyNlVeJJZKAfr2TiyUtrlSE5d+subIRKKBaAGgnCQDasaFAluzd aZkOYz30PCebhN+KTrR84FySHCaxnev04jqdtVGAQEDbTvyNagFUdZFGhWijJShV l2l4A0gFSYJmPfPVuuAwOJnnZtA1sRH9oz/Sny3+z9BKloZh+Nc/+Cu9zC8SLjaU BF3Qv2gU9HKTJ+MSy2JrGS52cONfpO5ngFHoOMilZ1KBHrfSb1eiy32PDT+vK60Y PFEmI8SWl7bmrO1snVUCfGaHBsHJSu5KMqwBGmM4xSRzJpyvRe493xC7+nFvqNLY vFOFc4jGeusOXgiLPpfGduppkTGcM7sy75UMLwTSLcQbDK99mus= =ZAPY -----END PGP SIGNATURE----- Merge tag 'gds-for-linus-2023-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/gds fixes from Dave Hansen: "Mitigate Gather Data Sampling issue: - Add Base GDS mitigation - Support GDS_NO under KVM - Fix a documentation typo" * tag 'gds-for-linus-2023-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation/x86: Fix backwards on/off logic about YMM support KVM: Add GDS_NO support to KVM x86/speculation: Add Kconfig option for GDS x86/speculation: Add force option to GDS mitigation x86/speculation: Add Gather Data Sampling mitigation |
||
![]() |
2d63699099 |
KVM: x86: Always write vCPU's current TSC offset/ratio in vendor hooks
Drop the @offset and @multiplier params from the kvm_x86_ops hooks for propagating TSC offsets/multipliers into hardware, and instead have the vendor implementations pull the information directly from the vCPU structure. The respective vCPU fields _must_ be written at the same time in order to maintain consistent state, i.e. it's not random luck that the value passed in by all callers is grabbed from the vCPU. Explicitly grabbing the value from the vCPU field in SVM's implementation in particular will allow for additional cleanup without introducing even more subtle dependencies. Specifically, SVM can skip the WRMSR if guest state isn't loaded, i.e. svm_prepare_switch_to_guest() will load the correct value for the vCPU prior to entering the guest. This also reconciles KVM's handling of related values that are stored in the vCPU, as svm_write_tsc_offset() already assumes/requires the caller to have updated l1_tsc_offset. Link: https://lore.kernel.org/r/20230729011608.1065019-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
a2fd5d02ba |
KVM: x86: Snapshot host's MSR_IA32_ARCH_CAPABILITIES
Snapshot the host's MSR_IA32_ARCH_CAPABILITIES, if it's supported, instead of reading the MSR every time KVM wants to query the host state, e.g. when initializing the default value during vCPU creation. The paths that query ARCH_CAPABILITIES aren't particularly performance sensitive, but creating vCPUs is a frequent enough operation that burning 8 bytes is a good trade-off. Alternatively, KVM could add a field in kvm_caps and thus skip the on-demand calculations entirely, but a pure snapshot isn't possible due to the way KVM handles the l1tf_vmx_mitigation module param. And unlike the other "supported" fields in kvm_caps, KVM doesn't enforce the "supported" value, i.e. KVM treats ARCH_CAPABILITIES like a CPUID leaf and lets userspace advertise whatever it wants. Those problems are solvable, but it's not clear there is real benefit versus snapshotting the host value, and grabbing the host value will allow additional cleanup of KVM's FB_CLEAR_CTRL code. Link: https://lore.kernel.org/all/20230524061634.54141-2-chao.gao@intel.com Cc: Chao Gao <chao.gao@intel.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230607004311.1420507-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
7f717f5484 |
KVM: x86: Remove x86_emulate_ops::guest_has_long_mode
Remove x86_emulate_ops::guest_has_long_mode along with its implementation,
emulator_guest_has_long_mode(). It has been unused since commit
|
||
![]() |
0d033770d4 |
KVM: x86: Fix KVM_CAP_SYNC_REGS's sync_regs() TOCTOU issues
In a spirit of using a sledgehammer to crack a nut, make sync_regs() feed
__set_sregs() and kvm_vcpu_ioctl_x86_set_vcpu_events() with kernel's own
copy of data.
Both __set_sregs() and kvm_vcpu_ioctl_x86_set_vcpu_events() assume they
have exclusive rights to structs they operate on. While this is true when
coming from an ioctl handler (caller makes a local copy of user's data),
sync_regs() breaks this contract; a pointer to a user-modifiable memory
(vcpu->run->s.regs) is provided. This can lead to a situation when incoming
data is checked and/or sanitized only to be re-set by a user thread running
in parallel.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Fixes:
|
||
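A minimal model of the check-then-use fix described in the entry above: snapshot state that userspace can modify concurrently into a kernel-local copy before validating and committing it. The structures and the validity check are invented:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct toy_sregs {
	uint64_t cr0;
	uint64_t cr4;
};

static bool toy_sregs_valid(const struct toy_sregs *s)
{
	return (s->cr0 >> 32) == 0;	/* stand-in validity check */
}

static int toy_set_sregs(const struct toy_sregs *shared_with_user)
{
	struct toy_sregs sregs;

	/* Snapshot first; the checks and the final commit all see the same
	 * values even if userspace rewrites the shared buffer in parallel. */
	memcpy(&sregs, shared_with_user, sizeof(sregs));

	if (!toy_sregs_valid(&sregs))
		return -1;

	printf("commit cr0=%#llx cr4=%#llx\n",
	       (unsigned long long)sregs.cr0, (unsigned long long)sregs.cr4);
	return 0;
}

int main(void)
{
	struct toy_sregs run_sregs = { .cr0 = 0x80000011ull, .cr4 = 0x20 };

	return toy_set_sregs(&run_sregs) ? 1 : 0;
}
```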
![]() |
26a0652cb4 |
KVM: x86: Disallow KVM_SET_SREGS{2} if incoming CR0 is invalid
Reject KVM_SET_SREGS{2} with -EINVAL if the incoming CR0 is invalid, e.g. due to setting bits 63:32, illegal combinations, or to a value that isn't allowed in VMX (non-)root mode. The VMX checks in particular are "fun" as failure to disallow Real Mode for an L2 that is configured with unrestricted guest disabled, when KVM itself has unrestricted guest enabled, will result in KVM forcing VM86 mode to virtual Real Mode for L2, but then fail to unwind the related metadata when synthesizing a nested VM-Exit back to L1 (which has unrestricted guest enabled). Opportunistically fix a benign typo in the prototype for is_valid_cr4(). Cc: stable@vger.kernel.org Reported-by: syzbot+5feef0b9ee9c8e9e5689@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000f316b705fdf6e2b4@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230613203037.1968489-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
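A sketch of the kind of CR0 validation described in the entry above (reserved high bits, NW without CD, PG without PE); KVM's real checks also cover the VMX (non-)root constraints not modeled here:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CR0_PE	(1ull << 0)
#define CR0_NW	(1ull << 29)
#define CR0_CD	(1ull << 30)
#define CR0_PG	(1ull << 31)

static bool toy_cr0_is_valid(uint64_t cr0)
{
	if (cr0 >> 32)				/* bits 63:32 are reserved */
		return false;
	if ((cr0 & CR0_NW) && !(cr0 & CR0_CD))	/* NW=1 requires CD=1 */
		return false;
	if ((cr0 & CR0_PG) && !(cr0 & CR0_PE))	/* paging requires protection */
		return false;
	return true;
}

int main(void)
{
	printf("%d %d\n",
	       toy_cr0_is_valid(CR0_PE | CR0_PG),		/* 1: accepted */
	       toy_cr0_is_valid(0x100000000ull | CR0_PE));	/* 0: -EINVAL */
	return 0;
}
```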
![]() |
3f2739bd1e |
KVM: x86: Acquire SRCU read lock when handling fastpath MSR writes
Temporarily acquire kvm->srcu for read when potentially emulating WRMSR in the VM-Exit fastpath handler, as several of the common helpers used during emulation expect the caller to provide SRCU protection. E.g. if the guest is counting instructions retired, KVM will query the PMU event filter when stepping over the WRMSR. dump_stack+0x85/0xdf lockdep_rcu_suspicious+0x109/0x120 pmc_event_is_allowed+0x165/0x170 kvm_pmu_trigger_event+0xa5/0x190 handle_fastpath_set_msr_irqoff+0xca/0x1e0 svm_vcpu_run+0x5c3/0x7b0 [kvm_amd] vcpu_enter_guest+0x2108/0x2580 Alternatively, check_pmu_event_filter() could acquire kvm->srcu, but this isn't the first bug of this nature, e.g. see commit |
||
![]() |
5e1fe4a21c |
KVM: x86/irq: Conditionally register IRQ bypass consumer again
As was attempted in commit
|
||
![]() |
bf672720e8 |
KVM: x86: check the kvm_cpu_get_interrupt result before using it
The code was blindly assuming that kvm_cpu_get_interrupt never returns -1 when there is a pending interrupt. While this should be true, a bug in KVM can still cause this. If -1 is returned, the code before this patch was converting it to 0xFF, and 0xFF interrupt was injected to the guest, which results in an issue which was hard to debug. Add WARN_ON_ONCE to catch this case and skip the injection if this happens again. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230726135945.260841-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
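A toy model of the defensive check described in the entry above; the helper simulating the buggy "-1 despite a pending interrupt" case is invented:

```c
#include <stdio.h>

static int toy_get_pending_vector(void)
{
	return -1;	/* simulate the buggy "nothing pending" case */
}

static void toy_inject_pending_interrupt(void)
{
	int vector = toy_get_pending_vector();

	/* The kernel change uses WARN_ON_ONCE(vector < 0) and bails out
	 * instead of blindly injecting a bogus 0xFF vector. */
	if (vector < 0) {
		fprintf(stderr, "WARN: no pending interrupt, skipping injection\n");
		return;
	}

	printf("inject vector %#x\n", (unsigned int)vector);
}

int main(void)
{
	toy_inject_pending_interrupt();
	return 0;
}
```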
![]() |
81ac7e5d74 |
KVM: Add GDS_NO support to KVM
Gather Data Sampling (GDS) is a transient execution attack using gather instructions from the AVX2 and AVX512 extensions. This attack allows malicious code to infer data that was previously stored in vector registers. Systems that are not vulnerable to GDS will set the GDS_NO bit of the IA32_ARCH_CAPABILITIES MSR. This is useful for VM guests that may think they are on vulnerable systems that are, in fact, not affected. Guests that are running on affected hosts where the mitigation is enabled are protected as if they were running on an unaffected system. On all hosts that are not affected or that are mitigated, set the GDS_NO bit. Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> |
||
![]() |
e8069f5a8e |
ARM64:
* Eager page splitting optimization for dirty logging, optionally allowing for a VM to avoid the cost of hugepage splitting in the stage-2 fault path. * Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact with services that live in the Secure world. pKVM intervenes on FF-A calls to guarantee the host doesn't misuse memory donated to the hyp or a pKVM guest. * Support for running the split hypervisor with VHE enabled, known as 'hVHE' mode. This is extremely useful for testing the split hypervisor on VHE-only systems, and paves the way for new use cases that depend on having two TTBRs available at EL2. * Generalized framework for configurable ID registers from userspace. KVM/arm64 currently prevents arbitrary CPU feature set configuration from userspace, but the intent is to relax this limitation and allow userspace to select a feature set consistent with the CPU. * Enable the use of Branch Target Identification (FEAT_BTI) in the hypervisor. * Use a separate set of pointer authentication keys for the hypervisor when running in protected mode, as the host is untrusted at runtime. * Ensure timer IRQs are consistently released in the init failure paths. * Avoid trapping CTR_EL0 on systems with Enhanced Virtualization Traps (FEAT_EVT), as it is a register commonly read from userspace. * Erratum workaround for the upcoming AmpereOne part, which has broken hardware A/D state management. RISC-V: * Redirect AMO load/store misaligned traps to KVM guest * Trap-n-emulate AIA in-kernel irqchip for KVM guest * Svnapot support for KVM Guest s390: * New uvdevice secret API * CMM selftest and fixes * fix racy access to target CPU for diag 9c x86: * Fix missing/incorrect #GP checks on ENCLS * Use standard mmu_notifier hooks for handling APIC access page * Drop now unnecessary TR/TSS load after VM-Exit on AMD * Print more descriptive information about the status of SEV and SEV-ES during module load * Add a test for splitting and reconstituting hugepages during and after dirty logging * Add support for CPU pinning in demand paging test * Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes included along the way * Add a "nx_huge_pages=never" option to effectively avoid creating NX hugepage recovery threads (because nx_huge_pages=off can be toggled at runtime) * Move handling of PAT out of MTRR code and dedup SVM+VMX code * Fix output of PIC poll command emulation when there's an interrupt * Add a maintainer's handbook to document KVM x86 processes, preferred coding style, testing expectations, etc. * Misc cleanups, fixes and comments Generic: * Miscellaneous bugfixes and cleanups Selftests: * Generate dependency files so that partial rebuilds work as expected -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmSgHrIUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroORcAf+KkBlXwQMf+Q0Hy6Mfe0OtkKmh0Ae 6HJ6dsuMfOHhWv5kgukh+qvuGUGzHq+gpVKmZg2yP3h3cLHOLUAYMCDm+rjXyjsk F4DbnJLfxq43Pe9PHRKFxxSecRcRYCNox0GD5UYL4PLKcH0FyfQrV+HVBK+GI8L3 FDzUcyJkR12Lcj1qf++7fsbzfOshL0AJPmidQCoc6wkLJpUEr/nYUqlI1Kx3YNuQ LKmxFHS4l4/O/px3GKNDrLWDbrVlwciGIa3GZLS52PZdW3mAqT+cqcPcYK6SW71P m1vE80VbNELX5q3YSRoOXtedoZ3Pk97LEmz/xQAsJ/jri0Z5Syk0Ok0m/Q== =AMXp -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "ARM64: - Eager page splitting optimization for dirty logging, optionally allowing for a VM to avoid the cost of hugepage splitting in the stage-2 fault path. 
- Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact with services that live in the Secure world. pKVM intervenes on FF-A calls to guarantee the host doesn't misuse memory donated to the hyp or a pKVM guest. - Support for running the split hypervisor with VHE enabled, known as 'hVHE' mode. This is extremely useful for testing the split hypervisor on VHE-only systems, and paves the way for new use cases that depend on having two TTBRs available at EL2. - Generalized framework for configurable ID registers from userspace. KVM/arm64 currently prevents arbitrary CPU feature set configuration from userspace, but the intent is to relax this limitation and allow userspace to select a feature set consistent with the CPU. - Enable the use of Branch Target Identification (FEAT_BTI) in the hypervisor. - Use a separate set of pointer authentication keys for the hypervisor when running in protected mode, as the host is untrusted at runtime. - Ensure timer IRQs are consistently released in the init failure paths. - Avoid trapping CTR_EL0 on systems with Enhanced Virtualization Traps (FEAT_EVT), as it is a register commonly read from userspace. - Erratum workaround for the upcoming AmpereOne part, which has broken hardware A/D state management. RISC-V: - Redirect AMO load/store misaligned traps to KVM guest - Trap-n-emulate AIA in-kernel irqchip for KVM guest - Svnapot support for KVM Guest s390: - New uvdevice secret API - CMM selftest and fixes - fix racy access to target CPU for diag 9c x86: - Fix missing/incorrect #GP checks on ENCLS - Use standard mmu_notifier hooks for handling APIC access page - Drop now unnecessary TR/TSS load after VM-Exit on AMD - Print more descriptive information about the status of SEV and SEV-ES during module load - Add a test for splitting and reconstituting hugepages during and after dirty logging - Add support for CPU pinning in demand paging test - Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes included along the way - Add a "nx_huge_pages=never" option to effectively avoid creating NX hugepage recovery threads (because nx_huge_pages=off can be toggled at runtime) - Move handling of PAT out of MTRR code and dedup SVM+VMX code - Fix output of PIC poll command emulation when there's an interrupt - Add a maintainer's handbook to document KVM x86 processes, preferred coding style, testing expectations, etc. 
- Misc cleanups, fixes and comments Generic: - Miscellaneous bugfixes and cleanups Selftests: - Generate dependency files so that partial rebuilds work as expected" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (153 commits) Documentation/process: Add a maintainer handbook for KVM x86 Documentation/process: Add a label for the tip tree handbook's coding style KVM: arm64: Fix misuse of KVM_ARM_VCPU_POWER_OFF bit index RISC-V: KVM: Remove unneeded semicolon RISC-V: KVM: Allow Svnapot extension for Guest/VM riscv: kvm: define vcpu_sbi_ext_pmu in header RISC-V: KVM: Expose IMSIC registers as attributes of AIA irqchip RISC-V: KVM: Add in-kernel virtualization of AIA IMSIC RISC-V: KVM: Expose APLIC registers as attributes of AIA irqchip RISC-V: KVM: Add in-kernel emulation of AIA APLIC RISC-V: KVM: Implement device interface for AIA irqchip RISC-V: KVM: Skeletal in-kernel AIA irqchip support RISC-V: KVM: Set kvm_riscv_aia_nr_hgei to zero RISC-V: KVM: Add APLIC related defines RISC-V: KVM: Add IMSIC related defines RISC-V: KVM: Implement guest external interrupt line management KVM: x86: Remove PRIx* definitions as they are solely for user space s390/uv: Update query for secret-UVCs s390/uv: replace scnprintf with sysfs_emit s390/uvdevice: Add 'Lock Secret Store' UVC ... |
||
![]() |
255006adb3 |
KVM VMX changes for 6.5:
- Fix missing/incorrect #GP checks on ENCLS - Use standard mmu_notifier hooks for handling APIC access page - Misc cleanups -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaLDYSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5ovYP/ib86UG9QXwoEKx0mIyLQ5q1jD+StvxH 18SIH62+MXAtmz2E+EmXIySW76diOKCngApJ11WTERPwpZYEpcITh2D2Jp/vwgk5 xUPK+WKYQs1SGpJu3wXhLE1u6mB7X9p7EaXRSKG67P7YK09gTaOik1/3h6oNrGO+ KI06reCQN1PstKTfrZXxYpRlfDc761YaAmSZ79Bg+bK9PisFqme7TJ2mAqNZPFPd E7ho/UOEyWRSyd5VMsuOUB760pMQ9edKrs+38xNDp5N+0Fh0ItTjuAcd2KVWMZyW Fk+CJq4kCqTlEik5OwcEHsTGJGBFscGPSO+T0YtVfSZDdtN/rHN7l8RGquOebVTG Ldm5bg4agu4lXsqqzMxn8J9SkbNg3xno79mMSc2185jS2HLt5Hu6PzQnQ2tEtHJQ IuovmssHOVKDoYODOg0tq8UMydgT3hAvC7YJCouubCjxUUw+22nhN3EDuAhbJhtT DgQNGT7GmsrKIWLEjbm6EpLLOdJdB7/U1MrEshLS015a/DUz4b3ZGYApneifJL8h nGE2Wu+36xGUVNLgDMdvd+R17WdyQa+f+9KjUGy71KelFV4vI4A3JwvH0aIsTyHZ LGlQBZqelc66GYwMiqVC0GYGRtrdgygQopfstvZJ3rYiHZV/mdhB5A0T4J2Xvh2Q bnDNzsSFdsH5 =PjYj -----END PGP SIGNATURE----- Merge tag 'kvm-x86-vmx-6.5' of https://github.com/kvm-x86/linux into HEAD KVM VMX changes for 6.5: - Fix missing/incorrect #GP checks on ENCLS - Use standard mmu_notifier hooks for handling APIC access page - Misc cleanups |
||
![]() |
751d77fefa |
KVM x86/pmu changes for 6.5:
- Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes included along the way -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaHFgSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5twMP/15ZJFqZVigVQoATJeeR9tWUuyJe95xM lyfnTel91Sg8XOamdwBGi7jLpaDgj34Jm0cfM7/4LbJk2/taeaCLYmJd5w9FXvaw EkytQGO85hVNe2XuY+h+XxSIxpflKxgFuUnOwcDk2QbKgASzNSG/mJ9ZBx8PNVXD FnyOqpbbYDFspWWvUOAI/RkHnr/dALjXJsSUMvuh3nz5e1NTyubjCAZg+/bse2nR s8FrcSh4B0Lg0h4r2fdJ4sAiM/qWhcCIhq5svyTAcUG0T4rMS40LrosJOw3wkBRM dyZYXy6GEENeCFJPhenF1mTE1embFyZp89PV/FCNRZXODbnM4kheJFT9gucAjlKi ZafRcutrkYIVf4lZCMofDfQGLX/GCEJnwUPKyGygIsPoDRrdR7OLrFycON5bxocr 9NBNG+2teQFbnt5irB/bBGojtIZtu3OEylkuRjQUQ3lJYQ5r6LddarI9acIu1SHt 4rRfh8QN5qmMvVblaQzggOr6BPtmPr8QqMEMFncaUMCsV/82hRAEfvj2rifGFJNo Axz1ajMfirxyM45WzredUkzzsbphiiegPBELCLRZfHmaEhJ8P7t7wvri0bXt9YdI vjSfX+6ulOgDC+xAazE0gEJO4Uh5+g3Y+1e0fr43ltWzUOWdCQskzD3LE9DkqIXj KAaCuHYbYpIZ =MwqV -----END PGP SIGNATURE----- Merge tag 'kvm-x86-pmu-6.5' of https://github.com/kvm-x86/linux into HEAD KVM x86/pmu changes for 6.5: - Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes included along the way |
||
![]() |
36b68d360a |
KVM x86 changes for 6.5:
- Move handling of PAT out of MTRR code and dedup SVM+VMX code - Fix output of PIC poll command emulation when there's an interrupt - Fix a longstanding bug in the reporting of the number of entries returned by KVM_GET_CPUID2 - Add a maintainer's handbook to document KVM x86 processes, preferred coding style, testing expectations, etc. - Misc cleanups -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaGMMSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5iDIP/0PwY3J5odTEUTnAyuDFPimd5PBt9k/O B414wdpSKVgzq+0An4qM9mKRnklVIh2p8QqQTvDhcBUg3xb6CX9xZ4ery7hp/T5O tr5bAXs2AYX6jpxvsopt+w+E9j6fvkJhcJCRU9im3QbrqwUE+ecyU5OHvmv2n/GO syVZJbPOYuoLPKDjlSMrScE6fWEl9UOvHc5BK/vafTeyisMG3vv1BSmJj6GuiNNk TS1RRIg//cOZghQyDfdXt0azTmakNZyNn35xnoX9x8SRmdRykyUjQeHmeqWxPDso kiGO+CGancfS57S6ZtCkJjqEWZ1o/zKdOxr8MMf/3nJhv4kY7/5XtlVoACv5soW9 bZEmNiXIaSbvKNMwAlLJxHFbLa1sMdSCb345CIuMdt5QiWJ53ZiTyIAJX6+eL+Zf 8nkeekgPf5VUs6Zt0RdRPyvo+W7Vp9BtI87yDXm1nQKpbys2pt6CD3YB/oF4QViG a5cyGoFuqRQbS3nmbshIlR7EanTuxbhLZKrNrFnolZ5e624h3Cnk2hVsfTznVGiX vNHWM80phk1CWB9McErrZVkGfjlyVyBL13CBB2XF7Dl6PfF6/N22a9bOuTJD3tvk PlNx4hvZm3esvvyGpjfbSajTKYE8O7rxiE1KrF0BpZ5IUl5WSiTr6XCy/yI/mIeM hay2IWhPOF2z =D0BH -----END PGP SIGNATURE----- Merge tag 'kvm-x86-misc-6.5' of https://github.com/kvm-x86/linux into HEAD KVM x86 changes for 6.5: * Move handling of PAT out of MTRR code and dedup SVM+VMX code * Fix output of PIC poll command emulation when there's an interrupt * Add a maintainer's handbook to document KVM x86 processes, preferred coding style, testing expectations, etc. * Misc cleanups |
||
![]() |
bc6cb4d5bc |
Locking changes for v6.5:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double(). The cmpxchg128() family of functions is basically & functionally the same as cmpxchg_double(), but with a saner interface: instead of a 6-parameter horror that forced u128 - u64/u64-halves layout details on the interface and exposed users to complexity, fragility & bugs, use a natural 3-parameter interface with u128 types. - Restructure the generated atomic headers, and add kerneldoc comments for all of the generic atomic{,64,_long}_t operations. Generated definitions are much cleaner now, and come with documentation. - Implement lock_set_cmp_fn() on lockdep, for defining an ordering when taking multiple locks of the same type. This gets rid of one use of lockdep_set_novalidate_class() in the bcache code. - Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable shadowing generating garbage code on Clang on certain ARM builds. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSav3wRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gDyxAAjCHQjpolrre7fRpyiTDwqzIKT27H04vQ zrQVlVc42WBnn9pe8LthGy43/RvYvqlZvLoLONA4fMkuYriM6nSMsoZjeUmE+6Rs QAElQC74P5YvEBOa67VNY3/M7sj22ftDe7ODtVV8OrnPjMk1sQNRvaK025Cs3yig 8MAI//hHGNmyVAp1dPYZMJNqxGCvluReLZ4SaUJFCMrg7YgUXgCBj/5Gi07TlKxn sT8BFCssoEW/B9FXkh59B1t6FBCZoSy4XSZfsZe0uVAUJ4XDEOO+zBgaWFCedNQT wP323ryBgMrkzUKA8j2/o5d3QnMA1GcBfHNNlvAl/fOfrxWXzDZnOEY26YcaLMa0 YIuRF/JNbPZlt6DCUVBUEvMPpfNYi18dFN0rat1a6xL2L4w+tm55y3mFtSsg76Ka r7L2nWlRrAGXnuA+VEPqkqbSWRUSWOv5hT2Mcyb5BqqZRsxBETn6G8GVAzIO6j6v giyfUdA8Z9wmMZ7NtB6usxe3p1lXtnZ/shCE7ZHXm6xstyZrSXaHgOSgAnB9DcuJ 7KpGIhhSODQSwC/h/J0KEpb9Pr/5jCWmXAQ2DWnZK6ndt1jUfFi8pfK58wm0AuAM o9t8Mx3o8wZjbMdt6up9OIM1HyFiMx2BSaZK+8f/bWemHQ0xwez5g4k5O5AwVOaC x9Nt+Tp0Ze4= =DsYj -----END PGP SIGNATURE----- Merge tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() The cmpxchg128() family of functions is basically & functionally the same as cmpxchg_double(), but with a saner interface. Instead of a 6-parameter horror that forced u128 - u64/u64-halves layout details on the interface and exposed users to complexity, fragility & bugs, use a natural 3-parameter interface with u128 types. - Restructure the generated atomic headers, and add kerneldoc comments for all of the generic atomic{,64,_long}_t operations. The generated definitions are much cleaner now, and come with documentation. - Implement lock_set_cmp_fn() on lockdep, for defining an ordering when taking multiple locks of the same type. This gets rid of one use of lockdep_set_novalidate_class() in the bcache code. - Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable shadowing generating garbage code on Clang on certain ARM builds. 
* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits) locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg() locking/atomic: treewide: delete arch_atomic_*() kerneldoc locking/atomic: docs: Add atomic operations to the driver basic API documentation locking/atomic: scripts: generate kerneldoc comments docs: scripts: kernel-doc: accept bitwise negation like ~@var locking/atomic: scripts: simplify raw_atomic*() definitions locking/atomic: scripts: simplify raw_atomic_long*() definitions locking/atomic: scripts: split pfx/name/sfx/order locking/atomic: scripts: restructure fallback ifdeffery locking/atomic: scripts: build raw_atomic_long*() directly locking/atomic: treewide: use raw_atomic*_<op>() locking/atomic: scripts: add trivial raw_atomic*_<op>() locking/atomic: scripts: factor out order template generation locking/atomic: scripts: remove leftover "${mult}" locking/atomic: scripts: remove bogus order parameter locking/atomic: xtensa: add preprocessor symbols locking/atomic: x86: add preprocessor symbols locking/atomic: sparc: add preprocessor symbols locking/atomic: sh: add preprocessor symbols ... |
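To make the cmpxchg128() interface change called out in the locking pull above concrete, here is a minimal, standalone sketch of the "one pointer, one old value, one new value" shape that cmpxchg128()/try_cmpxchg128() adopt. It models the idea with a compiler builtin on a 128-bit integer rather than the kernel's actual helpers (which are not usable from userspace); compile with -mcx16 on x86-64, possibly linking libatomic.

```c
#include <stdbool.h>
#include <stdio.h>

typedef unsigned __int128 u128;

/* Illustrative stand-in for try_cmpxchg128(ptr, &old, new): one 128-bit
 * value is compared and swapped as a single unit, instead of
 * cmpxchg_double()'s six parameters describing two 64-bit halves. */
static bool try_cmpxchg128_demo(u128 *ptr, u128 *old, u128 new)
{
	return __atomic_compare_exchange_n(ptr, old, new, false,
					   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

int main(void)
{
	u128 val = ((u128)1 << 64) | 2;		/* "high:low" packed into one u128 */
	u128 expected = val;
	u128 desired = ((u128)3 << 64) | 4;

	if (try_cmpxchg128_demo(&val, &expected, desired))
		printf("swapped, low half is now %llu\n",
		       (unsigned long long)val);	/* prints 4 */
	return 0;
}
```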
||
![]() |
ed3b7923a8 |
Scheduler changes for v6.5:
- Scheduler SMP load-balancer improvements: - Avoid unnecessary migrations within SMT domains on hybrid systems. Problem: On hybrid CPU systems, (processors with a mixture of higher-frequency SMT cores and lower-frequency non-SMT cores), under the old code lower-priority CPUs pulled tasks from the higher-priority cores if more than one SMT sibling was busy - resulting in many unnecessary task migrations. Solution: The new code improves the load balancer to recognize SMT cores with more than one busy sibling and allows lower-priority CPUs to pull tasks, which avoids superfluous migrations and lets lower-priority cores inspect all SMT siblings for the busiest queue. - Implement the 'runnable boosting' feature in the EAS balancer: consider CPU contention in frequency, EAS max util & load-balance busiest CPU selection. This improves CPU utilization for certain workloads, while leaving other key workloads unchanged. - Scheduler infrastructure improvements: - Rewrite the scheduler topology setup code by consolidating it into the build_sched_topology() helper function and building it dynamically on the fly. - Resolve the local_clock() vs. noinstr complications by rewriting the code: provide separate sched_clock_noinstr() and local_clock_noinstr() functions to be used in instrumentation code, and make sure it is all instrumentation-safe. - Fixes: - Fix a kthread_park() race with wait_woken() - Fix misc wait_task_inactive() bugs unearthed by the -rt merge: - Fix UP PREEMPT bug by unifying the SMP and UP implementations. - Fix task_struct::saved_state handling. - Fix various rq clock update bugs, unearthed by turning on the rq clock debugging code. - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger by creating enough cgroups, by removing the warning and restricting window size triggers to PSI file write-permission or CAP_SYS_RESOURCE. - Propagate SMT flags in the topology when removing degenerate domain - Fix grub_reclaim() calculation bug in the deadline scheduler code - Avoid resetting the min update period when it is unnecessary, in psi_trigger_destroy(). - Don't balance a task to its current running CPU in load_balance(), which was possible on certain NUMA topologies with overlapping groups. - Fix the sched-debug printing of rq->nr_uninterruptible - Cleanups: - Address various -Wmissing-prototype warnings, as a preparation to (maybe) enable this warning in the future. 
- Remove unused code - Mark more functions __init - Fix shadow-variable warnings Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSatWQRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1j62xAAuGOx1LcDfRGC6WGQzp1zOdlsVQtnDvlS qL58zYSHgizprpVQ3j87SBaG4CHCdvd2Bo36yW0lNZS4nd203qdq7fkrMb3hPP/w egUQUzMegf5fF6BWldKeMjuHSt+twFQz/ZAKK8iSbAir6CHNAqbNst1oL0i/+Tyk o33hBs1hT5tnbFb1NSVZkX4k+qT3LzTW4K2QgjjGtkScr6yHh2BdEVefyigWOjdo 9s02d00ll9a2r+F5txlN7Dnw6TN7rmTXGMOJU5bZvBE90/anNiAorMXHJdEKCyUR u9+JtBdJWiCplGa/tSRcxT16ZW1VdtTnd9q66TDhXREd2UNDFqBEyg5Wl77K4Tlf vKFajmj/to+cTbuv6m6TVR+zyXpdEpdL6F04P44U3qiJvDobBqeDNKHHIqpmbHXl AXUXcPWTVAzXX1Ce5M+BeAgTBQ1T7C5tELILrTNQHJvO1s9VVBRFZ/l65Ps4vu7T wIZ781IFuopk0zWqHovNvgKrJ7oFmOQQZFttQEe8n6nafkjI7u+IZ8FayiGaUMRr 4GawFGUCEdYh8z9qyslGKe8Q/Rphfk6hxMFRYUJpDmubQ0PkMeDjDGq77jDGl1PF VqwSDEyOaBJs7Gqf/mem00JtzBmXhkhm1SEjggHMI2IQbr/eeBXoLQOn3CDapO/N PiDbtX760ic= =EWQA -----END PGP SIGNATURE----- Merge tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Scheduler SMP load-balancer improvements: - Avoid unnecessary migrations within SMT domains on hybrid systems. Problem: On hybrid CPU systems, (processors with a mixture of higher-frequency SMT cores and lower-frequency non-SMT cores), under the old code lower-priority CPUs pulled tasks from the higher-priority cores if more than one SMT sibling was busy - resulting in many unnecessary task migrations. Solution: The new code improves the load balancer to recognize SMT cores with more than one busy sibling and allows lower-priority CPUs to pull tasks, which avoids superfluous migrations and lets lower-priority cores inspect all SMT siblings for the busiest queue. - Implement the 'runnable boosting' feature in the EAS balancer: consider CPU contention in frequency, EAS max util & load-balance busiest CPU selection. This improves CPU utilization for certain workloads, while leaves other key workloads unchanged. Scheduler infrastructure improvements: - Rewrite the scheduler topology setup code by consolidating it into the build_sched_topology() helper function and building it dynamically on the fly. - Resolve the local_clock() vs. noinstr complications by rewriting the code: provide separate sched_clock_noinstr() and local_clock_noinstr() functions to be used in instrumentation code, and make sure it is all instrumentation-safe. Fixes: - Fix a kthread_park() race with wait_woken() - Fix misc wait_task_inactive() bugs unearthed by the -rt merge: - Fix UP PREEMPT bug by unifying the SMP and UP implementations - Fix task_struct::saved_state handling - Fix various rq clock update bugs, unearthed by turning on the rq clock debugging code. - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger by creating enough cgroups, by removing the warnign and restricting window size triggers to PSI file write-permission or CAP_SYS_RESOURCE. - Propagate SMT flags in the topology when removing degenerate domain - Fix grub_reclaim() calculation bug in the deadline scheduler code - Avoid resetting the min update period when it is unnecessary, in psi_trigger_destroy(). - Don't balance a task to its current running CPU in load_balance(), which was possible on certain NUMA topologies with overlapping groups. - Fix the sched-debug printing of rq->nr_uninterruptible Cleanups: - Address various -Wmissing-prototype warnings, as a preparation to (maybe) enable this warning in the future. 
- Remove unused code - Mark more functions __init - Fix shadow-variable warnings" * tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle() sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop() sched/core: Fixed missing rq clock update before calling set_rq_offline() sched/deadline: Update GRUB description in the documentation sched/deadline: Fix bandwidth reclaim equation in GRUB sched/wait: Fix a kthread_park race with wait_woken() sched/topology: Mark set_sched_topology() __init sched/fair: Rename variable cpu_util eff_util arm64/arch_timer: Fix MMIO byteswap sched/fair, cpufreq: Introduce 'runnable boosting' sched/fair: Refactor CPU utilization functions cpuidle: Use local_clock_noinstr() sched/clock: Provide local_clock_noinstr() x86/tsc: Provide sched_clock_noinstr() clocksource: hyper-v: Provide noinstr sched_clock() clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX x86/vdso: Fix gettimeofday masking math64: Always inline u128 version of mul_u64_u64_shr() s390/time: Provide sched_clock_noinstr() loongarch: Provide noinstr sched_clock_read() ... |
||
![]() |
a306425708 |
KVM: x86: Update comments about MSR lists exposed to userspace
Refresh comments about msrs_to_save, emulated_msrs, and msr_based_features
to remove stale references left behind by commit
|
||
![]() |
4a2771895c |
KVM: x86/svm/pmu: Add AMD PerfMonV2 support
If AMD Performance Monitoring Version 2 (PerfMonV2) is detected by the guest, it can use a new scheme to manage the Core PMCs using the new global control and status registers. In addition to benefiting from the PerfMonV2 functionality in the same way as the host (higher precision), the guest can also reduce the number of vm-exits by lowering the total number of MSR accesses. In terms of implementation details, amd_is_valid_msr() is resurrected since three newly added MSRs could not be mapped to one vPMC. The possibility of emulating PerfMonV2 on the mainframe has also been eliminated for reasons of precision. Co-developed-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: drop "Based on the observed HW." comments] Link: https://lore.kernel.org/r/20230603011058.1038821-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
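As a rough illustration (not KVM code) of why the global registers reduce MSR traffic: a PerfMonV2-aware guest can gate and acknowledge all counters through the global control/status MSRs. The MSR macro names below are the ones defined in the kernel's msr-index.h; the enable mask and the surrounding function are hypothetical.

```c
/* Hedged sketch of a PerfMonV2-aware guest's counter handling: one write
 * to the global control MSR starts/stops every core PMC, and one
 * read/write pair on the global status MSRs finds and clears overflows,
 * instead of touching each counter's enable/status MSRs individually. */
static void guest_pmu_rearm(u64 enable_mask)
{
	u64 status;

	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);		/* stop all PMCs */
	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, status);	/* which overflowed? */
	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, status);	/* acknowledge them */
	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, enable_mask);	/* restart in one MSR write */
}
```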
||
![]() |
e12fa4b92a |
KVM: x86: Clean up: remove redundant bool conversions
As test_bit() returns bool, explicitly converting result to bool is unnecessary. Get rid of '!!'. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230605200158.118109-1-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com> |
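For reference, the pattern being cleaned up looks like the self-contained sketch below; the helper is a hypothetical stand-in for the kernel's test_bit(), which likewise already returns bool, so the '!!' double negation changes nothing.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's test_bit(); it already returns bool. */
static bool test_bit_demo(int nr, const unsigned long *addr)
{
	return (addr[nr / (8 * sizeof(long))] >> (nr % (8 * sizeof(long)))) & 1UL;
}

int main(void)
{
	unsigned long bitmap[1] = { 1UL << 5 };

	bool before = !!test_bit_demo(5, bitmap);	/* redundant '!!' */
	bool after  = test_bit_demo(5, bitmap);		/* same result, cleaner */

	printf("%d %d\n", before, after);
	return 0;
}
```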
||
![]() |
056b9919a1 |
KVM: x86: Use cpu_feature_enabled() for PKU instead of #ifdef
Replace an #ifdef on CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS with a cpu_feature_enabled() check on X86_FEATURE_PKU. The macro magic of DISABLED_MASK_BIT_SET() means that cpu_feature_enabled() provides the same end result (no code generated) when PKU is disabled by Kconfig. No functional change intended. Cc: Jon Kohler <jon@nutanix.com> Reviewed-by: Jon Kohler <jon@nutanix.com> Link: https://lore.kernel.org/r/20230602010550.785722-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
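A hedged before/after sketch of the pattern described; update_guest_pkru() and its call site are hypothetical, while cpu_feature_enabled(), X86_FEATURE_PKU and the DISABLED_MASK machinery are the real kernel pieces.

```c
/* Before: purely compile-time gating. */
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
	update_guest_pkru(vcpu);
#endif

/* After: cpu_feature_enabled(X86_FEATURE_PKU) folds to a constant 'false'
 * when the Kconfig option is off (via DISABLED_MASK_BIT_SET), so the call
 * is still compiled away, and it additionally reflects runtime CPU support. */
	if (cpu_feature_enabled(X86_FEATURE_PKU))
		update_guest_pkru(vcpu);
```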
||
![]() |
0a8a5f2c8c |
KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access page
Now that KVM honors past and in-progress mmu_notifier invalidations when reloading the APIC-access page, use KVM's "standard" invalidation hooks to trigger a reload and delete the one-off usage of invalidate_range(). Aside from eliminating one-off code in KVM, dropping KVM's use of invalidate_range() will allow common mmu_notifier to redefine the API to be more strictly focused on invalidating secondary TLBs that share the primary MMU's page tables. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230602011518.787006-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
02f1b0b736 |
KVM: x86: Correct the name for skipping VMENTER l1d flush
There is no VMENTER_L1D_FLUSH_NESTED_VM. It should be ARCH_CAP_SKIP_VMENTRY_L1DFLUSH. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230524061634.54141-3-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
9397fa2ea3 |
clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX
Currently hv_read_tsc_page_tsc() (ab)uses the (valid) time value of U64_MAX as an error return. This breaks the clean wrap-around of the clock. Modify the function signature to return a boolean state and provide another u64 pointer to store the actual time on success. This obviates the need to steal one time value and restores the full counter width. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.775630881@infradead.org |
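The interface change can be summarized with a generic, self-contained sketch (the names are hypothetical stand-ins, not the Hyper-V code): stealing U64_MAX as an error value makes one legitimate counter reading unrepresentable and breaks clean wrap-around, whereas a boolean return plus an out-parameter keeps the full 64-bit range usable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hardware read that can fail. */
static bool read_counter(uint64_t *out)
{
	*out = 0;		/* pretend the counter reads as zero */
	return true;
}

/* Before (sketch): UINT64_MAX doubles as "error", so a genuine reading of
 * UINT64_MAX is indistinguishable from failure. */
static uint64_t read_time_old(void)
{
	uint64_t t;

	if (!read_counter(&t))
		return UINT64_MAX;
	return t;
}

/* After (sketch): success is reported separately, the value keeps its full
 * width, and callers can cleanly fall back to another clocksource. */
static bool read_time_new(uint64_t *cur)
{
	return read_counter(cur);
}

int main(void)
{
	uint64_t t;

	if (!read_time_new(&t))
		t = read_time_old();	/* old-style fallback, shown for contrast */
	return (int)(t & 0xff);
}
```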
||
![]() |
0f613bfa82 |
locking/atomic: treewide: use raw_atomic*_<op>()
Now that we have raw_atomic*_<op>() definitions, there's no need to use arch_atomic*_<op>() definitions outside of the low-level atomic definitions. Move treewide users of arch_atomic*_<op>() over to the equivalent raw_atomic*_<op>(). There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230605070124.3741859-19-mark.rutland@arm.com |
||
![]() |
8b703a49c9 |
KVM: x86: Account fastpath-only VM-Exits in vCPU stats
Increment vcpu->stat.exits when handling a fastpath VM-Exit without
going through any part of the "slow" path. Not bumping the exits stat
can result in wildly misleading exit counts, e.g. if the primary reason
the guest is exiting is to program the TSC deadline timer.
Fixes:
|
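A minimal sketch of the shape of the fix; the surrounding fastpath handler is hypothetical, and the point is only that the generic exit counter is bumped even when the slow path is skipped entirely.

```c
/* Hedged sketch: account the exit before re-entering the guest from the
 * fastpath, so fastpath-only exits (e.g. TSC deadline timer programming)
 * are no longer invisible in vcpu->stat.exits. */
static int handle_fastpath_example(struct kvm_vcpu *vcpu)
{
	++vcpu->stat.exits;	/* previously only bumped on the slow path */

	/* ...handle the WRMSR/timer fastpath and re-enter the guest... */
	return 1;
}
```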
||
![]() |
dee321977a |
KVM: x86: Move common handling of PAT MSR writes to kvm_set_msr_common()
Move the common check-and-set handling of PAT MSR writes out of vendor code and into kvm_set_msr_common(). This aligns writes with reads, which are already handled in common code, i.e. makes the handling of reads and writes symmetrical in common code. Alternatively, the common handling in kvm_get_msr_common() could be moved to vendor code, but duplicating code is generally undesirable (even though the duplicated code is trivial in this case), and guest writes to PAT should be rare, i.e. the overhead of the extra function call is a non-issue in practice. Suggested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
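The resulting common-code path ends up roughly like the following sketch; kvm_pat_valid() is KVM's existing validity helper, and the exact placement inside kvm_set_msr_common()'s switch is simplified here.

```c
	case MSR_IA32_CR_PAT:
		/* Reject reserved/invalid memory types with #GP, exactly as
		 * the vendor-specific paths used to do individually. */
		if (!kvm_pat_valid(data))
			return 1;

		vcpu->arch.pat = data;
		break;
```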
||
![]() |
bc7fe2f0b7 |
KVM: x86: Move PAT MSR handling out of mtrr.c
Drop handling of MSR_IA32_CR_PAT from mtrr.c now that SVM and VMX handle writes without bouncing through kvm_set_msr_common(). PAT isn't truly an MTRR even though it affects memory types, and more importantly KVM enables hardware virtualization of guest PAT (by NOT setting "ignore guest PAT") when a guest has non-coherent DMA, i.e. KVM doesn't need to zap SPTEs when the guest PAT changes. The read path is and always has been trivial, i.e. burying it in the MTRR code does more harm than good. WARN and continue for the PAT case in kvm_set_msr_common(), as that code is _currently_ reached if and only if KVM is buggy. Defer cleaning up the lack of symmetry between the read and write paths to a future patch. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
![]() |
34a83deac3 |
KVM: x86: Use MTRR macros to define possible MTRR MSR ranges
Use the MTRR macros to identify the ranges of possible MTRR MSRs instead of bounding the ranges with a mishmash of open coded values and unrelated MSR indices. Carving out the gap for the machine check MSRs in particular is confusing, as it's easy to incorrectly think the case statement handles MCE MSRs instead of skipping them. Drop the range-based funneling of MSRs between the end of the MCE MSRs and MTRR_DEF_TYPE, i.e. 0x2A0-0x2FF, and instead handle MTRR_DEF_TYPE as the one-off case that it is. Extract PAT (0x277) as well in anticipation of dropping PAT "handling" from the MTRR code. Keep the range-based handling for the variable+fixed MTRRs even though capturing unknown MSRs 0x214-0x24F is arguably "wrong". There is a gap in the fixed MTRRs, 0x260-0x267, i.e. the MTRR code needs to filter out unknown MSRs anyways, and using a single range generates marginally better code for the big switch statement. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
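In sketch form, the switch ends up bounding the MTRR cases with the canonical macros from the kernel headers rather than raw 0x2xx values (GCC case-range syntax, as already used in this switch; the surrounding statement is elided and simplified).

```c
	/* Hedged sketch: the fixed-range gap (0x260-0x267) is deliberately
	 * still swept up here and filtered out inside the MTRR code, as the
	 * message above explains. */
	case MTRRphysBase_MSR(0) ... MSR_MTRRfix4K_F8000:	/* variable + fixed MTRRs */
	case MSR_MTRRdefType:					/* one-off, no longer part of a raw range */
		return kvm_mtrr_set_msr(vcpu, msr, data);
```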
||
![]() |
b9846a698c |
KVM: VMX: add MSR_IA32_TSX_CTRL into msrs_to_save
Add MSR_IA32_TSX_CTRL into msrs_to_save[] to explicitly tell userspace to
save/restore the register value during migration. Missing this may cause
userspace that relies on the KVM ioctl(KVM_GET_MSR_INDEX_LIST) to fail to port the
value to the target VM.
In addition, there is no need to add MSR_IA32_TSX_CTRL when
ARCH_CAP_TSX_CTRL_MSR is not supported in kvm_get_arch_capabilities(). So
add the checking in kvm_probe_msr_to_save().
Fixes:
|
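In sketch form, the probe-time filtering mentioned above is a case in kvm_probe_msr_to_save() that bails out before the MSR is added to the reported list when the host does not enumerate TSX_CTRL; the exact control flow is simplified here.

```c
	case MSR_IA32_TSX_CTRL:
		/* Only report the MSR when ARCH_CAPABILITIES advertises it,
		 * mirroring what kvm_get_arch_capabilities() exposes. */
		if (!(kvm_get_arch_capabilities() & ARCH_CAP_TSX_CTRL_MSR))
			return;
		break;
```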
||
![]() |
c8c655c34e |
s390:
* More phys_to_virt conversions * Improvement of AP management for VSIE (nested virtualization) ARM64: * Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. * New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. * Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. * A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. * The usual selftest fixes and improvements. KVM x86 changes for 6.4: * Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled, and by giving the guest control of CR0.WP when EPT is enabled on VMX (VMX-only because SVM doesn't support per-bit controls) * Add CR0/CR4 helpers to query single bits, and clean up related code where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return as a bool * Move AMD_PSFD to cpufeatures.h and purge KVM's definition * Avoid unnecessary writes+flushes when the guest is only adding new PTEs * Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations when emulating invalidations * Clean up the range-based flushing APIs * Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle changed SPTE" overhead associated with writing the entire entry * Track the number of "tail" entries in a pte_list_desc to avoid having to walk (potentially) all descriptors during insertion and deletion, which gets quite expensive if the guest is spamming fork() * Disallow virtualizing legacy LBRs if architectural LBRs are available, the two are mutually exclusive in hardware * Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features * Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES * Apply PMU filters to emulated events and add test coverage to the pmu_event_filter selftest x86 AMD: * Add support for virtual NMIs * Fixes for edge cases related to virtual interrupts x86 Intel: * Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is not being reported due to userspace not opting in via prctl() * Fix a bug in emulation of ENCLS in compatibility mode * Allow emulation of NOP and PAUSE for L2 * AMX selftests improvements * Misc cleanups MIPS: * Constify MIPS's internal callbacks (a leftover from the hardware enabling rework that landed in 6.3) Generic: * Drop unnecessary casts from "void *" throughout kvm_main.c * Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct size by 8 bytes on 64-bit kernels by utilizing a padding hole Documentation: * Fix goof introduced by the conversion to rST -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmRNExkUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroNyjwf+MkzDael9y9AsOZoqhEZ5OsfQYJ32 Im5ZVYsPRU2K5TuoWql6meIihgclCj1iIU32qYHa2F1WYt2rZ72rJp+HoY8b+TaI WvF0pvNtqQyg3iEKUBKPA4xQ6mj7RpQBw86qqiCHmlfNt0zxluEGEPxH8xrWcfhC huDQ+NUOdU7fmJ3rqGitCvkUbCuZNkw3aNPR8dhU8RAWrwRzP2hBOmdxIeo81WWY 
XMEpJSijbGpXL9CvM0Jz9nOuMJwZwCCBGxg1vSQq0xTfLySNMxzvWZC2GFaBjucb j0UOQ7yE0drIZDVhd3sdNslubXXU6FcSEzacGQb9aigMUon3Tem9SHi7Kw== =S2Hq -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "s390: - More phys_to_virt conversions - Improvement of AP management for VSIE (nested virtualization) ARM64: - Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. - New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. - Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. - A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. - The usual selftest fixes and improvements. x86: - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled, and by giving the guest control of CR0.WP when EPT is enabled on VMX (VMX-only because SVM doesn't support per-bit controls) - Add CR0/CR4 helpers to query single bits, and clean up related code where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return as a bool - Move AMD_PSFD to cpufeatures.h and purge KVM's definition - Avoid unnecessary writes+flushes when the guest is only adding new PTEs - Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations when emulating invalidations - Clean up the range-based flushing APIs - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle changed SPTE" overhead associated with writing the entire entry - Track the number of "tail" entries in a pte_list_desc to avoid having to walk (potentially) all descriptors during insertion and deletion, which gets quite expensive if the guest is spamming fork() - Disallow virtualizing legacy LBRs if architectural LBRs are available, the two are mutually exclusive in hardware - Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features - Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES - Apply PMU filters to emulated events and add test coverage to the pmu_event_filter selftest - AMD SVM: - Add support for virtual NMIs - Fixes for edge cases related to virtual interrupts - Intel AMX: - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is not being reported due to userspace not opting in via prctl() - Fix a bug in emulation of ENCLS in compatibility mode - Allow emulation of NOP and PAUSE for L2 - AMX selftests improvements - Misc cleanups MIPS: - Constify MIPS's internal callbacks (a leftover from the hardware enabling rework that landed in 6.3) Generic: - Drop unnecessary casts from "void *" throughout kvm_main.c - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct size by 8 bytes on 64-bit kernels by utilizing a padding hole Documentation: - Fix goof introduced by the conversion to rST" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits) KVM: s390: pci: fix virtual-physical confusion on module 
unload/load KVM: s390: vsie: clarifications on setting the APCB KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init() KVM: selftests: Test the PMU event "Instructions retired" KVM: selftests: Copy full counter values from guest in PMU event filter test KVM: selftests: Use error codes to signal errors in PMU event filter test KVM: selftests: Print detailed info in PMU event filter asserts KVM: selftests: Add helpers for PMC asserts in PMU event filter test KVM: selftests: Add a common helper for the PMU event filter guest code KVM: selftests: Fix spelling mistake "perrmited" -> "permitted" KVM: arm64: vhe: Drop extra isb() on guest exit KVM: arm64: vhe: Synchronise with page table walker on MMU update KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc() KVM: arm64: nvhe: Synchronise with page table walker on TLBI KVM: arm64: Handle 32bit CNTPCTSS traps KVM: arm64: nvhe: Synchronise with page table walker on vcpu run KVM: arm64: vgic: Don't acquire its_lock before config_lock KVM: selftests: Add test to verify KVM's supported XCR0 ... |