2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

850 Commits

Author SHA1 Message Date
Ladi Prosek
72d7b374b1 KVM: x86: introduce ISA specific smi_allowed callback
Similar to NMI, there may be ISA specific reasons why an SMI cannot be
injected into the guest. This commit adds a new smi_allowed callback to
be implemented in following commits.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12 14:01:55 +02:00
Ladi Prosek
0234bf8852 KVM: x86: introduce ISA specific SMM entry/exit callbacks
Entering and exiting SMM may require ISA specific handling under certain
circumstances. This commit adds two new callbacks with empty implementations.
Actual functionality will be added in following commits.

* pre_enter_smm() is to be called when injecting an SMM, before any
  SMM related vcpu state has been changed
* pre_leave_smm() is to be called when emulating the RSM instruction,
  when the vcpu is in real mode and before any SMM related vcpu state
  has been restored

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12 14:01:55 +02:00
Thomas Gleixner
6406350583 x86/apic: Sanitize 32/64bit APIC callbacks
The 32bit and the 64bit implementation of default_cpu_present_to_apicid()
and default_check_phys_apicid_present() are exactly the same, but
implemented and located differently.

Move them to common apic code and get rid of the pointless difference.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Yu Chen <yu.c.chen@intel.com>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rui Zhang <rui.zhang@intel.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Len Brown <lenb@kernel.org>
Link: https://lkml.kernel.org/r/20170913213153.757329991@linutronix.de
2017-09-25 20:51:50 +02:00
Suravee Suthikulpanit
b2a05feff2 KVM: Add struct kvm_vcpu pointer parameter to get_enable_apicv()
Modify struct kvm_x86_ops.arch.apicv_active() to take struct kvm_vcpu
pointer as parameter in preparation to subsequent changes.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-09-13 18:29:06 +02:00
Joerg Roedel
98152b83e0 KVM: x86: Remove .get_pkru() from kvm_x86_ops
The commit

	9dd21e104bc ('KVM: x86: simplify handling of PKRU')

removed all users and providers of that call-back, but
didn't remove it. Remove it now.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-09-13 15:33:04 +02:00
Linus Torvalds
0756b7fbb6 First batch of KVM changes for 4.14
Common:
  - improve heuristic for boosting preempted spinlocks by ignoring VCPUs
    in user mode
 
 ARM:
  - fix for decoding external abort types from guests
 
  - added support for migrating the active priority of interrupts when
    running a GICv2 guest on a GICv3 host
 
  - minor cleanup
 
 PPC:
  - expose storage keys to userspace
 
  - merge powerpc/topic/ppc-kvm branch that contains
    find_linux_pte_or_hugepte and POWER9 thread management cleanup
 
  - merge kvm-ppc-fixes with a fix that missed 4.13 because of vacations
 
  - fixes
 
 s390:
  - merge of topic branch tlb-flushing from the s390 tree to get the
    no-dat base features
 
  - merge of kvm/master to avoid conflicts with additional sthyi fixes
 
  - wire up the no-dat enhancements in KVM
 
  - multiple epoch facility (z14 feature)
 
  - Configuration z/Architecture Mode
 
  - more sthyi fixes
 
  - gdb server range checking fix
 
  - small code cleanups
 
 x86:
  - emulate Hyper-V TSC frequency MSRs
 
  - add nested INVPCID
 
  - emulate EPTP switching VMFUNC
 
  - support Virtual GIF
 
  - support 5 level page tables
 
  - speedup nested VM exits by packing byte operations
 
  - speedup MMIO by using hardware provided physical address
 
  - a lot of fixes and cleanups, especially nested
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJZspE1AAoJEED/6hsPKofoDcMIALT11n+LKV50QGwQdg2W1GOt
 aChbgnj/Kegit3hQlDhVNb8kmdZEOZzSL81Lh0VPEr7zXU8QiWn2snbizDPv8sde
 MpHhcZYZZ0YrpoiZKjl8yiwcu88OWGn2qtJ7OpuTS5hvEGAfxMncp0AMZho6fnz/
 ySTwJ9GK2MTgBw39OAzCeDOeoYn4NKYMwjJGqBXRhNX8PG/1wmfqv0vPrd6wfg31
 KJ58BumavwJjr8YbQ1xELm9rpQrAmaayIsG0R1dEUqCbt5a1+t2gt4h2uY7tWcIv
 ACt2bIze7eF3xA+OpRs+eT+yemiH3t9btIVmhCfzUpnQ+V5Z55VMSwASLtTuJRQ=
 =R8Ry
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.14-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "First batch of KVM changes for 4.14

  Common:
   - improve heuristic for boosting preempted spinlocks by ignoring
     VCPUs in user mode

  ARM:
   - fix for decoding external abort types from guests

   - added support for migrating the active priority of interrupts when
     running a GICv2 guest on a GICv3 host

   - minor cleanup

  PPC:
   - expose storage keys to userspace

   - merge kvm-ppc-fixes with a fix that missed 4.13 because of
     vacations

   - fixes

  s390:
   - merge of kvm/master to avoid conflicts with additional sthyi fixes

   - wire up the no-dat enhancements in KVM

   - multiple epoch facility (z14 feature)

   - Configuration z/Architecture Mode

   - more sthyi fixes

   - gdb server range checking fix

   - small code cleanups

  x86:
   - emulate Hyper-V TSC frequency MSRs

   - add nested INVPCID

   - emulate EPTP switching VMFUNC

   - support Virtual GIF

   - support 5 level page tables

   - speedup nested VM exits by packing byte operations

   - speedup MMIO by using hardware provided physical address

   - a lot of fixes and cleanups, especially nested"

* tag 'kvm-4.14-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
  KVM: arm/arm64: Support uaccess of GICC_APRn
  KVM: arm/arm64: Extract GICv3 max APRn index calculation
  KVM: arm/arm64: vITS: Drop its_ite->lpi field
  KVM: arm/arm64: vgic: constify seq_operations and file_operations
  KVM: arm/arm64: Fix guest external abort matching
  KVM: PPC: Book3S HV: Fix memory leak in kvm_vm_ioctl_get_htab_fd
  KVM: s390: vsie: cleanup mcck reinjection
  KVM: s390: use WARN_ON_ONCE only for checking
  KVM: s390: guestdbg: fix range check
  KVM: PPC: Book3S HV: Report storage key support to userspace
  KVM: PPC: Book3S HV: Fix case where HDEC is treated as 32-bit on POWER9
  KVM: PPC: Book3S HV: Fix invalid use of register expression
  KVM: PPC: Book3S HV: Fix H_REGISTER_VPA VPA size validation
  KVM: PPC: Book3S HV: Fix setting of storage key in H_ENTER
  KVM: PPC: e500mc: Fix a NULL dereference
  KVM: PPC: e500: Fix some NULL dereferences on error
  KVM: PPC: Book3S HV: Protect updates to spapr_tce_tables list
  KVM: s390: we are always in czam mode
  KVM: s390: expose no-DAT to guest and migration support
  KVM: s390: sthyi: remove invalid guest write access
  ...
2017-09-08 15:18:36 -07:00
Radim Krčmář
5f54c8b2d4 Merge branch 'kvm-ppc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
This fix was intended for 4.13, but didn't get in because both
maintainers were on vacation.

Paul Mackerras:
 "It adds mutual exclusion between list_add_rcu and list_del_rcu calls
  on the kvm->arch.spapr_tce_tables list.  Without this, userspace could
  potentially trigger corruption of the list and cause a host crash or
  worse."
2017-09-08 14:40:43 +02:00
Linus Torvalds
b1b6f83ac9 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm changes from Ingo Molnar:
 "PCID support, 5-level paging support, Secure Memory Encryption support

  The main changes in this cycle are support for three new, complex
  hardware features of x86 CPUs:

   - Add 5-level paging support, which is a new hardware feature on
     upcoming Intel CPUs allowing up to 128 PB of virtual address space
     and 4 PB of physical RAM space - a 512-fold increase over the old
     limits. (Supercomputers of the future forecasting hurricanes on an
     ever warming planet can certainly make good use of more RAM.)

     Many of the necessary changes went upstream in previous cycles,
     v4.14 is the first kernel that can enable 5-level paging.

     This feature is activated via CONFIG_X86_5LEVEL=y - disabled by
     default.

     (By Kirill A. Shutemov)

   - Add 'encrypted memory' support, which is a new hardware feature on
     upcoming AMD CPUs ('Secure Memory Encryption', SME) allowing system
     RAM to be encrypted and decrypted (mostly) transparently by the
     CPU, with a little help from the kernel to transition to/from
     encrypted RAM. Such RAM should be more secure against various
     attacks like RAM access via the memory bus and should make the
     radio signature of memory bus traffic harder to intercept (and
     decrypt) as well.

     This feature is activated via CONFIG_AMD_MEM_ENCRYPT=y - disabled
     by default.

     (By Tom Lendacky)

   - Enable PCID optimized TLB flushing on newer Intel CPUs: PCID is a
     hardware feature that attaches an address space tag to TLB entries
     and thus allows to skip TLB flushing in many cases, even if we
     switch mm's.

     (By Andy Lutomirski)

  All three of these features were in the works for a long time, and
  it's coincidence of the three independent development paths that they
  are all enabled in v4.14 at once"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (65 commits)
  x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)
  x86/mm: Use pr_cont() in dump_pagetable()
  x86/mm: Fix SME encryption stack ptr handling
  kvm/x86: Avoid clearing the C-bit in rsvd_bits()
  x86/CPU: Align CR3 defines
  x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  acpi, x86/mm: Remove encryption mask from ACPI page protection type
  x86/mm, kexec: Fix memory corruption with SME on successive kexecs
  x86/mm/pkeys: Fix typo in Documentation/x86/protection-keys.txt
  x86/mm/dump_pagetables: Speed up page tables dump for CONFIG_KASAN=y
  x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID
  x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y
  x86/mm: Allow userspace have mappings above 47-bit
  x86/mm: Prepare to expose larger address space to userspace
  x86/mpx: Do not allow MPX if we have mappings above 47-bit
  x86/mm: Rename tasksize_32bit/64bit to task_size_32bit/64bit()
  x86/xen: Redefine XEN_ELFNOTE_INIT_P2M using PUD_SIZE * PTRS_PER_PUD
  x86/mm/dump_pagetables: Fix printout of p4d level
  x86/mm/dump_pagetables: Generalize address normalization
  x86/boot: Fix memremap() related build failure
  ...
2017-09-04 12:21:28 -07:00
Jérôme Glisse
fb1522e099 KVM: update to new mmu_notifier semantic v2
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and are now bracketed by calls to
mmu_notifier_invalidate_range_start()/end()

Remove now useless invalidate_page callback.

Changed since v1 (Linus Torvalds)
    - remove now useless kvm_arch_mmu_notifier_invalidate_page()

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Tested-by: Adam Borowski <kilobyte@angband.pl>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-31 16:13:00 -07:00
Paolo Bonzini
b9dd21e104 KVM: x86: simplify handling of PKRU
Move it to struct kvm_arch_vcpu, replacing guest_pkru_valid with a
simple comparison against the host value of the register.  The write of
PKRU in addition can be skipped if the guest has not enabled the feature.
Once we do this, we need not test OSPKE in the host anymore, because
guest_CR4.PKE=1 implies host_CR4.PKE=1.

The static PKU test is kept to elide the code on older CPUs.

Suggested-by: Yang Zhang <zy107165@alibaba-inc.com>
Fixes: 1be0e61c1f
Cc: stable@vger.kernel.org
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-25 09:28:28 +02:00
Wanpeng Li
664f8e26b0 KVM: X86: Fix loss of exception which has not yet been injected
vmx_complete_interrupts() assumes that the exception is always injected,
so it can be dropped by kvm_clear_exception_queue().  However,
an exception cannot be injected immediately if it is: 1) originally
destined to a nested guest; 2) trapped to cause a vmexit; 3) happening
right after VMLAUNCH/VMRESUME, i.e. when nested_run_pending is true.

This patch applies to exceptions the same algorithm that is used for
NMIs, replacing exception.reinject with "exception.injected" (equivalent
to nmi_injected).

exception.pending now represents an exception that is queued and whose
side effects (e.g., update RFLAGS.RF or DR7) have not been applied yet.
If exception.pending is true, the exception might result in a nested
vmexit instead, too (in which case the side effects must not be applied).

exception.injected instead represents an exception that is going to be
injected into the guest at the next vmentry.

Reported-by: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:19 +02:00
Yu Zhang
fd8cb43373 KVM: MMU: Expose the LA57 feature to VM.
This patch exposes 5 level page table feature to the VM.
At the same time, the canonical virtual address checking is
extended to support both 48-bits and 57-bits address width.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:17 +02:00
Yu Zhang
855feb6736 KVM: MMU: Add 5 level EPT & Shadow page table support.
Extends the shadow paging code, so that 5 level shadow page
table can be constructed if VM is running in 5 level paging
mode.

Also extends the ept code, so that 5 level ept table can be
constructed if maxphysaddr of VM exceeds 48 bits. Unlike the
shadow logic, KVM should still use 4 level ept table for a VM
whose physical address width is less than 48 bits, even when
the VM is running in 5 level paging mode.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
[Unconditionally reset the MMU context in kvm_cpuid_update.
 Changing MAXPHYADDR invalidates the reserved bit bitmasks.
 - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:17 +02:00
Yu Zhang
2a7266a8f9 KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL.
Now we have 4 level page table and 5 level page table in 64 bits
long mode, let's rename the PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL,
then we can use PT64_ROOT_5LEVEL for 5 level page table, it's
helpful to make the code more clear.

Also PT64_ROOT_MAX_LEVEL is defined as 4, so that we can just
redefine it to 5 whenever a replacement is needed for 5 level
paging.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:16 +02:00
Yu Zhang
d1cd3ce900 KVM: MMU: check guest CR3 reserved bits based on its physical address width.
Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the
reserved bits in CR3. Yet the length of reserved bits in
guest CR3 should be based on the physical address width
exposed to the VM. This patch changes CR3 check logic to
calculate the reserved bits at runtime.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:16 +02:00
Brijesh Singh
618232e219 KVM: x86: Avoid guest page table walk when gpa_available is set
When a guest causes a page fault which requires emulation, the
vcpu->arch.gpa_available flag is set to indicate that cr2 contains a
valid GPA.

Currently, emulator_read_write_onepage() makes use of gpa_available flag
to avoid a guest page walk for a known MMIO regions. Lets not limit
the gpa_available optimization to just MMIO region. The patch extends
the check to avoid page walk whenever gpa_available flag is set.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
[Fix EPT=0 according to Wanpeng Li's fix, plus ensure VMX also uses the
 new code. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Moved "ret < 0" to the else brach, as per David's review. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18 14:37:49 +02:00
Paolo Bonzini
eebed24389 kvm: nVMX: Add support for fast unprotection of nested guest page tables
This is the same as commit 147277540b ("kvm: svm: Add support for
additional SVM NPF error codes", 2016-11-23), but for Intel processors.
In this case, the exit qualification field's bit 8 says whether the
EPT violation occurred while translating the guest's final physical
address or rather while translating the guest page tables.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-10 16:44:04 +02:00
Longpeng(Mike)
de63ad4cf4 KVM: X86: implement the logic for spinlock optimization
get_cpl requires vcpu_load, so we must cache the result (whether the
vcpu was preempted when its cpl=0) in kvm_vcpu_arch.

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-08 10:57:43 +02:00
Tom Lendacky
d0ec49d4de kvm/x86/svm: Support Secure Memory Encryption within KVM
Update the KVM support to work with SME. The VMCB has a number of fields
where physical addresses are used and these addresses must contain the
memory encryption mask in order to properly access the encrypted memory.
Also, use the memory encryption mask when creating and using the nested
page tables.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Toshimitsu Kani <toshi.kani@hpe.com>
Cc: kasan-dev@googlegroups.com
Cc: kvm@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/89146eccfa50334409801ff20acd52a90fb5efcf.1500319216.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-18 11:38:04 +02:00
Roman Kagan
d3457c877b kvm: x86: hyperv: make VP_INDEX managed by userspace
Hyper-V identifies vCPUs by Virtual Processor Index, which can be
queried via HV_X64_MSR_VP_INDEX msr.  It is defined by the spec as a
sequential number which can't exceed the maximum number of vCPUs per VM.
APIC ids can be sparse and thus aren't a valid replacement for VP
indices.

Current KVM uses its internal vcpu index as VP_INDEX.  However, to make
it predictable and persistent across VM migrations, the userspace has to
control the value of VP_INDEX.

This patch achieves that, by storing vp_index explicitly on vcpu, and
allowing HV_X64_MSR_VP_INDEX to be set from the host side.  For
compatibility it's initialized to KVM vcpu index.  Also a few variables
are renamed to make clear distinction betweed this Hyper-V vp_index and
KVM vcpu_id (== APIC id).  Besides, a new capability,
KVM_CAP_HYPERV_VP_INDEX, is added to allow the userspace to skip
attempting msr writes where unsupported, to avoid spamming error logs.

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14 16:28:18 +02:00
Wanpeng Li
52a5c155cf KVM: async_pf: Let guest support delivery of async_pf from guest mode
Adds another flag bit (bit 2) to MSR_KVM_ASYNC_PF_EN. If bit 2 is 1,
async page faults are delivered to L1 as #PF vmexits; if bit 2 is 0,
kvm_can_do_async_pf returns 0 if in guest mode.

This is similar to what svm.c wanted to do all along, but it is only
enabled for Linux as L1 hypervisor.  Foreign hypervisors must never
receive async page faults as vmexits, because they'd probably be very
confused about that.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14 14:26:16 +02:00
Wanpeng Li
adfe20fb48 KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf
Add an nested_apf field to vcpu->arch.exception to identify an async page
fault, and constructs the expected vm-exit information fields. Force a
nested VM exit from nested_vmx_check_exception() if the injected #PF is
async page fault.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14 14:26:16 +02:00
Wanpeng Li
1261bfa326 KVM: async_pf: Add L1 guest async_pf #PF vmexit handler
This patch adds the L1 guest async page fault #PF vmexit handler, such
by L1 similar to ordinary async page fault.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[Passed insn parameters to kvm_mmu_page_fault().]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14 14:25:24 +02:00
Wanpeng Li
cfcd20e5ca KVM: x86: Simplify kvm_x86_ops->queue_exception parameter list
This patch removes all arguments except the first in
kvm_x86_ops->queue_exception since they can extract the arguments from
vcpu->arch.exception themselves.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14 14:24:28 +02:00
Roman Kagan
efc479e690 kvm: x86: hyperv: add KVM_CAP_HYPERV_SYNIC2
There is a flaw in the Hyper-V SynIC implementation in KVM: when message
page or event flags page is enabled by setting the corresponding msr,
KVM zeroes it out.  This is problematic because on migration the
corresponding MSRs are loaded on the destination, so the content of
those pages is lost.

This went unnoticed so far because the only user of those pages was
in-KVM hyperv synic timers, which could continue working despite that
zeroing.

Newer QEMU uses those pages for Hyper-V VMBus implementation, and
zeroing them breaks the migration.

Besides, in newer QEMU the content of those pages is fully managed by
QEMU, so zeroing them is undesirable even when writing the MSRs from the
guest side.

To support this new scheme, introduce a new capability,
KVM_CAP_HYPERV_SYNIC2, which, when enabled, makes sure that the synic
pages aren't zeroed out in KVM.

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-13 17:41:04 +02:00
Ladi Prosek
a826faf108 KVM: x86: make backwards_tsc_observed a per-VM variable
The backwards_tsc_observed global introduced in commit 16a9602 is never
reset to false. If a VM happens to be running while the host is suspended
(a common source of the TSC jumping backwards), master clock will never
be enabled again for any VM. In contrast, if no VM is running while the
host is suspended, master clock is unaffected. This is inconsistent and
unnecessarily strict. Let's track the backwards_tsc_observed variable
separately and let each VM start with a clean slate.

Real world impact: My Windows VMs get slower after my laptop undergoes a
suspend/resume cycle. The only way to get the perf back is unloading and
reloading the kvm module.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-13 17:25:39 +02:00
Peter Feiner
ac8d57e573 kvm: x86: mmu: allow A/D bits to be disabled in an mmu
Adds the plumbing to disable A/D bits in the MMU based on a new role
bit, ad_disabled. When A/D is disabled, the MMU operates as though A/D
aren't available (i.e., using access tracking faults instead).

To avoid SP -> kvm_mmu_page.role.ad_disabled lookups all over the
place, A/D disablement is now stored in the SPTE. This state is stored
in the SPTE by tweaking the use of SPTE_SPECIAL_MASK for access
tracking. Rather than just setting SPTE_SPECIAL_MASK when an
access-tracking SPTE is non-present, we now always set
SPTE_SPECIAL_MASK for access-tracking SPTEs.

Signed-off-by: Peter Feiner <pfeiner@google.com>
[Use role.ad_disabled even for direct (non-shadow) EPT page tables.  Add
 documentation and a few MMU_WARN_ONs. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-07-03 11:19:54 +02:00
Paolo Bonzini
04a7ea04d5 KVM/ARM updates for 4.13
- vcpu request overhaul
 - allow timer and PMU to have their interrupt number
   selected from userspace
 - workaround for Cavium erratum 30115
 - handling of memory poisonning
 - the usual crop of fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAllWCM0VHG1hcmMuenlu
 Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDjJ0QAI16x6+trKhH31lTSYekYfqm4hZ2
 Fp7IbALW9KNCaY35tZov2Zuh99qGRduxTh7ewqhKpON8kkU+UKj0F7zH22+vfN4m
 yas/+uNr8R9VLyvea4ysPsgx8Q8v1Ix9setohHYNZIL9/klVqtaHpYvArHVF/mzq
 p2j/NxRS2dlp9r2TtoMRMhA05u6r0wolhUuh+z9v2ipib0gfOBIG24jsqCTEcD9n
 5A/cVd+ztYshkrV95h3y9peahwt3zOA4QBGzrQ2K25jp0s54nqhmC7JTNSa8dtar
 YGW2MuAMoIFTwCFAlpwCzrwpOJFzF3Q6A8bOxei2fjclzjPMgT1xQxuhOoe4ntFa
 lTPxSHalm5W6dFTW90YSo2DBcPe+N7sQkhjR0cCeY3GYsOFhXMLTlOl5Pt1YK1or
 +3FAI74tFRKvVmb9mhZeGTvuzhDgRvtf3Qq5rjwlGzKc2BBOEgtMyj/Wgwo4N6Dz
 IjOnoRaUGELoBCWoTorMxLpsPBdPVSUxNyJTdAhqZ/ZtT1xqjhFNLZcrVWmOTzDM
 1cav+jZkla4sLmJSNDD54aCSvvtPHis0nZn9PRlh12xgOyYiAVx4K++MNuWP0P37
 hbh1gbPT+FcoVxPurUsX/pjNlTucPZcBwFytZDQlpwtPBpEFzJiImLYe/PldRb0f
 9WQOH1Y1+q14MF+N
 =6hNK
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/ARM updates for 4.13

- vcpu request overhaul
- allow timer and PMU to have their interrupt number
  selected from userspace
- workaround for Cavium erratum 30115
- handling of memory poisonning
- the usual crop of fixes and cleanups

Conflicts:
	arch/s390/include/asm/kvm_host.h
2017-06-30 12:38:26 +02:00
Andrew Jones
2387149ead KVM: improve arch vcpu request defining
Marc Zyngier suggested that we define the arch specific VCPU request
base, rather than requiring each arch to remember to start from 8.
That suggestion, along with Radim Krcmar's recent VCPU request flag
addition, snowballed into defining something of an arch VCPU request
defining API.

No functional change.

(Looks like x86 is running out of arch VCPU request bits.  Maybe
 someday we'll need to extend to 64.)

Signed-off-by: Andrew Jones <drjones@redhat.com>
Acked-by: Christoffer Dall <cdall@linaro.org>
Signed-off-by: Christoffer Dall <cdall@linaro.org>
2017-06-04 16:53:00 +02:00
Paolo Bonzini
b401ee0b85 KVM: x86: lower default for halt_poll_ns
In some fio benchmarks, halt_poll_ns=400000 caused CPU utilization to
increase heavily even in cases where the performance improvement was
small.  In particular, bandwidth divided by CPU usage was as much as
60% lower.

To some extent this is the expected effect of the patch, and the
additional CPU utilization is only visible when running the
benchmarks.  However, halving the threshold also halves the extra
CPU utilization (from +30-130% to +20-70%) and has no negative
effect on performance.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-05-16 21:15:50 +02:00
Bandan Das
bab4165e2f kvm: x86: Add a hook for arch specific dirty logging emulation
When KVM updates accessed/dirty bits, this hook can be used
to invoke an arch specific function that implements/emulates
dirty logging such as PML.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-05-09 11:54:16 +02:00
David Hildenbrand
5c0aea0e8d KVM: x86: don't hold kvm->lock in KVM_SET_GSI_ROUTING
We needed the lock to avoid racing with creation of the irqchip on x86. As
kvm_set_irq_routing() calls srcu_synchronize_expedited(), this lock
might be held for a longer time.

Let's introduce an arch specific callback to check if we can actually
add irq routes. For x86, all we have to do is check if we have an
irqchip in the kernel. We don't need kvm->lock at that point as the
irqchip is marked as inititalized only when actually fully created.

Reported-by: Steve Rutherford <srutherford@google.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Fixes: 1df6ddede1 ("KVM: x86: race between KVM_SET_GSI_ROUTING and KVM_CREATE_IRQCHIP")
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-05-02 14:45:45 +02:00
Paolo Bonzini
7a97cec26b KVM: mark requests that need synchronization
kvm_make_all_requests() provides a synchronization that waits until all
kicked VCPUs have acknowledged the kick.  This is important for
KVM_REQ_MMU_RELOAD as it prevents freeing while lockless paging is
underway.

This patch adds the synchronization property into all requests that are
currently being used with kvm_make_all_requests() in order to preserve
the current behavior and only introduce a new framework.  Removing it
from requests where it is not necessary is left for future patches.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-27 14:36:44 +02:00
Radim Krčmář
930f7fd6da KVM: mark requests that do not need a wakeup
Some operations must ensure that the guest is not running with stale
data, but if the guest is halted, then the update can wait until another
event happens.  kvm_make_all_requests() currently doesn't wake up, so we
can mark all requests used with it.

First 8 bits were arbitrarily reserved for request numbers.

Most uses of requests have the request type as a constant, so a compiler
will optimize the '&'.

An alternative would be to have an inline function that would return
whether the request needs a wake-up or not, but I like this one better
even though it might produce worse assembly.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-27 14:13:57 +02:00
Kyle Huey
db2336a804 KVM: x86: virtualize cpuid faulting
Hardware support for faulting on the cpuid instruction is not required to
emulate it, because cpuid triggers a VM exit anyways. KVM handles the relevant
MSRs (MSR_PLATFORM_INFO and MSR_MISC_FEATURES_ENABLE) and upon a
cpuid-induced VM exit checks the cpuid faulting state and the CPL.
kvm_require_cpl is even kind enough to inject the GP fault for us.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[Return "1" from kvm_emulate_cpuid, it's not void. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-21 12:50:06 +02:00
David Hildenbrand
637e3f86fa KVM: x86: new irqchip mode KVM_IRQCHIP_INIT_IN_PROGRESS
Let's add a new mode and set it while we create the irqchip via
KVM_CREATE_IRQCHIP and KVM_CAP_SPLIT_IRQCHIP.

This mode will be used later to test if adding routes
(in kvm_set_routing_entry()) is already allowed.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:13 +02:00
Paolo Bonzini
4b4357e025 kvm: make KVM_COALESCED_MMIO_PAGE_OFFSET public
Its value has never changed; we might as well make it part of the ABI instead
of using the return value of KVM_CHECK_EXTENSION(KVM_CAP_COALESCED_MMIO).

Because PPC does not always make MMIO available, the code has to be made
dependent on CONFIG_KVM_MMIO rather than KVM_COALESCED_MMIO_PAGE_OFFSET.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-07 16:49:01 +02:00
Paolo Bonzini
ae1e2d1082 kvm: nVMX: support EPT accessed/dirty bits
Now use bit 6 of EPTP to optionally enable A/D bits for EPTP.  Another
thing to change is that, when EPT accessed and dirty bits are not in use,
VMX treats accesses to guest paging structures as data reads.  When they
are in use (bit 6 of EPTP is set), they are treated as writes and the
corresponding EPT dirty bit is set.  The MMU didn't know this detail,
so this patch adds it.

We also have to fix up the exit qualification.  It may be wrong because
KVM sets bit 6 but the guest might not.

L1 emulates EPT A/D bits using write permissions, so in principle it may
be possible for EPT A/D bits to be used by L1 even though not available
in hardware.  The problem is that guest page-table walks will be treated
as reads rather than writes, so they would not cause an EPT violation.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Fixed typo in walk_addr_generic() comment and changed bit clear +
 conditional-set pattern in handle_ept_violation() to conditional-clear]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-07 16:49:00 +02:00
Paolo Bonzini
bd7e5b0899 KVM: x86: remove code for lazy FPU handling
The FPU is always active now when running KVM.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-17 12:28:01 +01:00
Paolo Bonzini
76dfafd536 KVM: x86: do not scan IRR twice on APICv vmentry
Calls to apic_find_highest_irr are scanning IRR twice, once
in vmx_sync_pir_from_irr and once in apic_search_irr.  Change
sync_pir_from_irr to get the new maximum IRR from kvm_apic_update_irr;
now that it does the computation, it can also do the RVI write.

In order to avoid complications in svm.c, make the callback optional.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:35 +01:00
Paolo Bonzini
0f1e261ead KVM: x86: add VCPU stat for KVM_REQ_EVENT processing
This statistic can be useful to estimate the cost of an IRQ injection
scenario, by comparing it with irq_injections.  For example the stat
shows that sti;hlt triggers more KVM_REQ_EVENT than sti;nop.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:47:59 +01:00
Tom Lendacky
0f89b207b0 kvm: svm: Use the hardware provided GPA instead of page walk
When a guest causes a NPF which requires emulation, KVM sometimes walks
the guest page tables to translate the GVA to a GPA. This is unnecessary
most of the time on AMD hardware since the hardware provides the GPA in
EXITINFO2.

The only exception cases involve string operations involving rep or
operations that use two memory locations. With rep, the GPA will only be
the value of the initial NPF and with dual memory locations we won't know
which memory address was translated into EXITINFO2.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:47:58 +01:00
Junaid Shahid
f160c7b7bb kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
This change implements lockless access tracking for Intel CPUs without EPT
A bits. This is achieved by marking the PTEs as not-present (but not
completely clearing them) when clear_flush_young() is called after marking
the pages as accessed. When an EPT Violation is generated as a result of
the VM accessing those pages, the PTEs are restored to their original values.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:11 +01:00
Junaid Shahid
37f0e8fe6b kvm: x86: mmu: Do not use bit 63 for tracking special SPTEs
MMIO SPTEs currently set both bits 62 and 63 to distinguish them as special
PTEs. However, bit 63 is used as the SVE bit in Intel EPT PTEs. The SVE bit
is ignored for misconfigured PTEs but not necessarily for not-Present PTEs.
Since MMIO SPTEs use an EPT misconfiguration, so using bit 63 for them is
acceptable. However, the upcoming fast access tracking feature adds another
type of special tracking PTE, which uses not-Present PTEs and hence should
not set bit 63.

In order to use common bits to distinguish both type of special PTEs, we
now use only bit 62 as the special bit.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:10 +01:00
David Matlack
114df303a7 kvm: x86: reduce collisions in mmu_page_hash
When using two-dimensional paging, the mmu_page_hash (which provides
lookups for existing kvm_mmu_page structs), becomes imbalanced; with
too many collisions in buckets 0 and 512. This has been seen to cause
mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
VMs with a large amount of RAM mapped with 4K pages.

The current hash function uses the lower 10 bits of gfn to index into
mmu_page_hash. When doing shadow paging, gfn is the address of the
guest page table being shadow. These tables are 4K-aligned, which
makes the low bits of gfn a good hash. However, with two-dimensional
paging, no guest page tables are being shadowed, so gfn is the base
address that is mapped by the table. Thus page tables (level=1) have
a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
etc. This means hashes will only differ in their 10th bit.

hash_64() provides a better hash. For example, on a VM with ~200G
(99458 direct=1 kvm_mmu_page structs):

hash            max_mmu_page_hash_collisions
--------------------------------------------
low 10 bits     49847
hash_64         105
perfect         97

While we're changing the hash, increase the table size by 4x to better
support large VMs (further reduces number of collisions in 200G VM to
29).

Note that hash_64() does not provide a good distribution prior to commit
ef703f49a6 ("Eliminate bad hash multipliers from hash_32() and
hash_64()").

Signed-off-by: David Matlack <dmatlack@google.com>
Change-Id: I5aa6b13c834722813c6cca46b8b1ed6f53368ade
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:04 +01:00
David Matlack
f3414bc774 kvm: x86: export maximum number of mmu_page_hash collisions
Report the maximum number of mmu_page_hash collisions as a per-VM stat.
This will make it easy to identify problems with the mmu_page_hash in
the future.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:03 +01:00
Radim Krčmář
49776faf93 KVM: x86: decouple irqchip_in_kernel() and pic_irqchip()
irqchip_in_kernel() tried to save a bit by reusing pic_irqchip(), but it
just complicated the code.
Add a separate state for the irqchip mode.

Reviewed-by: David Hildenbrand <david@redhat.com>
[Used Paolo's version of condition in irqchip_in_kernel().]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-09 14:42:47 +01:00
Thomas Gleixner
a5a1d1c291 clocksource: Use a plain u64 instead of cycle_t
There is no point in having an extra type for extra confusion. u64 is
unambiguous.

Conversion was done with the following coccinelle script:

@rem@
@@
-typedef u64 cycle_t;

@fix@
typedef cycle_t;
@@
-cycle_t
+u64

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
2016-12-25 11:04:12 +01:00
Paolo Bonzini
3f5ad8be37 KVM: hyperv: fix locking of struct kvm_hv fields
Introduce a new mutex to avoid an AB-BA deadlock between kvm->lock and
vcpu->mutex.  Protect accesses in kvm_hv_setup_tsc_page too, as suggested
by Roman.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-16 17:53:38 +01:00
Ladi Prosek
9ed38ffad4 KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry
Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
as implemented in kvm_set_cr3, in several ways.

* different rules are followed to check CR3 and it is desirable for the caller
to distinguish between the possible failures
* PDPTRs are not loaded if PAE paging and nested EPT are both enabled
* many MMU operations are not necessary

This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
nested vmentry and vmexit, and makes use of it on the nested vmentry path.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:10 +01:00
Kyle Huey
6affcbedca KVM: x86: Add kvm_skip_emulated_instruction and use it.
kvm_skip_emulated_instruction calls both
kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
skipping the emulated instruction and generating a trap if necessary.

Replacing skip_emulated_instruction calls with
kvm_skip_emulated_instruction is straightforward, except for:

- ICEBP, which is already inside a trap, so avoid triggering another trap.
- Instructions that can trigger exits to userspace, such as the IO insns,
  MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
  KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
  IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
  take precedence. The singlestep will be triggered again on the next
  instruction, which is the current behavior.
- Task switch instructions which would require additional handling (e.g.
  the task switch bit) and are instead left alone.
- Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
  which do not trigger singlestep traps as mentioned previously.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:05 +01:00
Kyle Huey
6a908b628c KVM: x86: Add a return value to kvm_emulate_cpuid
Once skipping the emulated instruction can potentially trigger an exit to
userspace (via KVM_GUESTDBG_SINGLESTEP) kvm_emulate_cpuid will need to
propagate a return value.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:03 +01:00
Tom Lendacky
8370c3d08b kvm: svm: Add kvm_fast_pio_in support
Update the I/O interception support to add the kvm_fast_pio_in function
to speed up the in instruction similar to the out instruction.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:32:45 +01:00
Tom Lendacky
147277540b kvm: svm: Add support for additional SVM NPF error codes
AMD hardware adds two additional bits to aid in nested page fault handling.

Bit 32 - NPF occurred while translating the guest's final physical address
Bit 33 - NPF occurred while translating the guest page tables

The guest page tables fault indicator can be used as an aid for nested
virtualization. Using V0 for the host, V1 for the first level guest and
V2 for the second level guest, when both V1 and V2 are using nested paging
there are currently a number of unnecessary instruction emulations. When
V2 is launched shadow paging is used in V1 for the nested tables of V2. As
a result, KVM marks these pages as RO in the host nested page tables. When
V2 exits and we resume V1, these pages are still marked RO.

Every nested walk for a guest page table is treated as a user-level write
access and this causes a lot of NPFs because the V1 page tables are marked
RO in the V0 nested tables. While executing V1, when these NPFs occur KVM
sees a write to a read-only page, emulates the V1 instruction and unprotects
the page (marking it RW). This patch looks for cases where we get a NPF due
to a guest page table walk where the page was marked RO. It immediately
unprotects the page and resumes the guest, leading to far fewer instruction
emulations when nested virtualization is used.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:32:26 +01:00
Paolo Bonzini
ea26e4ec08 KVM: x86: drop TSC offsetting kvm_x86_ops to fix KVM_GET/SET_CLOCK
Since commit a545ab6a00 ("kvm: x86: add tsc_offset field to struct
kvm_vcpu_arch", 2016-09-07) the offset between host and L1 TSC is
cached and need not be fished out of the VMCS or VMCB.  This means
that we can implement adjust_tsc_offset_guest and read_l1_tsc
entirely in generic code.  The simplification is particularly
significant for VMX code, where vmx->nested.vmcs01_tsc_offset
was duplicating what is now in vcpu->arch.tsc_offset.  Therefore
the vmcs01_tsc_offset can be dropped completely.

More importantly, this fixes KVM_GET_CLOCK/KVM_SET_CLOCK
which, after commit 108b249c45 ("KVM: x86: introduce get_kvmclock_ns",
2016-09-01) called read_l1_tsc while the VMCS was not loaded.
It thus returned bogus values on Intel CPUs.

Fixes: 108b249c45
Reported-by: Roman Kagan <rkagan@virtuozzo.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02 20:03:07 +01:00
Paolo Bonzini
095cf55df7 KVM: x86: Hyper-V tsc page setup
Lately tsc page was implemented but filled with empty
values. This patch setup tsc page scale and offset based
on vcpu tsc, tsc_khz and  HV_X64_MSR_TIME_REF_COUNT value.

The valid tsc page drops HV_X64_MSR_TIME_REF_COUNT msr
reads count to zero which potentially improves performance.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Peter Hornyack <peterhornyack@google.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
[Computation of TSC page parameters rewritten to use the Linux timekeeper
 parameters. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 09:26:20 +02:00
Luiz Capitulino
3e3f50262e kvm: x86: drop read_tsc_offset()
The TSC offset can now be read directly from struct kvm_arch_vcpu.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 16:57:45 +02:00
Luiz Capitulino
a545ab6a00 kvm: x86: add tsc_offset field to struct kvm_vcpu_arch
A future commit will want to easily read a vCPU's TSC offset,
so we store it in struct kvm_arch_vcpu_arch for easy access.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 16:57:45 +02:00
Paolo Bonzini
ad53e35ae5 Merge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
Paul Mackerras writes:

    The highlights are:

    * Reduced latency for interrupts from PCI pass-through devices, from
      Suresh Warrier and me.
    * Halt-polling implementation from Suraj Jitindar Singh.
    * 64-bit VCPU statistics, also from Suraj.
    * Various other minor fixes and improvements.
2016-09-13 15:20:55 +02:00
Suravee Suthikulpanit
5881f73757 svm: Introduce AMD IOMMU avic_ga_log_notifier
This patch introduces avic_ga_log_notifier, which will be called
by IOMMU driver whenever it handles the Guest vAPIC (GA) log entry.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08 13:11:56 +02:00
Suravee Suthikulpanit
5ea11f2b31 svm: Introduces AVIC per-VM ID
Introduces per-VM AVIC ID and helper functions to manage the IDs.
Currently, the ID will be used to implement 32-bit AVIC IOMMU GA tag.

The ID is 24-bit one-based indexing value, and is managed via helper
functions to get the next ID, or to free an ID once a VM is destroyed.
There should be no ID conflict for any active VMs.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08 12:57:20 +02:00
Suraj Jitindar Singh
8a7e75d47b KVM: Add provisioning for ulong vm stats and u64 vcpu stats
vms and vcpus have statistics associated with them which can be viewed
within the debugfs. Currently it is assumed within the vcpu_stat_get() and
vm_stat_get() functions that all of these statistics are represented as
u32s, however the next patch adds some u64 vcpu statistics.

Change all vcpu statistics to u64 and modify vcpu_stat_get() accordingly.
Since vcpu statistics are per vcpu, they will only be updated by a single
vcpu at a time so this shouldn't present a problem on 32-bit machines
which can't atomically increment 64-bit numbers. However vm statistics
could potentially be updated by multiple vcpus from that vm at a time.
To avoid the overhead of atomics make all vm statistics ulong such that
they are 64-bit on 64-bit systems where they can be atomically incremented
and are 32-bit on 32-bit systems which may not be able to atomically
increment 64-bit numbers. Modify vm_stat_get() to expect ulongs.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-08 12:25:37 +10:00
Linus Torvalds
221bb8a46e - ARM: GICv3 ITS emulation and various fixes. Removal of the old
VGIC implementation.
 
 - s390: support for trapping software breakpoints, nested virtualization
 (vSIE), the STHYI opcode, initial extensions for CPU model support.
 
 - MIPS: support for MIPS64 hosts (32-bit guests only) and lots of cleanups,
 preliminary to this and the upcoming support for hardware virtualization
 extensions.
 
 - x86: support for execute-only mappings in nested EPT; reduced vmexit
 latency for TSC deadline timer (by about 30%) on Intel hosts; support for
 more than 255 vCPUs.
 
 - PPC: bugfixes.
 
 The ugly bit is the conflicts.  A couple of them are simple conflicts due
 to 4.7 fixes, but most of them are with other trees. There was definitely
 too much reliance on Acked-by here.  Some conflicts are for KVM patches
 where _I_ gave my Acked-by, but the worst are for this pull request's
 patches that touch files outside arch/*/kvm.  KVM submaintainers should
 probably learn to synchronize better with arch maintainers, with the
 latter providing topic branches whenever possible instead of Acked-by.
 This is what we do with arch/x86.  And I should learn to refuse pull
 requests when linux-next sends scary signals, even if that means that
 submaintainers have to rebase their branches.
 
 Anyhow, here's the list:
 
 - arch/x86/kvm/vmx.c: handle_pcommit and EXIT_REASON_PCOMMIT was removed
 by the nvdimm tree.  This tree adds handle_preemption_timer and
 EXIT_REASON_PREEMPTION_TIMER at the same place.  In general all mentions
 of pcommit have to go.
 
 There is also a conflict between a stable fix and this patch, where the
 stable fix removed the vmx_create_pml_buffer function and its call.
 
 - virt/kvm/kvm_main.c: kvm_cpu_notifier was removed by the hotplug tree.
 This tree adds kvm_io_bus_get_dev at the same place.
 
 - virt/kvm/arm/vgic.c: a few final bugfixes went into 4.7 before the
 file was completely removed for 4.8.
 
 - include/linux/irqchip/arm-gic-v3.h: this one is entirely our fault;
 this is a change that should have gone in through the irqchip tree and
 pulled by kvm-arm.  I think I would have rejected this kvm-arm pull
 request.  The KVM version is the right one, except that it lacks
 GITS_BASER_PAGES_SHIFT.
 
 - arch/powerpc: what a mess.  For the idle_book3s.S conflict, the KVM
 tree is the right one; everything else is trivial.  In this case I am
 not quite sure what went wrong.  The commit that is causing the mess
 (fd7bacbca4, "KVM: PPC: Book3S HV: Fix TB corruption in guest exit
 path on HMI interrupt", 2016-05-15) touches both arch/powerpc/kernel/
 and arch/powerpc/kvm/.  It's large, but at 396 insertions/5 deletions
 I guessed that it wasn't really possible to split it and that the 5
 deletions wouldn't conflict.  That wasn't the case.
 
 - arch/s390: also messy.  First is hypfs_diag.c where the KVM tree
 moved some code and the s390 tree patched it.  You have to reapply the
 relevant part of commits 6c22c98637, plus all of e030c1125e, to
 arch/s390/kernel/diag.c.  Or pick the linux-next conflict
 resolution from http://marc.info/?l=kvm&m=146717549531603&w=2.
 Second, there is a conflict in gmap.c between a stable fix and 4.8.
 The KVM version here is the correct one.
 
 I have pushed my resolution at refs/heads/merge-20160802 (commit
 3d1f53419842) at git://git.kernel.org/pub/scm/virt/kvm/kvm.git.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJXoGm7AAoJEL/70l94x66DugQIAIj703ePAFepB/fCrKHkZZia
 SGrsBdvAtNsOhr7FQ5qvvjLxiv/cv7CymeuJivX8H+4kuUHUllDzey+RPHYHD9X7
 U6n1PdCH9F15a3IXc8tDjlDdOMNIKJixYuq1UyNZMU6NFwl00+TZf9JF8A2US65b
 x/41W98ilL6nNBAsoDVmCLtPNWAqQ3lajaZELGfcqRQ9ZGKcAYOaLFXHv2YHf2XC
 qIDMf+slBGSQ66UoATnYV2gAopNlWbZ7n0vO6tE2KyvhHZ1m399aBX1+k8la/0JI
 69r+Tz7ZHUSFtmlmyByi5IAB87myy2WQHyAPwj+4vwJkDGPcl0TrupzbG7+T05Y=
 =42ti
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:

 - ARM: GICv3 ITS emulation and various fixes.  Removal of the
   old VGIC implementation.

 - s390: support for trapping software breakpoints, nested
   virtualization (vSIE), the STHYI opcode, initial extensions
   for CPU model support.

 - MIPS: support for MIPS64 hosts (32-bit guests only) and lots
   of cleanups, preliminary to this and the upcoming support for
   hardware virtualization extensions.

 - x86: support for execute-only mappings in nested EPT; reduced
   vmexit latency for TSC deadline timer (by about 30%) on Intel
   hosts; support for more than 255 vCPUs.

 - PPC: bugfixes.

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (302 commits)
  KVM: PPC: Introduce KVM_CAP_PPC_HTM
  MIPS: Select HAVE_KVM for MIPS64_R{2,6}
  MIPS: KVM: Reset CP0_PageMask during host TLB flush
  MIPS: KVM: Fix ptr->int cast via KVM_GUEST_KSEGX()
  MIPS: KVM: Sign extend MFC0/RDHWR results
  MIPS: KVM: Fix 64-bit big endian dynamic translation
  MIPS: KVM: Fail if ebase doesn't fit in CP0_EBase
  MIPS: KVM: Use 64-bit CP0_EBase when appropriate
  MIPS: KVM: Set CP0_Status.KX on MIPS64
  MIPS: KVM: Make entry code MIPS64 friendly
  MIPS: KVM: Use kmap instead of CKSEG0ADDR()
  MIPS: KVM: Use virt_to_phys() to get commpage PFN
  MIPS: Fix definition of KSEGX() for 64-bit
  KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD
  kvm: x86: nVMX: maintain internal copy of current VMCS
  KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE
  KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures
  KVM: arm64: vgic-its: Simplify MAPI error handling
  KVM: arm64: vgic-its: Make vgic_its_cmd_handle_mapi similar to other handlers
  KVM: arm64: vgic-its: Turn device_id validation into generic ID validation
  ...
2016-08-02 16:11:27 -04:00
Radim Krčmář
af1bae5497 KVM: x86: bump KVM_MAX_VCPU_ID to 1023
kzalloc was replaced with kvm_kvzalloc to allow non-contiguous areas and
rcu had to be modified to cope with it.

The practical limit for KVM_MAX_VCPU_ID right now is INT_MAX, but lower
value was chosen in case there were bugs.  1023 is sufficient maximum
APIC ID for 288 VCPUs.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:29:35 +02:00
Radim Krčmář
682f732ecf KVM: x86: bump MAX_VCPUS to 288
288 is in high demand because of Knights Landing CPU.
We cannot set the limit to 640k, because that would be wasting space.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:29:34 +02:00
Radim Krčmář
c519265f2a KVM: x86: add a flag to disable KVM x2apic broadcast quirk
Add KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK as a feature flag to
KVM_CAP_X2APIC_API.

The quirk made KVM interpret 0xff as a broadcast even in x2APIC mode.
The enableable capability is needed in order to support standard x2APIC and
remain backward compatible.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
[Expand kvm_apic_mda comment. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:29:34 +02:00
Radim Krčmář
3713131345 KVM: x86: add KVM_CAP_X2APIC_API
KVM_CAP_X2APIC_API is a capability for features related to x2APIC
enablement.  KVM_X2APIC_API_32BIT_FORMAT feature can be enabled to
extend APIC ID in get/set ioctl and MSI addresses to 32 bits.
Both are needed to support x2APIC.

The feature has to be enableable and disabled by default, because
get/set ioctl shifted and truncated APIC ID to 8 bits by using a
non-standard protocol inspired by xAPIC and the change is not
backward-compatible.

Changes to MSI addresses follow the format used by interrupt remapping
unit.  The upper address word, that used to be 0, contains upper 24 bits
of the LAPIC address in its upper 24 bits.  Lower 8 bits are reserved as
0.  Using the upper address word is not backward-compatible either as we
didn't check that userspace zeroed the word.  Reserved bits are still
not explicitly checked, but non-zero data will affect LAPIC addresses,
which will cause a bug.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:57 +02:00
Radim Krčmář
0ca52e7b81 KVM: x86: dynamic kvm_apic_map
x2APIC supports up to 2^32-1 LAPICs, but most guest in coming years will
probably has fewer VCPUs.  Dynamic size saves memory at the cost of
turning one constant into a variable.

apic_map mutex had to be moved before allocation to avoid races with cpu
hotplug.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:53 +02:00
Radim Krčmář
e45115b62f KVM: x86: use physical LAPIC array for logical x2APIC
Logical x2APIC IDs map injectively to physical x2APIC IDs, so we can
reuse the physical array for them.  This allows us to save space by
sizing the logical maps according to the needs of xAPIC.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:52 +02:00
Radim Krčmář
757883de41 KVM: x86: bump KVM_SOFT_MAX_VCPUS to 240
240 has been well tested by Red Hat.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:51 +02:00
Bandan Das
ffb128c89b kvm: mmu: don't set the present bit unconditionally
To support execute only mappings on behalf of L1
hypervisors, we need to teach set_spte() to honor all three of
L1's XWR bits.  As a start, add a new variable "shadow_present_mask"
that will be set for non-EPT shadow paging and clear for EPT.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:14 +02:00
Ashok Raj
c45dcc71b7 KVM: VMX: enable guest access to LMCE related MSRs
On Intel platforms, this patch adds LMCE to KVM MCE supported
capabilities and handles guest access to LMCE related MSRs.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
[Haozhong: macro KVM_MCE_CAP_SUPPORTED => variable kvm_mce_cap_supported
           Only enable LMCE on Intel platform
           Check MSR_IA32_FEATURE_CONTROL when handling guest
             access to MSR_IA32_MCG_EXT_CTL]
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-23 19:17:29 +02:00
Yunhong Jiang
64672c95ea kvm: vmx: hook preemption timer support
Hook the VMX preemption timer to the "hv timer" functionality added
by the previous patch.  This includes: checking if the feature is
supported, if the feature is broken on the CPU, the hooks to
setup/clean the VMX preemption timer, arming the timer on vmentry
and handling the vmexit.

A module parameter states if the VMX preemption timer should be
utilized.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
[Move hv_deadline_tsc to struct vcpu_vmx, use -1 as the "unset" value.
 Put all VMX bits here.  Enable it by default #yolo. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 10:07:50 +02:00
Yunhong Jiang
ce7a058a21 KVM: x86: support using the vmx preemption timer for tsc deadline timer
The VMX preemption timer can be used to virtualize the TSC deadline timer.
The VMX preemption timer is armed when the vCPU is running, and a VMExit
will happen if the virtual TSC deadline timer expires.

When the vCPU thread is blocked because of HLT, KVM will switch to use
an hrtimer, and then go back to the VMX preemption timer when the vCPU
thread is unblocked.

This solution avoids the complex OS's hrtimer system, and the host
timer interrupt handling cost, replacing them with a little math
(for guest->host TSC and host TSC->preemption timer conversion)
and a cheaper VMexit.  This benefits latency for isolated pCPUs.

[A word about performance... Yunhong reported a 30% reduction in average
 latency from cyclictest.  I made a similar test with tscdeadline_latency
 from kvm-unit-tests, and measured

 - ~20 clock cycles loss (out of ~3200, so less than 1% but still
   statistically significant) in the worst case where the test halts
   just after programming the TSC deadline timer

 - ~800 clock cycles gain (25% reduction in latency) in the best case
   where the test busy waits.

 I removed the VMX bits from Yunhong's patch, to concentrate them in the
 next patch - Paolo]

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 10:07:48 +02:00
Suravee Suthikulpanit
7d669f5084 kvm: svm: Fix implicit declaration for __default_cpu_present_to_apicid()
The commit 8221c13700 ("svm: Manage vcpu load/unload when enable AVIC")
introduces a build error due to implicit function declaration
when #ifdef CONFIG_X86_32 and #ifndef CONFIG_X86_LOCAL_APIC
(as reported by Kbuild test robot i386-randconfig-x0-06121009).

So, this patch introduces kvm_cpu_get_apicid() wrapper
around __default_cpu_present_to_apicid() with additional
handling if CONFIG_X86_LOCAL_APIC is not defined.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: commit 8221c13700 ("svm: Manage vcpu load/unload when enable AVIC")
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 00:28:24 +02:00
Paolo Bonzini
67c9dddc95 KVM: x86: make hwapic_isr_update and hwapic_irr_update look the same
Neither APICv nor AVIC actually need the first argument of
hwapic_isr_update, but the vCPU makes more sense than passing the
pointer to the whole virtual machine!  In fact in the APICv case it's
just happening that the vCPU is used implicitly, through the loaded VMCS.

The second argument instead is named differently, make it consistent.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:32 +02:00
Suravee Suthikulpanit
be8ca170ed KVM: x86: Introducing kvm_x86_ops.apicv_post_state_restore
Adding kvm_x86_ops hooks to allow APICv to do post state restore.
This is required to support VM save and restore feature.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:30 +02:00
Suravee Suthikulpanit
18f40c53e1 svm: Add VMEXIT handlers for AVIC
This patch introduces VMEXIT handlers, avic_incomplete_ipi_interception()
and avic_unaccelerated_access_interception() along with two trace points
(trace_kvm_avic_incomplete_ipi and trace_kvm_avic_unaccelerated_access).

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:29 +02:00
Suravee Suthikulpanit
44a95dae1d KVM: x86: Detect and Initialize AVIC support
This patch introduces AVIC-related data structure, and AVIC
initialization code.

There are three main data structures for AVIC:
    * Virtual APIC (vAPIC) backing page (per-VCPU)
    * Physical APIC ID table (per-VM)
    * Logical APIC ID table (per-VM)

Currently, AVIC is disabled by default. Users can manually
enable AVIC via kernel boot option kvm-amd.avic=1 or during
kvm-amd module loading with parameter avic=1.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
[Avoid extra indentation (Boris). - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:28 +02:00
Suravee Suthikulpanit
d1ed092f77 KVM: x86: Introducing kvm_x86_ops VCPU blocking/unblocking hooks
Adding new function pointer in struct kvm_x86_ops, and calling them
from the kvm_arch_vcpu[blocking/unblocking].

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:27 +02:00
Suravee Suthikulpanit
03543133ce KVM: x86: Introducing kvm_x86_ops VM init/destroy hooks
Adding function pointers in struct kvm_x86_ops for processor-specific
layer to provide hooks for when KVM initialize and destroy VM.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:26 +02:00
Christian Borntraeger
3491caf275 KVM: halt_polling: provide a way to qualify wakeups during poll
Some wakeups should not be considered a sucessful poll. For example on
s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
would be considered runnable - letting all vCPUs poll all the time for
transactional like workload, even if one vCPU would be enough.
This can result in huge CPU usage for large guests.
This patch lets architectures provide a way to qualify wakeups if they
should be considered a good/bad wakeups in regard to polls.

For s390 the implementation will fence of halt polling for anything but
known good, single vCPU events. The s390 implementation for floating
interrupts does a wakeup for one vCPU, but the interrupt will be delivered
by whatever CPU checks first for a pending interrupt. We prefer the
woken up CPU by marking the poll of this CPU as "good" poll.
This code will also mark several other wakeup reasons like IPI or
expired timers as "good". This will of course also mark some events as
not sucessful. As  KVM on z runs always as a 2nd level hypervisor,
we prefer to not poll, unless we are really sure, though.

This patch successfully limits the CPU usage for cases like uperf 1byte
transactional ping pong workload or wakeup heavy workload like OLTP
while still providing a proper speedup.

This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
wakeups that are considered not good for polling.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
Cc: David Matlack <dmatlack@google.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
[Rename config symbol. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-13 17:29:23 +02:00
Liang Chen
c54cdf141c KVM: x86: optimize steal time calculation
Since accumulate_steal_time is now only called in record_steal_time, it
doesn't quite make sense to put the delta calculation in a separate
function. The function could be called thousands of times before guest
enables the steal time MSR (though the compiler may optimize out this
function call). And after it's enabled, the MSR enable bit is tested twice
every time. Removing the accumulate_steal_time function also avoids the
necessity of having the accum_steal field.

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-04-20 15:29:17 +02:00
Paolo Bonzini
14ebda3394 KVM: x86: reduce default value of halt_poll_ns parameter
Windows lets applications choose the frequency of the timer tick,
and in Windows 10 the maximum rate was changed from 1024 Hz to
2048 Hz.  Unfortunately, because of the way the Windows API
works, most applications who need a higher rate than the default
64 Hz will just do

   timeGetDevCaps(&tc, sizeof(tc));
   timeBeginPeriod(tc.wPeriodMin);

and pick the maximum rate.  This causes very high CPU usage when
playing media or games on Windows 10, even if the guest does not
actually use the CPU very much, because the frequent timer tick
causes halt_poll_ns to kick in.

There is no really good solution, especially because Microsoft
could sooner or later bump the limit to 4096 Hz, but for now
the best we can do is lower a bit the upper limit for
halt_poll_ns. :-(

Reported-by: Jon Panozzo <jonp@lime-technology.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-04-01 12:10:10 +02:00
Huaitong Han
b9baba8614 KVM, pkeys: expose CPUID/CR4 to guest
X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:
CPUID.7.0.ECX[3]:PKU. X86_FEATURE_OSPKE is software support for pkeys,
enumerated with CPUID.7.0.ECX[4]:OSPKE, and it reflects the setting of
CR4.PKE(bit 22).

This patch disables CPUID:PKU without ept, because pkeys is not yet
implemented for shadow paging.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:38:17 +01:00
Huaitong Han
be94f6b710 KVM, pkeys: add pkeys support for permission_fault
Protection keys define a new 4-bit protection key field (PKEY) in bits
62:59 of leaf entries of the page tables, the PKEY is an index to PKRU
register(16 domains), every domain has 2 bits(write disable bit, access
disable bit).

Static logic has been produced in update_pkru_bitmask, dynamic logic need
read pkey from page table entries, get pkru value, and deduce the correct
result.

[ Huaitong: Xiao helps to modify many sections. ]

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:23:37 +01:00
Huaitong Han
2d344105f5 KVM, pkeys: introduce pkru_mask to cache conditions
PKEYS defines a new status bit in the PFEC. PFEC.PK (bit 5), if some
conditions is true, the fault is considered as a PKU violation.
pkru_mask indicates if we need to check PKRU.ADi and PKRU.WDi, and
does cache some conditions for permission_fault.

[ Huaitong: Xiao helps to modify many sections. ]

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:21:06 +01:00
Paolo Bonzini
5a5fbdc0e3 KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch
It is now equal to use_eager_fpu(), which simply tests a cpufeature bit.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-09 14:04:36 +01:00
Paolo Bonzini
6bb69c9b69 KVM: MMU: simplify last_pte_bitmap
Branch-free code is fun and everybody knows how much Avi loves it,
but last_pte_bitmap takes it a bit to the extreme.  Since the code
is simply doing a range check, like

	(level == 1 ||
	 ((gpte & PT_PAGE_SIZE_MASK) && level < N)

we can make it branch-free without storing the entire truth table;
it is enough to cache N.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-08 12:33:38 +01:00
Xiao Guangrong
13d268ca2c KVM: MMU: apply page track notifier
Register the notifier to receive write track event so that we can update
our shadow page table

It makes kvm_mmu_pte_write() be the callback of the notifier, no function
is changed

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:24 +01:00
Xiao Guangrong
0eb05bf290 KVM: page track: add notifier support
Notifier list is introduced so that any node wants to receive the track
event can register to the list

Two APIs are introduced here:
- kvm_page_track_register_notifier(): register the notifier to receive
  track event

- kvm_page_track_unregister_notifier(): stop receiving track event by
  unregister the notifier

The callback, node->track_write() is called when a write access on the
write tracked page happens

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:22 +01:00
Xiao Guangrong
e5691a81e8 KVM: MMU: clear write-flooding on the fast path of tracked page
If the page fault is caused by write access on write tracked page, the
real shadow page walking is skipped, we lost the chance to clear write
flooding for the page structure current vcpu is using

Fix it by locklessly waking shadow page table to clear write flooding
on the shadow page structure out of mmu-lock. So that we change the
count to atomic_t

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:22 +01:00
Xiao Guangrong
21ebbedadd KVM: page track: add the framework of guest page tracking
The array, gfn_track[mode][gfn], is introduced in memory slot for every
guest page, this is the tracking count for the gust page on different
modes. If the page is tracked then the count is increased, the page is
not tracked after the count reaches zero

We use 'unsigned short' as the tracking count which should be enough as
shadow page table only can use 2^14 (2^3 for level, 2^1 for cr4_pae, 2^2
for quadrant, 2^3 for access, 2^1 for nxe, 2^1 for cr0_wp, 2^1 for
smep_andnot_wp, 2^1 for smap_andnot_wp, and 2^1 for smm) at most, there
is enough room for other trackers

Two callbacks, kvm_page_track_create_memslot() and
kvm_page_track_free_memslot() are implemented in this patch, they are
internally used to initialize and reclaim the memory of the array

Currently, only write track mode is supported

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:20 +01:00
Xiao Guangrong
92f94f1e9e KVM: MMU: rename has_wrprotected_page to mmu_gfn_lpage_is_disallowed
kvm_lpage_info->write_count is used to detect if the large page mapping
for the gfn on the specified level is allowed, rename it to disallow_lpage
to reflect its purpose, also we rename has_wrprotected_page() to
mmu_gfn_lpage_is_disallowed() to make the code more clearer

Later we will extend this mechanism for page tracking: if the gfn is
tracked then large mapping for that gfn on any level is not allowed.
The new name is more straightforward

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:19 +01:00
Feng Wu
520040146a KVM: x86: Use vector-hashing to deliver lowest-priority interrupts
Use vector-hashing to deliver lowest-priority interrupts, As an
example, modern Intel CPUs in server platform use this method to
handle lowest-priority interrupts.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-09 13:24:40 +01:00
Paolo Bonzini
2860c4b167 KVM: move architecture-dependent requests to arch/
Since the numbers now overlap, it makes sense to enumerate
them in asm/kvm_host.h rather than linux/kvm_host.h.  Functions
that refer to architecture-specific requests are also moved
to arch/.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-08 19:04:36 +01:00
Andrey Smetanin
1f4b34f825 kvm/x86: Hyper-V SynIC timers
Per Hyper-V specification (and as required by Hyper-V-aware guests),
SynIC provides 4 per-vCPU timers.  Each timer is programmed via a pair
of MSRs, and signals expiration by delivering a special format message
to the configured SynIC message slot and triggering the corresponding
synthetic interrupt.

Note: as implemented by this patch, all periodic timers are "lazy"
(i.e. if the vCPU wasn't scheduled for more than the timer period the
timer events are lost), regardless of the corresponding configuration
MSR.  If deemed necessary, the "catch up" mode (the timer period is
shortened until the timer catches up) will be implemented later.

Changes v2:
* Use remainder to calculate periodic timer expiration time

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-16 18:49:45 +01:00
Takuya Yoshikawa
018aabb56d KVM: x86: MMU: Encapsulate the type of rmap-chain head in a new struct
New struct kvm_rmap_head makes the code type-safe to some extent.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:25:44 +01:00
Andrey Smetanin
db3975717a kvm/x86: Hyper-V kvm exit
A new vcpu exit is introduced to notify the userspace of the
changes in Hyper-V SynIC configuration triggered by guest writing to the
corresponding MSRs.

Changes v4:
* exit into userspace only if guest writes into SynIC MSR's

Changes v3:
* added KVM_EXIT_HYPERV types and structs notes into docs

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:24:22 +01:00
Andrey Smetanin
5c919412fe kvm/x86: Hyper-V synthetic interrupt controller
SynIC (synthetic interrupt controller) is a lapic extension,
which is controlled via MSRs and maintains for each vCPU
 - 16 synthetic interrupt "lines" (SINT's); each can be configured to
   trigger a specific interrupt vector optionally with auto-EOI
   semantics
 - a message page in the guest memory with 16 256-byte per-SINT message
   slots
 - an event flag page in the guest memory with 16 2048-bit per-SINT
   event flag areas

The host triggers a SINT whenever it delivers a new message to the
corresponding slot or flips an event flag bit in the corresponding area.
The guest informs the host that it can try delivering a message by
explicitly asserting EOI in lapic or writing to End-Of-Message (EOM)
MSR.

The userspace (qemu) triggers interrupts and receives EOM notifications
via irqfd with resampler; for that, a GSI is allocated for each
configured SINT, and irq_routing api is extended to support GSI-SINT
mapping.

Changes v4:
* added activation of SynIC by vcpu KVM_ENABLE_CAP
* added per SynIC active flag
* added deactivation of APICv upon SynIC activation

Changes v3:
* added KVM_CAP_HYPERV_SYNIC and KVM_IRQ_ROUTING_HV_SINT notes into
docs

Changes v2:
* do not use posted interrupts for Hyper-V SynIC AutoEOI vectors
* add Hyper-V SynIC vectors into EOI exit bitmap
* Hyper-V SyniIC SINT msr write logic simplified

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:24:22 +01:00
Andrey Smetanin
d62caabb41 kvm/x86: per-vcpu apicv deactivation support
The decision on whether to use hardware APIC virtualization used to be
taken globally, based on the availability of the feature in the CPU
and the value of a module parameter.

However, under certain circumstances we want to control it on per-vcpu
basis.  In particular, when the userspace activates HyperV synthetic
interrupt controller (SynIC), APICv has to be disabled as it's
incompatible with SynIC auto-EOI behavior.

To achieve that, introduce 'apicv_active' flag on struct
kvm_vcpu_arch, and kvm_vcpu_deactivate_apicv() function to turn APICv
off.  The flag is initialized based on the module parameter and CPU
capability, and consulted whenever an APICv-specific action is
performed.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:24:21 +01:00
Andrey Smetanin
6308630bd3 kvm/x86: split ioapic-handled and EOI exit bitmaps
The function to determine if the vector is handled by ioapic used to
rely on the fact that only ioapic-handled vectors were set up to
cause vmexits when virtual apic was in use.

We're going to break this assumption when introducing Hyper-V
synthetic interrupts: they may need to cause vmexits too.

To achieve that, introduce a new bitmap dedicated specifically for
ioapic-handled vectors, and populate EOI exit bitmap from it for now.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:24:21 +01:00
Paolo Bonzini
a96036b8ef KVM: x86: rename update_db_bp_intercept to update_bp_intercept
Because #DB is now intercepted unconditionally, this callback
only operates on #BP for both VMX and SVM.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:25 +01:00
Haozhong Zhang
4ba76538dd KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
Both VMX and SVM scales the host TSC in the same way in call-back
read_l1_tsc(), so this patch moves the scaling logic from call-back
read_l1_tsc() to a common function kvm_read_l1_tsc().

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:18 +01:00
Haozhong Zhang
58ea676787 KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
For both VMX and SVM, if the 2nd argument of call-back
adjust_tsc_offset() is the host TSC, then adjust_tsc_offset() will scale
it first. This patch moves this common TSC scaling logic to its caller
adjust_tsc_offset_host() and rename the call-back adjust_tsc_offset() to
adjust_tsc_offset_guest().

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:17 +01:00
Haozhong Zhang
07c1419a32 KVM: x86: Replace call-back compute_tsc_offset() with a common function
Both VMX and SVM calculate the tsc-offset in the same way, so this
patch removes the call-back compute_tsc_offset() and replaces it with a
common function kvm_compute_tsc_offset().

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:16 +01:00
Haozhong Zhang
381d585c80 KVM: x86: Replace call-back set_tsc_khz() with a common function
Both VMX and SVM propagate virtual_tsc_khz in the same way, so this
patch removes the call-back set_tsc_khz() and replaces it with a common
function.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:16 +01:00
Haozhong Zhang
35181e86df KVM: x86: Add a common TSC scaling function
VMX and SVM calculate the TSC scaling ratio in a similar logic, so this
patch generalizes it to a common TSC scaling function.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
[Inline the multiplication and shift steps into mul_u64_u64_shr.  Remove
 BUG_ON.  - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:15 +01:00
Haozhong Zhang
ad721883e9 KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
This patch moves the field of TSC scaling ratio from the architecture
struct vcpu_svm to the common struct kvm_vcpu_arch.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:14 +01:00
Haozhong Zhang
bc9b961b35 KVM: x86: Collect information for setting TSC scaling ratio
The number of bits of the fractional part of the 64-bit TSC scaling
ratio in VMX and SVM is different. This patch makes the architecture
code to collect the number of fractional bits and other related
information into variables that can be accessed in the common code.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:14 +01:00
Paolo Bonzini
893590c734 KVM: x86: declare a few variables as __read_mostly
These include module parameters and variables that are set by
kvm_x86_ops->hardware_setup.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-10 12:06:13 +01:00
Paolo Bonzini
197a4f4b06 KVM/ARM Changes for v4.4-rc1
Includes a number of fixes for the arch-timer, introducing proper
 level-triggered semantics for the arch-timers, a series of patches to
 synchronously halt a guest (prerequisite for IRQ forwarding), some tracepoint
 improvements, a tweak for the EL2 panic handlers, some more VGIC cleanups
 getting rid of redundant state, and finally a stylistic change that gets rid of
 some ctags warnings.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWOhgAAAoJEEtpOizt6ddyKS8H/2ZHMTPjo6yChnrusNWy4Qbr
 6laPDlzL+g45oMQRwNL7GnM1deRftaxvT2Wi+X84D/6Y/BD6MPds4HgtBfuWcSZ1
 CyRJ0Ot/zrxenucSuJuOjq+a9gdizdAczkbB1MfYDULJH8fb6D+7RYLo3zgh4Xo4
 pla3L9U6gSWe+YopBjZtZH43m3fwiwSM/v+uHOTIcXrsbR+fEgx/EFSKmA/DUCuo
 P5cFO/ceUGu7nATCexu5V82TgR2hvurrsR7mqfwY8YcF6HRM+NEOoS29xWC77v5S
 u/F08TKuKQLv0YTEFTyLETI/oEeuC0cHtrRQBNf4+9kXEOzKyXaae0wR/I6X2Ss=
 =GMNk
 -----END PGP SIGNATURE-----

Merge tag 'kvm-arm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/ARM Changes for v4.4-rc1

Includes a number of fixes for the arch-timer, introducing proper
level-triggered semantics for the arch-timers, a series of patches to
synchronously halt a guest (prerequisite for IRQ forwarding), some tracepoint
improvements, a tweak for the EL2 panic handlers, some more VGIC cleanups
getting rid of redundant state, and finally a stylistic change that gets rid of
some ctags warnings.

Conflicts:
	arch/x86/include/asm/kvm_host.h
2015-11-04 16:24:17 +01:00
Christoffer Dall
3217f7c25b KVM: Add kvm_arch_vcpu_{un}blocking callbacks
Some times it is useful for architecture implementations of KVM to know
when the VCPU thread is about to block or when it comes back from
blocking (arm/arm64 needs to know this to properly implement timers, for
example).

Therefore provide a generic architecture callback function in line with
what we do elsewhere for KVM generic-arch interactions.

Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-10-22 23:01:41 +02:00
Paolo Bonzini
58f800d5ac Merge branch 'kvm-master' into HEAD
This merge brings in a couple important SMM fixes, which makes it
easier to test latest KVM with unrestricted_guest=0 and to test
the in-progress work on SMM support in the firmware.

Conflicts:
	arch/x86/kvm/x86.c
2015-10-13 21:32:50 +02:00
Paolo Bonzini
1d8007bdee KVM: x86: build kvm_userspace_memory_region in x86_set_memory_region
The next patch will make x86_set_memory_region fill the
userspace_addr.  Since the struct is not used untouched
anymore, it makes sense to build it in x86_set_memory_region
directly; it also simplifies the callers.

Reported-by: Alexandre DERUMIER <aderumier@odiso.com>
Cc: stable@vger.kernel.org
Fixes: 9da0e4d5ac
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-13 18:28:46 +02:00
Feng Wu
bf9f6ac8d7 KVM: Update Posted-Interrupts Descriptor when vCPU is blocked
This patch updates the Posted-Interrupts Descriptor when vCPU
is blocked.

pre-block:
- Add the vCPU to the blocked per-CPU list
- Set 'NV' to POSTED_INTR_WAKEUP_VECTOR

post-block:
- Remove the vCPU from the per-CPU list

Signed-off-by: Feng Wu <feng.wu@intel.com>
[Concentrate invocation of pre/post-block hooks to vcpu_block. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:53 +02:00
Feng Wu
8727688006 KVM: x86: select IRQ_BYPASS_MANAGER
Select IRQ_BYPASS_MANAGER for x86 when CONFIG_KVM is set

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:52 +02:00
Feng Wu
efc644048e KVM: x86: Update IRTE for posted-interrupts
This patch adds the routine to update IRTE for posted-interrupts
when guest changes the interrupt configuration.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
[Squashed in automatically generated patch from the build robot
 "KVM: x86: vcpu_to_pi_desc() can be static" - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:51 +02:00
Feng Wu
d84f1e0755 KVM: make kvm_set_msi_irq() public
Make kvm_set_msi_irq() public, we can use this function outside.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:50 +02:00
Feng Wu
8feb4a04dc KVM: Define a new interface kvm_intr_is_single_vcpu()
This patch defines a new interface kvm_intr_is_single_vcpu(),
which can returns whether the interrupt is for single-CPU or not.

It is used by VT-d PI, since now we only support single-CPU
interrupts, For lowest-priority interrupts, if user configures
it via /proc/irq or uses irqbalance to make it single-CPU, we
can use PI to deliver the interrupts to it. Full functionality
of lowest-priority support will be added later.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:49 +02:00
Andrey Smetanin
9eec50b8bb kvm/x86: Hyper-V HV_X64_MSR_VP_RUNTIME support
HV_X64_MSR_VP_RUNTIME msr used by guest to get
"the time the virtual processor consumes running guest code,
and the time the associated logical processor spends running
hypervisor code on behalf of that guest."

Calculation of this time is performed by task_cputime_adjusted()
for vcpu task.

Necessary to support loading of winhv.sys in guest, which in turn is
required to support Windows VMBus.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:33 +02:00
Steve Rutherford
1c1a9ce973 KVM: x86: Add support for local interrupt requests from userspace
In order to enable userspace PIC support, the userspace PIC needs to
be able to inject local interrupts even when the APICs are in the
kernel.

KVM_INTERRUPT now supports sending local interrupts to an APIC when
APICs are in the kernel.

The ready_for_interrupt_request flag is now only set when the CPU/APIC
will immediately accept and inject an interrupt (i.e. APIC has not
masked the PIC).

When the PIC wishes to initiate an INTA cycle with, say, CPU0, it
kicks CPU0 out of the guest, and renedezvous with CPU0 once it arrives
in userspace.

When the CPU/APIC unmasks the PIC, a KVM_EXIT_IRQ_WINDOW_OPEN is
triggered, so that userspace has a chance to inject a PIC interrupt
if it had been pending.

Overall, this design can lead to a small number of spurious userspace
renedezvous. In particular, whenever the PIC transistions from low to
high while it is masked and whenever the PIC becomes unmasked while
it is low.

Note: this does not buffer more than one local interrupt in the
kernel, so the VMM needs to enter the guest in order to complete
interrupt injection before injecting an additional interrupt.

Compiles for x86.

Can pass the KVM Unit Tests.

Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:29 +02:00
Steve Rutherford
b053b2aef2 KVM: x86: Add EOI exit bitmap inference
In order to support a userspace IOAPIC interacting with an in kernel
APIC, the EOI exit bitmaps need to be configurable.

If the IOAPIC is in userspace (i.e. the irqchip has been split), the
EOI exit bitmaps will be set whenever the GSI Routes are configured.
In particular, for the low MSI routes are reservable for userspace
IOAPICs. For these MSI routes, the EOI Exit bit corresponding to the
destination vector of the route will be set for the destination VCPU.

The intention is for the userspace IOAPICs to use the reservable MSI
routes to inject interrupts into the guest.

This is a slight abuse of the notion of an MSI Route, given that MSIs
classically bypass the IOAPIC. It might be worthwhile to add an
additional route type to improve clarity.

Compile tested for Intel x86.

Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:28 +02:00
Steve Rutherford
7543a635aa KVM: x86: Add KVM exit for IOAPIC EOIs
Adds KVM_EXIT_IOAPIC_EOI which allows the kernel to EOI
level-triggered IOAPIC interrupts.

Uses a per VCPU exit bitmap to decide whether or not the IOAPIC needs
to be informed (which is identical to the EOI_EXIT_BITMAP field used
by modern x86 processors, but can also be used to elide kvm IOAPIC EOI
exits on older processors).

[Note: A prototype using ResampleFDs found that decoupling the EOI
from the VCPU's thread made it possible for the VCPU to not see a
recent EOI after reentering the guest. This does not match real
hardware.]

Compile tested for Intel x86.

Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:27 +02:00
Steve Rutherford
49df6397ed KVM: x86: Split the APIC from the rest of IRQCHIP.
First patch in a series which enables the relocation of the
PIC/IOAPIC to userspace.

Adds capability KVM_CAP_SPLIT_IRQCHIP;

KVM_CAP_SPLIT_IRQCHIP enables the construction of LAPICs without the
rest of the irqchip.

Compile tested for x86.

Signed-off-by: Steve Rutherford <srutherford@google.com>
Suggested-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:26 +02:00
Paolo Bonzini
d50ab6c1a2 KVM: x86: replace vm_has_apicv hook with cpu_uses_apicv
This will avoid an unnecessary trip to ->kvm and from there to the VPIC.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:24 +02:00
Paolo Bonzini
3bb345f387 KVM: x86: store IOAPIC-handled vectors in each VCPU
We can reuse the algorithm that computes the EOI exit bitmap to figure
out which vectors are handled by the IOAPIC.  The only difference
between the two is for edge-triggered interrupts other than IRQ8
that have no notifiers active; however, the IOAPIC does not have to
do anything special for these interrupts anyway.

This again limits the interactions between the IOAPIC and the LAPIC,
making it easier to move the former to userspace.

Inspired by a patch from Steve Rutherford.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:23 +02:00
David Hildenbrand
920552b213 KVM: disable halt_poll_ns as default for s390x
We observed some performance degradation on s390x with dynamic
halt polling. Until we can provide a proper fix, let's enable
halt_poll_ns as default only for supported architectures.

Architectures are now free to set their own halt_poll_ns
default value.

Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-25 10:31:30 +02:00
Paolo Bonzini
62bea5bff4 KVM: add halt_attempted_poll to VCPU stats
This new statistic can help diagnosing VCPUs that, for any reason,
trigger bad behavior of halt_poll_ns autotuning.

For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
10+20+40+80+160+320+480 = 1110 microseconds out of every
479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
is consuming about 30% more CPU than it would use without
polling.  This would show as an abnormally high number of
attempted polling compared to the successful polls.

Acked-by: Christian Borntraeger <borntraeger@de.ibm.com<
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-16 12:17:00 +02:00
Xiao Guangrong
c258b62b26 KVM: MMU: introduce the framework to check zero bits on sptes
We have abstracted the data struct and functions which are used to check
reserved bit on guest page tables, now we extend the logic to check
zero bits on shadow page tables

The zero bits on sptes include not only reserved bits on hardware but also
the bits that SPTEs willnever use.  For example, shadow pages will never
use GB pages unless the guest uses them too.

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-08-05 12:47:24 +02:00
Xiao Guangrong
a0a64f50aa KVM: MMU: introduce rsvd_bits_validate
These two fields, rsvd_bits_mask and bad_mt_xwr, in "struct kvm_mmu" are
used to check if reserved bits set on guest ptes, move them to a data
struct so that the approach can be applied to check host shadow page
table entries as well

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-08-05 12:47:23 +02:00
Paolo Bonzini
d71ba78834 KVM: move code related to KVM_SET_BOOT_CPU_ID to x86
This is another remnant of ia64 support.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-07-29 14:27:21 +02:00
Andrey Smetanin
e7d9513b60 kvm/x86: added hyper-v crash msrs into kvm hyperv context
Added kvm Hyper-V context hv crash variables as storage
of Hyper-V crash msrs.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Peter Hornyack <peterhornyack@google.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-07-23 08:27:06 +02:00
Andrey Smetanin
e83d58874b kvm/x86: move Hyper-V MSR's/hypercall code into hyperv.c file
This patch introduce Hyper-V related source code file - hyperv.c and
per vm and per vcpu hyperv context structures.
All Hyper-V MSR's and hypercall code moved into hyperv.c.
All Hyper-V kvm/vcpu fields moved into appropriate hyperv context
structures. Copyrights and authors information copied from x86.c
to hyperv.c.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Peter Hornyack <peterhornyack@google.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-07-23 08:27:06 +02:00
Paolo Bonzini
5544eb9b81 KVM: count number of assigned devices
If there are no assigned devices, the guest PAT are not providing
any useful information and can be overridden to writeback; VMX
always does this because it has the "IPAT" bit in its extended
page table entries, but SVM does not have anything similar.
Hook into VFIO and legacy device assignment so that they
provide this information to KVM.

Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-07-10 13:25:26 +02:00
Radim Krčmář
42720138b0 KVM: x86: make vapics_in_nmi_mode atomic
Writes were a bit racy, but hard to turn into a bug at the same time.
(Particularly because modern Linux doesn't use this feature anymore.)

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
[Actually the next patch makes it much, much easier to trigger the race
 so I'm including this one for stable@ as well. - Paolo]
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-07-03 18:55:17 +02:00
Linus Torvalds
4e241557fc The bulk of the changes here is for x86. And for once it's not
for silicon that no one owns: these are really new features for
 everyone.
 
 * ARM: several features are in progress but missed the 4.2 deadline.
 So here is just a smattering of bug fixes, plus enabling the VFIO
 integration.
 
 * s390: Some fixes/refactorings/optimizations, plus support for
 2GB pages.
 
 * x86: 1) host and guest support for marking kvmclock as a stable
 scheduler clock. 2) support for write combining. 3) support for
 system management mode, needed for secure boot in guests. 4) a bunch
 of cleanups required for 2+3.  5) support for virtualized performance
 counters on AMD; 6) legacy PCI device assignment is deprecated and
 defaults to "n" in Kconfig; VFIO replaces it.  On top of this there are
 also bug fixes and eager FPU context loading for FPU-heavy guests.
 
 * Common code: Support for multiple address spaces; for now it is
 used only for x86 SMM but the s390 folks also have plans.
 
 There are some x86 conflicts, one with the rc8 pull request and
 the rest with Ingo's FPU rework.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJViYzhAAoJEL/70l94x66Dda0H/1IepMbfEy+o849d5G71fNTs
 F8Y8qUP2GZuL7T53FyFUGSBw+AX7kimu9ia4gR/PmDK+QYsdosYeEjwlsolZfTBf
 sHuzNtPoJhi5o1o/ur4NGameo0WjGK8f1xyzr+U8z74QDQyQv/QYCdK/4isp4BJL
 ugHNHkuROX6Zng4i7jc9rfaSRg29I3GBxQUYpMkEnD3eMYMUBWGm6Rs8pHgGAMvL
 vqzntgW00WNxehTqcAkmD/Wv+txxhkvIadZnjgaxH49e9JeXeBKTIR5vtb7Hns3s
 SuapZUyw+c95DIipXq4EznxxaOrjbebOeFgLCJo8+XMXZum8RZf/ob24KroYad0=
 =YsAR
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull first batch of KVM updates from Paolo Bonzini:
 "The bulk of the changes here is for x86.  And for once it's not for
  silicon that no one owns: these are really new features for everyone.

  Details:

   - ARM:
        several features are in progress but missed the 4.2 deadline.
        So here is just a smattering of bug fixes, plus enabling the
        VFIO integration.

   - s390:
        Some fixes/refactorings/optimizations, plus support for 2GB
        pages.

   - x86:
        * host and guest support for marking kvmclock as a stable
          scheduler clock.
        * support for write combining.
        * support for system management mode, needed for secure boot in
          guests.
        * a bunch of cleanups required for the above
        * support for virtualized performance counters on AMD
        * legacy PCI device assignment is deprecated and defaults to "n"
          in Kconfig; VFIO replaces it

        On top of this there are also bug fixes and eager FPU context
        loading for FPU-heavy guests.

   - Common code:
        Support for multiple address spaces; for now it is used only for
        x86 SMM but the s390 folks also have plans"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits)
  KVM: s390: clear floating interrupt bitmap and parameters
  KVM: x86/vPMU: Enable PMU handling for AMD PERFCTRn and EVNTSELn MSRs
  KVM: x86/vPMU: Implement AMD vPMU code for KVM
  KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch
  KVM: x86/vPMU: introduce kvm_pmu_msr_idx_to_pmc
  KVM: x86/vPMU: reorder PMU functions
  KVM: x86/vPMU: whitespace and stylistic adjustments in PMU code
  KVM: x86/vPMU: use the new macros to go between PMC, PMU and VCPU
  KVM: x86/vPMU: introduce pmu.h header
  KVM: x86/vPMU: rename a few PMU functions
  KVM: MTRR: do not map huge page for non-consistent range
  KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type
  KVM: MTRR: introduce mtrr_for_each_mem_type
  KVM: MTRR: introduce fixed_mtrr_addr_* functions
  KVM: MTRR: sort variable MTRRs
  KVM: MTRR: introduce var_mtrr_range
  KVM: MTRR: introduce fixed_mtrr_segment table
  KVM: MTRR: improve kvm_mtrr_get_guest_memory_type
  KVM: MTRR: do not split 64 bits MSR content
  KVM: MTRR: clean up mtrr default type
  ...
2015-06-24 09:36:49 -07:00
Wei Huang
25462f7f52 KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch
This patch defines a new function pointer struct (kvm_pmu_ops) to
support vPMU for both Intel and AMD. The functions pointers defined in
this new struct will be linked with Intel and AMD functions later. In the
meanwhile the struct that maps from event_sel bits to PERF_TYPE_HARDWARE
events is renamed and moved from Intel specific code to kvm_host.h as a
common struct.

Reviewed-by: Joerg Roedel <jroedel@suse.de>
Tested-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-23 14:12:14 +02:00
Wei Huang
474a5bb944 KVM: x86/vPMU: introduce pmu.h header
This will be used for private function used by AMD- and Intel-specific
PMU implementations.

Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:29 +02:00
Wei Huang
c6702c9dcf KVM: x86/vPMU: rename a few PMU functions
Before introducing a pmu.h header for them, make the naming more
consistent.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:29 +02:00
Xiao Guangrong
19efffa244 KVM: MTRR: sort variable MTRRs
Sort all valid variable MTRRs based on its base address, it will help us to
check a range to see if it's fully contained in variable MTRRs

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Fix list insertion sort, simplify var_mtrr_range_is_valid to just
 test the V bit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:28 +02:00
Xiao Guangrong
86fd52701c KVM: MTRR: do not split 64 bits MSR content
Variable MTRR MSRs are 64 bits which are directly accessed with full length,
no reason to split them to two 32 bits

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:27 +02:00
Xiao Guangrong
10fac2dc2b KVM: MTRR: clean up mtrr default type
Drop kvm_mtrr->enable, omit the decode/code workload and get rid of
all the hard code

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:27 +02:00
Xiao Guangrong
910a6aae4e KVM: MTRR: exactly define the size of variable MTRRs
Only KVM_NR_VAR_MTRR variable MTRRs are available in KVM guest

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:27 +02:00
Xiao Guangrong
70109e7d9d KVM: MTRR: remove mtrr_state.have_fixed
vMTRR does not depend on any host MTRR feature and fixed MTRRs have always
been implemented, so drop this field

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:27 +02:00
Xiao Guangrong
ff53604b40 KVM: x86: move MTRR related code to a separate file
MTRR code locates in x86.c and mmu.c so that move them to a separate file to
make the organization more clearer and it will be the place where we fully
implement vMTRR

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:26 +02:00
Paolo Bonzini
6d396b5520 KVM: x86: advertise KVM_CAP_X86_SMM
... and we're done. :)

Because SMBASE is usually relocated above 1M on modern chipsets, and
SMM handlers might indeed rely on 4G segment limits, we only expose it
if KVM is able to run the guest in big real mode.  This includes any
of VMX+emulate_invalid_guest_state, VMX+unrestricted_guest, or SVM.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-05 17:26:38 +02:00
Paolo Bonzini
699023e239 KVM: x86: add SMM to the MMU role, support SMRAM address space
This is now very simple to do.  The only interesting part is a simple
trick to find the right memslot in gfn_to_rmap, retrieving the address
space from the spte role word.  The same trick is used in the auditing
code.

The comment on top of union kvm_mmu_page_role has been stale forever,
so remove it.  Speaking of stale code, remove pad_for_nice_hex_output
too: it was splitting the "access" bitfield across two bytes and thus
had effectively turned into pad_for_ugly_hex_output.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-05 17:26:37 +02:00
Paolo Bonzini
9da0e4d5ac KVM: x86: work on all available address spaces
This patch has no semantic change, but it prepares for the introduction
of a second address space for system management mode.

A new function x86_set_memory_region (and the "slots_lock taken"
counterpart __x86_set_memory_region) is introduced in order to
operate on all address spaces when adding or deleting private
memory slots.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-05 17:26:37 +02:00
Paolo Bonzini
54bf36aac5 KVM: x86: use vcpu-specific functions to read/write/translate GFNs
We need to hide SMRAM from guests not running in SMM.  Therefore,
all uses of kvm_read_guest* and kvm_write_guest* must be changed to
check whether the VCPU is in system management mode and use a
different set of memslots.  Switch from kvm_* to the newly-introduced
kvm_vcpu_*, which call into kvm_arch_vcpu_memslots_id.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-05 17:26:36 +02:00
Paolo Bonzini
64d6067057 KVM: x86: stubs for SMM support
This patch adds the interface between x86.c and the emulator: the
SMBASE register, a new emulator flag, the RSM instruction.  It also
adds a new request bit that will be used by the KVM_SMI ioctl.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-04 16:01:45 +02:00
Paolo Bonzini
f077825a87 KVM: x86: API changes for SMM support
This patch includes changes to the external API for SMM support.
Userspace can predicate the availability of the new fields and
ioctls on a new capability, KVM_CAP_X86_SMM, which is added at the end
of the patch series.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-04 16:01:11 +02:00
Paolo Bonzini
609e36d372 KVM: x86: pass host_initiated to functions that read MSRs
SMBASE is only readable from SMM for the VCPU, but it must be always
accessible if userspace is accessing it.  Thus, all functions that
read MSRs are changed to accept a struct msr_data; the host_initiated
and index fields are pre-initialized, while the data field is filled
on return.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-04 16:01:00 +02:00
Paolo Bonzini
f36f3f2846 KVM: add "new" argument to kvm_arch_commit_memory_region
This lets the function access the new memory slot without going through
kvm_memslots and id_to_memslot.  It will simplify the code when more
than one address space will be supported.

Unfortunately, the "const"ness of the new argument must be casted
away in two places.  Fixing KVM to accept const struct kvm_memory_slot
pointers would require modifications in pretty much all architectures,
and is left for later.

Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-28 10:42:58 +02:00
Ingo Molnar
3152657f10 Linux 4.1-rc5
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVYnloAAoJEHm+PkMAQRiGCgkH/j3r2djOOm4h83FXrShaHORY
 p8TBI3FNj4fzLk2PfzqbmiDw2T2CwygB+pxb2Ac9CE99epw8qPk2SRvPXBpdKR7t
 lolhhwfzApLJMZbhzNLVywUCDUhFoiEWRhmPqIfA3WXFcIW3t5VNXAoIFjV5HFr6
 sYUlaxSI1XiQ5tldVv8D6YSFHms41pisziBIZmzhIUg10P6Vv3D0FbE74fjAJwx0
 +08zj3EO7yQMv7Aeeq8F8AJ3628142rcZf0NWF5ohlKLRK3gt0cl9jO5U4Co2dDt
 29v03LP5EI6jDKkIbuWlqRMq9YxJz7N3wnkzV0EJiqXucoqPLFDqzbxB4gnS1pI=
 =7vbA
 -----END PGP SIGNATURE-----

Merge branch 'linus' into x86/fpu

Resolve semantic conflict in arch/x86/kvm/cpuid.c with:

  c447e76b4c ("kvm/fpu: Enable eager restore kvm FPU for MPX")

By removing the FPU internal include files.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-25 09:39:19 +02:00
Paolo Bonzini
a9b4fb7e79 Merge branch 'kvm-master' into kvm-next
Grab MPX bugfix, and fix conflicts against Rik's adaptive FPU
deactivation patch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-20 12:31:37 +02:00
Liang Li
c447e76b4c kvm/fpu: Enable eager restore kvm FPU for MPX
The MPX feature requires eager KVM FPU restore support. We have verified
that MPX cannot work correctly with the current lazy KVM FPU restore
mechanism. Eager KVM FPU restore should be enabled if the MPX feature is
exposed to VM.

Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
[Also activate the FPU on AMD processors. - Paolo]
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-20 12:30:26 +02:00
Paolo Bonzini
0fdd74f778 Revert "KVM: x86: drop fpu_activate hook"
This reverts commit 4473b570a7.  We'll
use the hook again.

Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-20 12:30:15 +02:00
Xiao Guangrong
edc90b7dc4 KVM: MMU: fix SMAP virtualization
KVM may turn a user page to a kernel page when kernel writes a readonly
user page if CR0.WP = 1. This shadow page entry will be reused after
SMAP is enabled so that kernel is allowed to access this user page

Fix it by setting SMAP && !CR0.WP into shadow page's role and reset mmu
once CR4.SMAP is updated

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-19 20:52:36 +02:00
Ingo Molnar
0ee6a51725 x86/fpu, kvm: Simplify fx_init()
Now that fpstate_init() cannot fail the error return of fx_init()
has lost its purpose. Eliminate the error return and propagate this
change to all callers.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:51 +02:00
Xiao Guangrong
0be0226f07 KVM: MMU: fix SMAP virtualization
KVM may turn a user page to a kernel page when kernel writes a readonly
user page if CR0.WP = 1. This shadow page entry will be reused after
SMAP is enabled so that kernel is allowed to access this user page

Fix it by setting SMAP && !CR0.WP into shadow page's role and reset mmu
once CR4.SMAP is updated

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-11 17:17:50 +02:00
James Sullivan
93bbf0b8bc kvm: x86: Extended struct kvm_lapic_irq with msi_redir_hint for MSI delivery
Extended struct kvm_lapic_irq with bool msi_redir_hint, which will
be used to determine if the delivery of the MSI should target only
the lowest priority CPU in the logical group specified for delivery.
(In physical dest mode, the RH bit is not relevant). Initialized the value
of msi_redir_hint to true when RH=1 in kvm_set_msi_irq(), and initialized
to false in all other cases.

Added value of msi_redir_hint to a debug message dump of an IRQ in
apic_send_ipi().

Signed-off-by: James Sullivan <sullivan.james.f@gmail.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-07 11:29:44 +02:00
Paolo Bonzini
b7cb223173 KVM: x86: tweak types of fields in kvm_lapic_irq
Change to u16 if they only contain data in the low 16 bits.

Change the level field to bool, since we assign 1 sometimes, but
just mask icr_low with APIC_INT_ASSERT in apic_send_ipi.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-07 11:29:43 +02:00
Nadav Amit
d28bc9dd25 KVM: x86: INIT and reset sequences are different
x86 architecture defines differences between the reset and INIT sequences.
INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU,
MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP.

References (from Intel SDM):

"If the MP protocol has completed and a BSP is chosen, subsequent INITs (either
to a specific processor or system wide) do not cause the MP protocol to be
repeated." [8.4.2: MP Initialization Protocol Requirements and Restrictions]

[Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT]

"If the processor is reset by asserting the INIT# pin, the x87 FPU state is not
changed." [9.2: X87 FPU INITIALIZATION]

"The state of the local APIC following an INIT reset is the same as it is after
a power-up or hardware reset, except that the APIC ID and arbitration ID
registers are not affected." [10.4.7.3: Local APIC State After an INIT Reset
("Wait-for-SIPI" State)]

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1428924848-28212-1-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-07 11:29:43 +02:00
Nadav Amit
90de4a1875 KVM: x86: Support for disabling quirks
Introducing KVM_CAP_DISABLE_QUIRKS for disabling x86 quirks that were previous
created in order to overcome QEMU issues. Those issue were mostly result of
invalid VM BIOS.  Currently there are two quirks that can be disabled:

1. KVM_QUIRK_LINT0_REENABLED - LINT0 was enabled after boot
2. KVM_QUIRK_CD_NW_CLEARED - CD and NW are cleared after boot

These two issues are already resolved in recent releases of QEMU, and would
therefore be disabled by QEMU.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1428879221-29996-1-git-send-email-namit@cs.technion.ac.il>
[Report capability from KVM_CHECK_EXTENSION too. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-07 11:29:42 +02:00
Wanpeng Li
3ea3b7fa9a kvm: mmu: lazy collapse small sptes into large sptes
Dirty logging tracks sptes in 4k granularity, meaning that large sptes
have to be split.  If live migration is successful, the guest in the
source machine will be destroyed and large sptes will be created in the
destination. However, the guest continues to run in the source machine
(for example if live migration fails), small sptes will remain around
and cause bad performance.

This patch introduce lazy collapsing of small sptes into large sptes.
The rmap will be scanned in ioctl context when dirty logging is stopped,
dropping those sptes which can be collapsed into a single large-page spte.
Later page faults will create the large-page sptes.

Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Message-Id: <1428046825-6905-1-git-send-email-wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08 10:47:04 +02:00
Nadav Amit
ae561edeb4 KVM: x86: DR0-DR3 are not clear on reset
DR0-DR3 are not cleared as they should during reset and when they are set from
userspace.  It appears to be caused by c77fb5fe6f ("KVM: x86: Allow the guest
to run with dirty debug registers").

Force their reload on these situations.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427933438-12782-4-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08 10:47:03 +02:00
Radim Krčmář
3b5a5ffa92 KVM: x86: simplify kvm_apic_map
recalculate_apic_map() uses two passes over all VCPUs.  This is a relic
from time when we selected a global mode in the first pass and set up
the optimized table in the second pass (to have a consistent mode).

Recent changes made mixed mode unoptimized and we can do it in one pass.
Format of logical MDA is a function of the mode, so we encode it in
apic_logical_id() and drop obsoleted variables from the struct.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Message-Id: <1423766494-26150-5-git-send-email-rkrcmar@redhat.com>
[Add lid_bits temporary in apic_logical_id. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08 10:47:01 +02:00
Radim Krčmář
3548a259f6 KVM: x86: avoid logical_map when it is invalid
We want to support mixed modes and the easiest solution is to avoid
optimizing those weird and unlikely scenarios.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Message-Id: <1423766494-26150-4-git-send-email-rkrcmar@redhat.com>
[Add comment above KVM_APIC_MODE_* defines. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08 10:47:01 +02:00
Radim Krčmář
9ea369b032 KVM: x86: fix mixed APIC mode broadcast
Broadcast allowed only one global APIC mode, but mixed modes are
theoretically possible.  x2APIC IPI doesn't mean 0xff as broadcast,
the rest does.

x2APIC broadcasts are accepted by xAPIC.  If we take SDM to be logical,
even addreses beginning with 0xff should be accepted, but real hardware
disagrees.  This patch aims for simple code by considering most of real
behavior as undefined.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Message-Id: <1423766494-26150-3-git-send-email-rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08 10:47:00 +02:00
Eugene Korenevsky
5a4f55cde8 KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu
cpuid_maxphyaddr(), which performs lot of memory accesses is called
extensively across KVM, especially in nVMX code.

This patch adds a cached value of maxphyaddr to vcpu.arch to reduce the
pressure onto CPU cache and simplify the code of cpuid_maxphyaddr()
callers. The cached value is initialized in kvm_arch_vcpu_init() and
reloaded every time CPUID is updated by usermode. It is obvious that
these reloads occur infrequently.

Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Message-Id: <20150329205612.GA1223@gnote>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08 10:46:56 +02:00
Nadav Amit
b32a991800 KVM: x86: Remove redundant definitions
Some constants are redfined in emulate.c. Avoid it.

s/SELECTOR_RPL_MASK/SEGMENT_RPL_MASK
s/SELECTOR_TI_MASK/SEGMENT_TI_MASK

No functional change.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427635984-8113-3-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-03-30 16:46:42 +02:00
Nadav Amit
0efb04406d KVM: x86: removing redundant eflags bits definitions
The eflags are redefined (using other defines) in emulate.c.
Use the definition from processor-flags.h as some mess already started.
No functional change.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427635984-8113-2-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-03-30 16:46:37 +02:00
Joel Schopp
5cb56059c9 kvm: x86: make kvm_emulate_* consistant
Currently kvm_emulate() skips the instruction but kvm_emulate_* sometimes
don't.  The end reult is the caller ends up doing the skip themselves.
Let's make them consistant.

Signed-off-by: Joel Schopp <joel.schopp@amd.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-10 20:29:15 -03:00
Paolo Bonzini
f781951299 kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.

This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest.  KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too.  When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.

With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest.  This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.

Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host.  The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.

The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter.  It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.

While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls.  During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark.  The wasted time is thus very low.  Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.

The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer.  Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns.  For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-06 13:08:37 +01:00
Tiejun Chen
1c2b364b22 kvm: remove KVM_MMIO_SIZE
After f78146b0f9, "KVM: Fix page-crossing MMIO", and
87da7e66a4, "KVM: x86: fix vcpu->mmio_fragments overflow",
actually KVM_MMIO_SIZE is gone.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-05 12:26:14 +01:00
Marcelo Tosatti
2e6d015799 KVM: x86: revert "add method to test PIR bitmap vector"
Revert 7c6a98dfa1, given
that testing PIR is not necessary anymore.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-02 18:36:34 +01:00
Kai Huang
88178fd4f7 KVM: x86: Add new dirty logging kvm_x86_ops for PML
This patch adds new kvm_x86_ops dirty logging hooks to enable/disable dirty
logging for particular memory slot, and to flush potentially logged dirty GPAs
before reporting slot->dirty_bitmap to userspace.

kvm x86 common code calls these hooks when they are available so PML logic can
be hidden to VMX specific. SVM won't be impacted as these hooks remain NULL
there.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-29 15:31:41 +01:00
Kai Huang
1c91cad423 KVM: x86: Change parameter of kvm_mmu_slot_remove_write_access
This patch changes the second parameter of kvm_mmu_slot_remove_write_access from
'slot id' to 'struct kvm_memory_slot *' to align with kvm_x86_ops dirty logging
hooks, which will be introduced in further patch.

Better way is to change second parameter of kvm_arch_commit_memory_region from
'struct kvm_userspace_memory_region *' to 'struct kvm_memory_slot * new', but it
requires changes on other non-x86 ARCH too, so avoid it now.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-29 15:31:37 +01:00
Kai Huang
f4b4b18086 KVM: MMU: Add mmu help functions to support PML
This patch adds new mmu layer functions to clear/set D-bit for memory slot, and
to write protect superpages for memory slot.

In case of PML, CPU logs the dirty GPA automatically to PML buffer when CPU
updates D-bit from 0 to 1, therefore we don't have to write protect 4K pages,
instead, we only need to clear D-bit in order to log that GPA.

For superpages, we still write protect it and let page fault code to handle
dirty page logging, as we still need to split superpage to 4K pages in PML.

As PML is always enabled during guest's lifetime, to eliminate unnecessary PML
GPA logging, we set D-bit manually for the slot with dirty logging disabled.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-29 15:31:29 +01:00
Paolo Bonzini
1c6007d59a KVM/ARM changes for v3.20 including GICv3 emulation, dirty page logging, added
trace symbols, and adding an explicit VGIC init device control IOCTL.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUwhsKAAoJEEtpOizt6ddyuSEH/ia2uf07N0i+C1dPKYiqhKEd
 nFqBvgrhAMVztWLmy1Wq4SnO9YNd+CrPYATrfCiYsYQ9aKc09+qDq+uo06bVpZXz
 KsHjVGUsdyJ4qRqjDixkPvZviGIXa6C//+hcwg1XH2nit1uHmXVupzB9dDz3ZM2l
 GCwApdRdaaUVDt5Ud2ljqIWZa18Qf/5/HD8MdPXpmotDOKucL6pBr/1R1XWueCU/
 ejRs/qy3EFyMWdEdfGFAMCa0ZvHbPmsJmvB/EgkyUnuJj77ptA0jNo1jtzSfEyis
 53x4ffWnIsPl9yqhk0oKerIALVUvV4A7/me2ya6tsQ5fiBX7lJ3+qwggvCkWQzw=
 =fMS2
 -----END PGP SIGNATURE-----

Merge tag 'kvm-arm-for-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-next

KVM/ARM changes for v3.20 including GICv3 emulation, dirty page logging, added
trace symbols, and adding an explicit VGIC init device control IOCTL.

Conflicts:
	arch/arm64/include/asm/kvm_arm.h
	arch/arm64/kvm/handle_exit.c
2015-01-23 13:39:51 +01:00
Borislav Petkov
cfaa790a3f kvm: Fix CR3_PCID_INVD type on 32-bit
arch/x86/kvm/emulate.c: In function ‘check_cr_write’:
arch/x86/kvm/emulate.c:3552:4: warning: left shift count >= width of type
    rsvd = CR3_L_MODE_RESERVED_BITS & ~CR3_PCID_INVD;

happens because sizeof(UL) on 32-bit is 4 bytes but we shift it 63 bits
to the left.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-21 15:59:09 +01:00
Marcelo Tosatti
54750f2cf0 KVM: x86: workaround SuSE's 2.6.16 pvclock vs masterclock issue
SuSE's 2.6.16 kernel fails to boot if the delta between tsc_timestamp
and rdtsc is larger than a given threshold:

 * If we get more than the below threshold into the future, we rerequest
 * the real time from the host again which has only little offset then
 * that we need to adjust using the TSC.
 *
 * For now that threshold is 1/5th of a jiffie. That should be good
 * enough accuracy for completely broken systems, but also give us swing
 * to not call out to the host all the time.
 */
#define PVCLOCK_DELTA_MAX ((1000000000ULL / HZ) / 5)

Disable masterclock support (which increases said delta) in case the
boot vcpu does not use MSR_KVM_SYSTEM_TIME_NEW.

Upstreams kernels which support pvclock vsyscalls (and therefore make
use of PVCLOCK_STABLE_BIT) use MSR_KVM_SYSTEM_TIME_NEW.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-20 20:38:39 +01:00
Paolo Bonzini
e108ff2f80 KVM: x86: switch to kvm_get_dirty_log_protect
We now have a generic function that does most of the work of
kvm_vm_ioctl_get_dirty_log, now use it.

Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
2015-01-16 14:40:14 +01:00
Nadav Amit
c205fb7d7d KVM: x86: #PF error-code on R/W operations is wrong
When emulating an instruction that reads the destination memory operand (i.e.,
instructions without the Mov flag in the emulator), the operand is first read.
If a page-fault is detected in this phase, the error-code which would be
delivered to the VM does not indicate that the access that caused the exception
is a write one. This does not conform with real hardware, and may cause the VM
to enter the page-fault handler twice for no reason (once for read, once for
write).

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-09 10:24:11 +01:00
Marcelo Tosatti
7c6a98dfa1 KVM: x86: add method to test PIR bitmap vector
kvm_x86_ops->test_posted_interrupt() returns true/false depending
whether 'vector' is set.

Next patch makes use of this interface.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:45:18 +01:00
Paolo Bonzini
cb5281a572 KVM: move APIC types to arch/x86/
They are not used anymore by IA64, move them away.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-18 09:39:51 +01:00
Wanpeng Li
203000993d kvm: vmx: add MSR logic for XSAVES
Add logic to get/set the XSS model-specific register.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-05 13:57:39 +01:00
Wanpeng Li
55412b2eda kvm: x86: Add kvm_x86_ops hook that enables XSAVES for guest
Expose the XSAVES feature to the guest if the kvm_x86_ops say it is
available.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-05 13:57:16 +01:00
Paolo Bonzini
2b4a273b42 kvm: x86: avoid warning about potential shift wrapping bug
cs.base is declared as a __u64 variable and vector is a u32 so this
causes a static checker warning.  The user indeed can set "sipi_vector"
to any u32 value in kvm_vcpu_ioctl_x86_set_vcpu_events(), but the
value should really have 8-bit precision only.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-24 16:53:50 +01:00
Paolo Bonzini
c9eab58f64 KVM: x86: move device assignment out of kvm_host.h
Create a new header, and hide the device assignment functions there.
Move struct kvm_assigned_dev_kernel to assigned-dev.c by modifying
arch/x86/kvm/iommu.c to take a PCI device struct.

Based on a patch by Radim Krcmar <rkrcmark@redhat.com>.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-24 16:53:50 +01:00
Radim Krčmář
c274e03af7 kvm: x86: move assigned-dev.c and iommu.c to arch/x86/
Now that ia64 is gone, we can hide deprecated device assignment in x86.

Notable changes:
 - kvm_vm_ioctl_assigned_device() was moved to x86/kvm_arch_vm_ioctl()

The easy parts were removed from generic kvm code, remaining
 - kvm_iommu_(un)map_pages() would require new code to be moved
 - struct kvm_assigned_dev_kernel depends on struct kvm_irq_ack_notifier

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-23 18:33:36 +01:00
Paolo Bonzini
6ef768fac9 kvm: x86: move ioapic.c and irq_comm.c back to arch/x86/
ia64 does not need them anymore.  Ack notifiers become x86-specific
too.

Suggested-by: Gleb Natapov <gleb@kernel.org>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-21 18:02:37 +01:00
Igor Mammedov
1d4e7e3c0b kvm: x86: increase user memory slots to 509
With the 3 private slots, this gives us 512 slots total.
Motivation for this is in addition to assigned devices
support more memory hotplug slots, where 1 slot is
used by a hotplugged memory stick.
It will allow to support upto 256 hotplug memory
slots and leave 253 slots for assigned devices and
other devices that use them.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-14 10:02:40 +01:00
Nadav Amit
9d88fca71a KVM: x86: MOV to CR3 can set bit 63
Although Intel SDM mentions bit 63 is reserved, MOV to CR3 can have bit 63 set.
As Intel SDM states in section 4.10.4 "Invalidation of TLBs and
Paging-Structure Caches": " MOV to CR3. ... If CR4.PCIDE = 1 and bit 63 of the
instruction’s source operand is 0 ..."

In other words, bit 63 is not reserved. KVM emulator currently consider bit 63
as reserved. Fix it.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07 15:44:07 +01:00
Nadav Amit
82b32774c2 KVM: x86: Breakpoints do not consider CS.base
x86 debug registers hold a linear address. Therefore, breakpoints detection
should consider CS.base, and check whether instruction linear address equals
(CS.base + RIP). This patch introduces a function to evaluate RIP linear
address and uses it for breakpoints detection.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07 15:44:04 +01:00
Nadav Amit
16f8a6f979 KVM: vmx: Unavailable DR4/5 is checked before CPL
If DR4/5 is accessed when it is unavailable (since CR4.DE is set), then #UD
should be generated even if CPL>0. This is according to Intel SDM Table 6-2:
"Priority Among Simultaneous Exceptions and Interrupts".

Note, that this may happen on the first DR access, even if the host does not
sets debug breakpoints. Obviously, it occurs when the host debugs the guest.

This patch moves the DR4/5 checks from __kvm_set_dr/_kvm_get_dr to handle_dr.
The emulator already checks DR4/5 availability in check_dr_read. Nested
virutalization related calls to kvm_set_dr/kvm_get_dr would not like to inject
exceptions to the guest.

As for SVM, the patch follows the previous logic as much as possible. Anyhow,
it appears the DR interception code might be buggy - even if the DR access
may cause an exception, the instruction is skipped.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03 12:07:26 +01:00
Nadav Amit
394457a928 KVM: x86: some apic broadcast modes does not work
KVM does not deliver x2APIC broadcast messages with physical mode.  Intel SDM
(10.12.9 ICR Operation in x2APIC Mode) states: "A destination ID value of
FFFF_FFFFH is used for broadcast of interrupts in both logical destination and
physical destination modes."

In addition, the local-apic enables cluster mode broadcast. As Intel SDM
10.6.2.2 says: "Broadcast to all local APICs is achieved by setting all
destination bits to one." This patch enables cluster mode broadcast.

The fix tries to combine broadcast in different modes through a unified code.

One rare case occurs when the source of IPI has its APIC disabled.  In such
case, the source can still issue IPIs, but since the source is not obliged to
have the same LAPIC mode as the enabled ones, we cannot rely on it.
Since it is a rare case, it is unoptimized and done on the slow-path.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>
[As per Radim's review, use unsigned int for X2APIC_BROADCAST, return bool from
 kvm_apic_broadcast. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03 12:07:22 +01:00
Andy Honig
8b3c3104c3 KVM: x86: Prevent host from panicking on shared MSR writes.
The previous patch blocked invalid writes directly when the MSR
is written.  As a precaution, prevent future similar mistakes by
gracefulling handle GPs caused by writes to shared MSRs.

Cc: stable@vger.kernel.org
Signed-off-by: Andrew Honig <ahonig@google.com>
[Remove parts obsoleted by Nadav's patch. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-24 13:21:08 +02:00
Nadav Amit
854e8bb1aa KVM: x86: Check non-canonical addresses upon WRMSR
Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is
written to certain MSRs. The behavior is "almost" identical for AMD and Intel
(ignoring MSRs that are not implemented in either architecture since they would
anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if
non-canonical address is written on Intel but not on AMD (which ignores the top
32-bits).

Accordingly, this patch injects a #GP on the MSRs which behave identically on
Intel and AMD.  To eliminate the differences between the architecutres, the
value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned to
canonical value before writing instead of injecting a #GP.

Some references from Intel and AMD manuals:

According to Intel SDM description of WRMSR instruction #GP is expected on
WRMSR "If the source register contains a non-canonical address and ECX
specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP."

According to AMD manual instruction manual:
LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the
LSTAR and CSTAR registers.  If an RIP written by WRMSR is not in canonical
form, a general-protection exception (#GP) occurs."
IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the
base field must be in canonical form or a #GP fault will occur."
IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must
be in canonical form."

This patch fixes CVE-2014-3610.

Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-24 13:21:08 +02:00
Tang Chen
c24ae0dcd3 kvm: x86: Unpin and remove kvm_arch->apic_access_page
In order to make the APIC access page migratable, stop pinning it in
memory.

And because the APIC access page is not pinned in memory, we can
remove kvm_arch->apic_access_page.  When we need to write its
physical address into vmcs, we use gfn_to_page() to get its page
struct, which is needed to call page_to_phys(); the page is then
immediately unpinned.

Suggested-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:08:01 +02:00
Tang Chen
4256f43f9f kvm: x86: Add request bit to reload APIC access page address
Currently, the APIC access page is pinned by KVM for the entire life
of the guest.  We want to make it migratable in order to make memory
hot-unplug available for machines that run KVM.

This patch prepares to handle this in generic code, through a new
request bit (that will be set by the MMU notifier) and a new hook
that is called whenever the request bit is processed.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:08:00 +02:00
Tang Chen
fe71557afb kvm: Add arch specific mmu notifier for page invalidation
This will be used to let the guest run while the APIC access page is
not pinned.  Because subsequent patches will fill in the function
for x86, place the (still empty) x86 implementation in the x86.c file
instead of adding an inline function in kvm_host.h.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:59 +02:00
Andres Lagar-Cavilla
5712846808 kvm: Fix page ageing bugs
1. We were calling clear_flush_young_notify in unmap_one, but we are
within an mmu notifier invalidate range scope. The spte exists no more
(due to range_start) and the accessed bit info has already been
propagated (due to kvm_pfn_set_accessed). Simply call
clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped
as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
This required expanding the interface of the clear_flush_young mmu
notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
the access bit by blowing the spte. This requires proper synchronizing
with MMU notifier consumers, like every other removal of spte's does.

Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:58 +02:00
Liang Chen
77c3913b74 KVM: x86: directly use kvm_make_request again
A one-line wrapper around kvm_make_request is not particularly
useful. Replace kvm_mmu_flush_tlb() with kvm_make_request().

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:51 +02:00
Tang Chen
a255d4795f kvm: Remove ept_identity_pagetable from struct kvm_arch.
kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
it is never used to refer to the page at all.

In vcpu initialization, it indicates two things:
1. indicates if ept page is allocated
2. indicates if a memory slot for identity page is initialized

Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
identity pagetable is initialized. So we can remove ept_identity_pagetable.

NOTE: In the original code, ept identity pagetable page is pinned in memroy.
      As a result, it cannot be migrated/hot-removed. After this patch, since
      kvm_arch->ept_identity_pagetable is removed, ept identity pagetable page
      is no longer pinned in memory. And it can be migrated/hot-removed.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-17 13:10:11 +02:00
Paolo Bonzini
54987b7afa KVM: x86: propagate exception from permission checks on the nested page fault
Currently, if a permission error happens during the translation of
the final GPA to HPA, walk_addr_generic returns 0 but does not fill
in walker->fault.  To avoid this, add an x86_exception* argument
to the translate_gpa function, and let it fill in walker->fault.
The nested_page_fault field will be true, since the walk_mmu is the
nested_mmu and translate_gpu instead operates on the "outer" (NPT)
instance.

Reported-by: Valentine Sinitsyn <valentine.sinitsyn@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-05 12:01:13 +02:00
Paolo Bonzini
ef54bcfeea KVM: x86: skip writeback on injection of nested exception
If a nested page fault happens during emulation, we will inject a vmexit,
not a page fault.  However because writeback happens after the injection,
we will write ctxt->eip from L2 into the L1 EIP.  We do not write back
if an instruction caused an interception vmexit---do the same for page
faults.

Suggested-by: Gleb Natapov <gleb@kernel.org>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-05 12:01:06 +02:00
David Matlack
56f17dd3fb kvm: x86: fix stale mmio cache bug
The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
up to userspace:

(1) Guest accesses gpa X without a memory slot. The gfn is cached in
struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
the SPTE write-execute-noread so that future accesses cause
EPT_MISCONFIGs.

(2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
covering the page just accessed.

(3) Guest attempts to read or write to gpa X again. On Intel, this
generates an EPT_MISCONFIG. The memory slot generation number that
was incremented in (2) would normally take care of this but we fast
path mmio faults through quickly_check_mmio_pf(), which only checks
the per-vcpu mmio cache. Since we hit the cache, KVM passes a
KVM_EXIT_MMIO up to userspace.

This patch fixes the issue by using the memslot generation number
to validate the mmio cache.

Cc: stable@vger.kernel.org
Signed-off-by: David Matlack <dmatlack@google.com>
[xiaoguangrong: adjust the code to make it simpler for stable-tree fix.]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Tested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-03 10:03:42 +02:00
Radim Krčmář
13a34e067e KVM: remove garbage arg to *hardware_{en,dis}able
In the beggining was on_each_cpu(), which required an unused argument to
kvm_arch_ops.hardware_{en,dis}able, but this was soon forgotten.

Remove unnecessary arguments that stem from this.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-29 16:35:55 +02:00
Paolo Bonzini
656473003b KVM: forward declare structs in kvm_types.h
Opaque KVM structs are useful for prototypes in asm/kvm_host.h, to avoid
"'struct foo' declared inside parameter list" warnings (and consequent
breakage due to conflicting types).

Move them from individual files to a generic place in linux/kvm_types.h.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-29 16:35:53 +02:00
Radim Krčmář
ae97a3b818 KVM: x86: introduce sched_in to kvm_x86_ops
sched_in preempt notifier is available for x86, allow its use in
specific virtualization technlogies as well.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-21 18:45:22 +02:00
Wanpeng Li
4473b570a7 KVM: x86: drop fpu_activate hook
The only user of the fpu_activate hook was dropped in commit
2d04a05bd7 (KVM: x86 emulator: emulate CLTS internally, 2011-04-20).
vmx_fpu_activate and svm_fpu_activate are still called on #NM (and for
Intel CLTS), but never from common code; hence, there's no need for
a hook.

Reviewed-by: Yang Zhang <yang.z.zhang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-19 15:12:28 +02:00
Paolo Bonzini
0d234daf7e Revert "KVM: x86: Increase the number of fixed MTRR regs to 10"
This reverts commit 682367c494,
which causes 32-bit SMP Windows 7 guests to panic.

SeaBIOS has a limit on the number of MTRRs that it can handle,
and this patch exceeded the limit.  Better revert it.
Thanks to Nadav Amit for debugging the cause.

Cc: stable@nongnu.org
Reported-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-19 15:12:28 +02:00
Linus Torvalds
8533ce7271 These are the x86, MIPS and s390 changes; PPC and ARM will come in a
few days.
 
 MIPS and s390 have little going on this release; just bugfixes, some
 small, some larger.
 
 The highlights for x86 are nested VMX improvements (Jan Kiszka), optimizations
 for old processor (up to Nehalem, by me and Bandan Das), and a lot of x86
 emulator bugfixes (Nadav Amit).
 
 Stephen Rothwell reported a trivial conflict with the tracing branch.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJT300XAAoJEBvWZb6bTYby3V8QAJz+XyajnhJ8wH55Vxczz22L
 i2gtUGmBLhEXsBcaVKO4BBfek88lLzg0SGLjfW5wCMQmKtxVlrwTCXNkBoPGjapd
 NwHtWkMKym44PDhRovn7zkSumkxC43uFIBR/ebrhP6Bvhh9s+MnkQUxfw9ILB+YV
 EeKyEG8sSgxFCciuHbp3mIXpDcO6r/ldy6I7009OdyhLoMY+Kvmk7kRe9wtAivdg
 CGJi60QvGOn2RGRPOCEtF6UWr8Ae8fe1t84o0hkXPv/j3jtabzAatXKJa4dYNbIs
 7Mp4NQpxaGV6rq3WCYVeZRxGs+UReGDAS3Il4Z8C9eTOTooSfxdVr8acpM8PY6I8
 UmLT6ECLGycc4ELXrETtR+QLmiXACyJqyVxz4aiLV3kWSWfamKD3hBeQK9NizNcE
 VoPDl+PyISvR1tW4KstBuzfUWAEXi+gO78cqqFr/VW6cl7HKpA1DFQaPfGkYKDae
 2CPwcLwI5/M6RtSgkyXTkEqNZLc2BjldqSeM1lmWjhZVW56X2iqePUL46Vab3Yvt
 U+sELtwEE560NLN3hbaHUsLR1tcUix5w8vTzcXPxgoHQBszHCcAZTWd1XHulr64F
 rp/cangqtkPKcu5j1mNhQs38oLjHI1MUsbQrqFoD4tmHjQ75iXHRFzYGoIVKXyHG
 AnGbQzJzBcdAANhm3LW0
 =UXxV
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM changes from Paolo Bonzini:
 "These are the x86, MIPS and s390 changes; PPC and ARM will come in a
  few days.

  MIPS and s390 have little going on this release; just bugfixes, some
  small, some larger.

  The highlights for x86 are nested VMX improvements (Jan Kiszka),
  optimizations for old processor (up to Nehalem, by me and Bandan Das),
  and a lot of x86 emulator bugfixes (Nadav Amit).

  Stephen Rothwell reported a trivial conflict with the tracing branch"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (104 commits)
  x86/kvm: Resolve shadow warnings in macro expansion
  KVM: s390: rework broken SIGP STOP interrupt handling
  KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table
  KVM: vmx: remove duplicate vmx_mpx_supported() prototype
  KVM: s390: Fix memory leak on busy SIGP stop
  x86/kvm: Resolve shadow warning from min macro
  kvm: Resolve missing-field-initializers warnings
  Replace NR_VMX_MSR with its definition
  KVM: x86: Assertions to check no overrun in MSR lists
  KVM: x86: set rflags.rf during fault injection
  KVM: x86: Setting rflags.rf during rep-string emulation
  KVM: x86: DR6/7.RTM cannot be written
  KVM: nVMX: clean up nested_release_vmcs12 and code around it
  KVM: nVMX: fix lifetime issues for vmcs02
  KVM: x86: Defining missing x86 vectors
  KVM: x86: emulator injects #DB when RFLAGS.RF is set
  KVM: x86: Cleanup of rflags.rf cleaning
  KVM: x86: Clear rflags.rf on emulated instructions
  KVM: x86: popf emulation should not change RF
  KVM: x86: Clearing rflags.rf upon skipped emulated instruction
  ...
2014-08-04 12:16:46 -07:00
Nadav Amit
6f43ed01e8 KVM: x86: DR6/7.RTM cannot be written
Haswell and newer Intel CPUs have support for RTM, and in that case DR6.RTM is
not fixed to 1 and DR7.RTM is not fixed to zero. That is not the case in the
current KVM implementation. This bug is apparent only if the MOV-DR instruction
is emulated or the host also debugs the guest.

This patch is a partial fix which enables DR6.RTM and DR7.RTM to be cleared and
set respectively. It also sets DR6.RTM upon every debug exception. Obviously,
it is not a complete fix, as debugging of RTM is still unsupported.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-21 17:17:52 +02:00
Paolo Bonzini
37ccdcbe07 KVM: x86: return all bits from get_interrupt_shadow
For the next patch we will need to know the full state of the
interrupt shadow; we will then set KVM_REQ_EVENT when one bit
is cleared.

However, right now get_interrupt_shadow only returns the one
corresponding to the emulated instruction, or an unconditional
0 if the emulated instruction does not have an interrupt shadow.
This is confusing and does not allow us to check for cleared
bits as mentioned above.

Clean the callback up, and modify toggle_interruptibility to
match the comment above the call.  As a small result, the
call to set_interrupt_shadow will be skipped in the common
case where int_shadow == 0 && mask == 0.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11 09:13:56 +02:00
Tomasz Grabiec
0d3da0d26e KVM: x86: fix TSC matching
I've observed kvmclock being marked as unstable on a modern
single-socket system with a stable TSC and qemu-1.6.2 or qemu-2.0.0.

The culprit was failure in TSC matching because of overflow of
kvm_arch::nr_vcpus_matched_tsc in case there were multiple TSC writes
in a single synchronization cycle.

Turns out that qemu does multiple TSC writes during init, below is the
evidence of that (qemu-2.0.0):

The first one:

 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
 0xffffffffa04cfd6b : kvm_arch_vcpu_postcreate+0x4b/0x80 [kvm]
 0xffffffffa04b8188 : kvm_vm_ioctl+0x418/0x750 [kvm]

The second one:

 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
 0xffffffffa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel]
 0xffffffffa04be83b : do_set_msr+0x3b/0x60 [kvm]
 0xffffffffa04c10a8 : msr_io+0xc8/0x160 [kvm]
 0xffffffffa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm]
 0xffffffffa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm]

 #0  kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780
 #1  kvm_put_msrs at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270
 #2  kvm_arch_put_registers at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909
 #3  kvm_cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1641
 #4  cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:330
 #5  cpu_synchronize_all_post_init () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:521
 #6  main at /build/buildd/qemu-2.0.0+dfsg/vl.c:4390

The third one:

 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
 0xffffffffa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel]
 0xffffffffa04be83b : do_set_msr+0x3b/0x60 [kvm]
 0xffffffffa04c10a8 : msr_io+0xc8/0x160 [kvm]
 0xffffffffa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm]
 0xffffffffa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm]

 #0  kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780
 #1  kvm_put_msrs  at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270
 #2  kvm_arch_put_registers  at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909
 #3  kvm_cpu_synchronize_post_reset  at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1635
 #4  cpu_synchronize_post_reset  at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:323
 #5  cpu_synchronize_all_post_reset () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:512
 #6  main  at /build/buildd/qemu-2.0.0+dfsg/vl.c:4482

The fix is to count each vCPU only once when matched, so that
nr_vcpus_matched_tsc holds the size of the matched set. This is
achieved by reusing generation counters. Every vCPU with
this_tsc_generation == cur_tsc_generation is in the matched set. The
match set is cleared by setting cur_tsc_generation to a value which no
other vCPU is set to (by incrementing it).

I needed to bump up the counter size form u8 to u64 to ensure it never
overflows. Otherwise in cases TSC is not written the same number of
times on each vCPU the counter could overflow and incorrectly indicate
some vCPUs as being in the matched set. This scenario seems unlikely
but I'm not sure if it can be disregarded.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-09 18:09:57 +02:00
Paolo Bonzini
7cb060a91c KVM: x86: preserve the high 32-bits of the PAT register
KVM does not really do much with the PAT, so this went unnoticed for a
long time.  It is exposed however if you try to do rdmsr on the PAT
register.

Reported-by: Valentine Sinitsyn <valentine.sinitsyn@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 13:43:44 +02:00
Nadav Amit
682367c494 KVM: x86: Increase the number of fixed MTRR regs to 10
Recent Intel CPUs have 10 variable range MTRRs. Since operating systems
sometime make assumptions on CPUs while they ignore capability MSRs, it is
better for KVM to be consistent with recent CPUs. Reporting more MTRRs than
actually supported has no functional implications.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 11:41:14 +02:00
Nadav Amit
67f4d4288c KVM: x86: rdpmc emulation checks the counter incorrectly
The rdpmc emulation checks that the counter (ECX) is not higher than 2, without
taking into considerations bits 30:31 role (e.g., bit 30 marks whether the
counter is fixed). The fix uses the pmu information for checking the validity
of the pmu counter.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-18 17:46:18 +02:00
Paolo Bonzini
ae9fedc793 KVM: x86: get CPL from SS.DPL
CS.RPL is not equal to the CPL in the few instructions between
setting CR0.PE and reloading CS.  And CS.DPL is also not equal
to the CPL for conforming code segments.

However, SS.DPL *is* always equal to the CPL except for the weird
case of SYSRET on AMD processors, which sets SS.DPL=SS.RPL from the
value in the STAR MSR, but force CPL=3 (Intel instead forces
SS.DPL=SS.RPL=CPL=3).

So this patch:

- modifies SVM to update the CPL from SS.DPL rather than CS.RPL;
the above case with SYSRET is not broken further, and the way
to fix it would be to pass the CPL to userspace and back

- modifies VMX to always return the CPL from SS.DPL (except
forcing it to 0 if we are emulating real mode via vm86 mode;
in vm86 mode all DPLs have to be 3, but real mode does allow
privileged instructions).  It also removes the CPL cache,
which becomes a duplicate of the SS access rights cache.

This fixes doing KVM_IOCTL_SET_SREGS exactly after setting
CR0.PE=1 but before CS has been reloaded.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-22 17:47:17 +02:00
Nadav Amit
346874c950 KVM: x86: Fix CR3 reserved bits
According to Intel specifications, PAE and non-PAE does not have any reserved
bits.  In long-mode, regardless to PCIDE, only the high bits (above the
physical address) are reserved.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-23 17:46:57 -03:00
Feng Wu
56d6efc2de KVM: Remove SMAP bit from CR4_RESERVED_BITS
This patch removes SMAP bit from CR4_RESERVED_BITS.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-14 17:50:33 -03:00
Paolo Bonzini
c77fb5fe6f KVM: x86: Allow the guest to run with dirty debug registers
When not running in guest-debug mode, the guest controls the debug
registers and having to take an exit for each DR access is a waste
of time.  If the guest gets into a state where each context switch
causes DR to be saved and restored, this can take away as much as 40%
of the execution time from the guest.

After this patch, VMX- and SVM-specific code can set a flag in
switch_db_regs, telling vcpu_enter_guest that on the next exit the debug
registers might be dirty and need to be reloaded (syncing will be taken
care of by a new callback in kvm_x86_ops).  This flag can be set on the
first access to a debug registers, so that multiple accesses to the
debug registers only cause one vmexit.

Note that since the guest will be able to read debug registers and
enable breakpoints in DR7, we need to ensure that they are synchronized
on entry to the guest---including DR6 that was not synced before.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 10:46:02 +01:00
Paolo Bonzini
360b948d88 KVM: x86: change vcpu->arch.switch_db_regs to a bit mask
The next patch will add another bit that we can test with the
same "if".

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 10:46:02 +01:00
Jan Kiszka
c9a7953f09 KVM: x86: Remove return code from enable_irq/nmi_window
It's no longer possible to enter enable_irq_window in guest mode when
L1 intercepts external interrupts and we are entering L2. This is now
caught in vcpu_enter_guest. So we can remove the check from the VMX
version of enable_irq_window, thus the need to return an error code from
both enable_irq_window and enable_nmi_window.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 08:41:47 +01:00
Jan Kiszka
b6b8a1451f KVM: nVMX: Rework interception of IRQs and NMIs
Move the check for leaving L2 on pending and intercepted IRQs or NMIs
from the *_allowed handler into a dedicated callback. Invoke this
callback at the relevant points before KVM checks if IRQs/NMIs can be
injected. The callback has the task to switch from L2 to L1 if needed
and inject the proper vmexit events.

The rework fixes L2 wakeups from HLT and provides the foundation for
preemption timer emulation.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 08:41:45 +01:00
Andrew Jones
332967a3ea x86: kvm: introduce periodic global clock updates
commit 0061d53daf introduced a mechanism to execute a global clock
update for a vm. We can apply this periodically in order to propagate
host NTP corrections. Also, if all vcpus of a vm are pinned, then
without an additional trigger, no guest NTP corrections can propagate
either, as the current trigger is only vcpu cpu migration.

Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-04 11:50:54 +01:00
Andrew Jones
7e44e4495a x86: kvm: rate-limit global clock updates
When we update a vcpu's local clock it may pick up an NTP correction.
We can't wait an indeterminate amount of time for other vcpus to pick
up that correction, so commit 0061d53daf introduced a global clock
update. However, we can't request a global clock update on every vcpu
load either (which is what happens if the tsc is marked as unstable).
The solution is to rate-limit the global clock updates. Marcelo
calculated that we should delay the global clock updates no more
than 0.1s as follows:

Assume an NTP correction c is applied to one vcpu, but not the other,
then in n seconds the delta of the vcpu system_timestamps will be
c * n. If we assume a correction of 500ppm (worst-case), then the two
vcpus will diverge 50us in 0.1s, which is a considerable amount.

Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-04 11:50:47 +01:00
Liu, Jinsong
da8999d318 KVM: x86: Intel MPX vmx and msr handle
From caddc009a6d2019034af8f2346b2fd37a81608d0 Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Mon, 24 Feb 2014 18:11:11 +0800
Subject: [PATCH v5 1/3] KVM: x86: Intel MPX vmx and msr handle

This patch handle vmx and msr of Intel MPX feature.

Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-24 12:14:00 +01:00
Marcelo Tosatti
4f34d683e5 KVM: x86: remove unused last_kernel_ns variable
Remove unused last_kernel_ns variable.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-04 04:20:42 +01:00
Jan Kiszka
73aaf249ee KVM: SVM: Fix reading of DR6
In contrast to VMX, SVM dose not automatically transfer DR6 into the
VCPU's arch.dr6. So if we face a DR6 read, we must consult a new vendor
hook to obtain the current value. And as SVM now picks the DR6 state
from its VMCB, we also need a set callback in order to write updates of
DR6 back.

Fixes a regression of 020df0794f.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:10 +01:00
Vadim Rozenfeld
e984097b55 add support for Hyper-V reference time counter
Signed-off: Peter Lieven <pl@kamp.de>
Signed-off: Gleb Natapov
Signed-off: Vadim Rozenfeld <vrozenfe@redhat.com>

After some consideration I decided to submit only Hyper-V reference
counters support this time. I will submit iTSC support as a separate
patch as soon as it is ready.

v1 -> v2
1. mark TSC page dirty as suggested by
    Eric Northup <digitaleric@google.com> and Gleb
2. disable local irq when calling get_kernel_ns,
    as it was done by Peter Lieven <pl@amp.de>
3. move check for TSC page enable from second patch
    to this one.

v3 -> v4
    Get rid of ref counter offset.

v4 -> v5
    replace __copy_to_user with kvm_write_guest
    when updateing iTSC page.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:08 +01:00
Josh Triplett
a890b6fefd kvm: Delete prototype for non-existent function kvm_check_iopl
The prototype for kvm_check_iopl appeared in commit
f850e2e603 ("KVM: x86 emulator: Check IOPL
level during io instruction emulation"), but the function never actually
existed.  Remove the prototype.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-11-06 11:07:09 +02:00
Josh Triplett
35a5121b58 kvm: Delete prototype for non-existent function complete_pio
complete_pio ceased to exist in commit
7972995b0c ("KVM: x86 emulator: Move
string pio emulation into emulator.c"), but the prototype remained.
Remove its prototype.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-11-06 11:06:32 +02:00
Alex Williamson
e0f0bbc527 kvm: Create non-coherent DMA registeration
We currently use some ad-hoc arch variables tied to legacy KVM device
assignment to manage emulation of instructions that depend on whether
non-coherent DMA is present.  Create an interface for this, adapting
legacy KVM device assignment and adding VFIO via the KVM-VFIO device.
For now we assume that non-coherent DMA is possible any time we have a
VFIO group.  Eventually an interface can be developed as part of the
VFIO external user interface to query the coherency of a group.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-30 19:02:23 +01:00
Alex Williamson
d96eb2c6f4 kvm/x86: Convert iommu_flags to iommu_noncoherent
Default to operating in coherent mode.  This simplifies the logic when
we switch to a model of registering and unregistering noncoherent I/O
with KVM.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-30 19:02:13 +01:00
Christoffer Dall
6d9d41e574 KVM: Move gfn_to_index to x86 specific code
The gfn_to_index function relies on huge page defines which either may
not make sense on systems that don't support huge pages or are defined
in an unconvenient way for other architectures.  Since this is
x86-specific, move the function to arch/x86/include/asm/kvm_host.h.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-10-14 10:11:44 +03:00
Paolo Bonzini
8a3c1a3347 KVM: mmu: change useless int return types to void
kvm_mmu initialization is mostly filling in function pointers, there is
no way for it to fail.  Clean up unused return values.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-10-03 15:44:02 +03:00
Paolo Bonzini
d8d173dab2 KVM: mmu: remove uninteresting MMU "new_cr3" callbacks
The new_cr3 MMU callback has been a wrapper for mmu_free_roots since commit
e676505 (KVM: MMU: Force cr3 reload with two dimensional paging on mov
cr3 emulation, 2012-07-08).

The commit message mentioned that "mmu_free_roots() is somewhat of an overkill,
but fixing that is more complicated and will be done after this minimal fix".
One year has passed, and no one really felt the need to do a different fix.
Wrap the call with a kvm_mmu_new_cr3 function for clarity, but remove the
callback.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-10-03 15:43:59 +03:00
Paolo Bonzini
206260941f KVM: mmu: remove uninteresting MMU "free" callbacks
The free MMU callback has been a wrapper for mmu_free_roots since mmu_free_roots
itself was introduced (commit 17ac10a, [PATCH] KVM: MU: Special treatment
for shadow pae root pages, 2007-01-05), and has always been the same for all
MMU cases.  Remove the indirection as it is useless.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-10-03 15:43:56 +03:00
Paolo Bonzini
4344ee981e KVM: x86: only copy XSAVE state for the supported features
This makes the interface more deterministic for userspace, which can expect
(after configuring only the features it supports) to get exactly the same
state from the kernel, independent of the host CPU and kernel version.

Suggested-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-10-03 12:29:09 +03:00
Paolo Bonzini
d7876f1be4 KVM: x86: prevent setting unsupported XSAVE states
A guest can still attempt to save and restore XSAVE states even if they
have been masked in CPUID leaf 0Dh.  This usually is not visible to
the guest, but is still wrong: "Any attempt to set a reserved bit (as
determined by the contents of EAX and EDX after executing CPUID with
EAX=0DH, ECX= 0H) in XCR0 for a given processor will result in a #GP
exception".

The patch also performs the same checks as __kvm_set_xcr in KVM_SET_XSAVE.
This catches migration from newer to older kernel/processor before the
guest starts running.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-10-03 12:29:07 +03:00
Srivatsa Vaddagiri
6aef266c6e kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
kvm_hc_kick_cpu allows the calling vcpu to kick another vcpu out of halt state.
the presence of these hypercalls is indicated to guest via
kvm_feature_pv_unhalt.

Fold pv_unhalt flag into GET_MP_STATE ioctl to aid migration
During migration, any vcpu that got kicked but did not become runnable
(still in halted state) should be runnable after migration.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
[Raghu: Apic related changes, folding pvunhalted into vcpu_runnable
 Added flags for future use (suggested by Gleb)]
[ Raghu: fold pv_unhalt flag as suggested by Eric Northup]
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-08-26 12:47:09 +03:00
Yang Zhang
25d92081ae nEPT: Add nEPT violation/misconfigration support
Inject nEPT fault to L1 guest. This patch is original from Xinhao.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:40 +02:00
Paolo Bonzini
ac0a48c39a KVM: x86: rename EMULATE_DO_MMIO
The next patch will reuse it for other userspace exits than MMIO,
namely debug events.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-29 09:01:14 +02:00
Andi Kleen
103af0a987 perf, kvm: Support the in_tx/in_tx_cp modifiers in KVM arch perfmon emulation v5
[KVM maintainers:
The underlying support for this is in perf/core now. So please merge
this patch into the KVM tree.]

This is not arch perfmon, but older CPUs will just ignore it. This makes
it possible to do at least some TSX measurements from a KVM guest

v2: Various fixes to address review feedback
v3: Ignore the bits when no CPUID. No #GP. Force raw events with TSX bits.
v4: Use reserved bits for #GP
v5: Remove obsolete argument
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-19 18:24:45 +02:00
Linus Torvalds
fe489bf450 KVM fixes for 3.11
On the x86 side, there are some optimizations and documentation updates.
 The big ARM/KVM change for 3.11, support for AArch64, will come through
 Catalin Marinas's tree.  s390 and PPC have misc cleanups and bugfixes.
 
 There is a conflict due to "s390/pgtable: fix ipte notify bit" having
 entered 3.10 through Martin Schwidefsky's s390 tree.  This pull request
 has additional changes on top, so this tree's version is the correct one.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJR0oU6AAoJEBvWZb6bTYbynnsP/RSUrrHrA8Wu1tqVfAKu+1y5
 6OIihqZ9x11/YMaNofAfv86jqxFu0/j7CzMGphNdjzujqKI+Q1tGe7oiVCmKzoG+
 UvSctWsz0lpllgBtnnrm5tcfmG6rrddhLtpA7m320+xCVx8KV5P4VfyHZEU+Ho8h
 ziPmb2mAQ65gBNX6nLHEJ3ITTgad6gt4NNbrKIYpyXuWZQJypzaRqT/vpc4md+Ed
 dCebMXsL1xgyb98EcnOdrWH1wV30MfucR7IpObOhXnnMKeeltqAQPvaOlKzZh4dK
 +QfxJfdRZVS0cepcxzx1Q2X3dgjoKQsHq1nlIyz3qu1vhtfaqBlixLZk0SguZ/R9
 1S1YqucZiLRO57RD4q0Ak5oxwobu18ZoqJZ6nledNdWwDe8bz/W2wGAeVty19ky0
 qstBdM9jnwXrc0qrVgZp3+s5dsx3NAm/KKZBoq4sXiDLd/yBzdEdWIVkIrU3X9wU
 3X26wOmBxtsB7so/JR7ciTsQHelmLicnVeXohAEP9CjIJffB81xVXnXs0P0SYuiQ
 RzbSCwjPzET4JBOaHWT0Dhv0DTS/EaI97KzlN32US3Bn3WiLlS1oDCoPFoaLqd2K
 LxQMsXS8anAWxFvexfSuUpbJGPnKSidSQoQmJeMGBa9QhmZCht3IL16/Fb641ToN
 xBohzi49L9FDbpOnTYfz
 =1zpG
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "On the x86 side, there are some optimizations and documentation
  updates.  The big ARM/KVM change for 3.11, support for AArch64, will
  come through Catalin Marinas's tree.  s390 and PPC have misc cleanups
  and bugfixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (87 commits)
  KVM: PPC: Ignore PIR writes
  KVM: PPC: Book3S PR: Invalidate SLB entries properly
  KVM: PPC: Book3S PR: Allow guest to use 1TB segments
  KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match
  KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry
  KVM: PPC: Book3S PR: Fix proto-VSID calculations
  KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL
  KVM: Fix RTC interrupt coalescing tracking
  kvm: Add a tracepoint write_tsc_offset
  KVM: MMU: Inform users of mmio generation wraparound
  KVM: MMU: document fast invalidate all mmio sptes
  KVM: MMU: document fast invalidate all pages
  KVM: MMU: document fast page fault
  KVM: MMU: document mmio page fault
  KVM: MMU: document write_flooding_count
  KVM: MMU: document clear_spte_count
  KVM: MMU: drop kvm_mmu_zap_mmio_sptes
  KVM: MMU: init kvm generation close to mmio wrap-around value
  KVM: MMU: add tracepoint for check_mmio_spte
  KVM: MMU: fast invalidate all mmio sptes
  ...
2013-07-03 13:21:40 -07:00
Xiao Guangrong
f6f8adeef5 KVM: MMU: document fast invalidate all pages
Document it to Documentation/virtual/kvm/mmu.txt

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:20:47 +03:00
Xiao Guangrong
0cbf8e437b KVM: MMU: document write_flooding_count
Document write_flooding_count to Documentation/virtual/kvm/mmu.txt

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:20:43 +03:00
Xiao Guangrong
accaefe07d KVM: MMU: document clear_spte_count
Document it to Documentation/virtual/kvm/mmu.txt

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:20:42 +03:00
Xiao Guangrong
a8eca9dcc6 KVM: MMU: drop kvm_mmu_zap_mmio_sptes
Drop kvm_mmu_zap_mmio_sptes and use kvm_mmu_invalidate_zap_all_pages
instead to handle mmio generation number overflow

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:20:40 +03:00
Xiao Guangrong
f8f559422b KVM: MMU: fast invalidate all mmio sptes
This patch tries to introduce a very simple and scale way to invalidate
all mmio sptes - it need not walk any shadow pages and hold mmu-lock

KVM maintains a global mmio valid generation-number which is stored in
kvm->memslots.generation and every mmio spte stores the current global
generation-number into his available bits when it is created

When KVM need zap all mmio sptes, it just simply increase the global
generation-number. When guests do mmio access, KVM intercepts a MMIO #PF
then it walks the shadow page table and get the mmio spte. If the
generation-number on the spte does not equal the global generation-number,
it will go to the normal #PF handler to update the mmio spte

Since 19 bits are used to store generation-number on mmio spte, we zap all
mmio sptes when the number is round

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:20:36 +03:00
H. Peter Anvin
afcbf13fa6 x86: Rename X86_CR4_RDWRGSFS to X86_CR4_FSGSBASE
Rename X86_CR4_RDWRGSFS to X86_CR4_FSGSBASE to match the SDM.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Link: http://lkml.kernel.org/n/tip-buq1evi5dpykxx7ak6amaam0@git.kernel.org
2013-06-25 16:26:06 -07:00
Xiao Guangrong
365c886860 KVM: MMU: reclaim the zapped-obsolete page first
As Marcelo pointed out that
| "(retention of large number of pages while zapping)
| can be fatal, it can lead to OOM and host crash"

We introduce a list, kvm->arch.zapped_obsolete_pages, to link all
the pages which are deleted from the mmu cache but not actually
freed. When page reclaiming is needed, we always zap this kind of
pages first.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-05 12:33:33 +03:00
Xiao Guangrong
5304b8d37c KVM: MMU: fast invalidate all pages
The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
walk and zap all shadow pages one by one, also it need to zap all guest
page's rmap and all shadow page's parent spte list. Particularly, things
become worse if guest uses more memory or vcpus. It is not good for
scalability

In this patch, we introduce a faster way to invalidate all shadow pages.
KVM maintains a global mmu invalid generation-number which is stored in
kvm->arch.mmu_valid_gen and every shadow page stores the current global
generation-number into sp->mmu_valid_gen when it is created

When KVM need zap all shadow pages sptes, it just simply increase the
global generation-number then reload root shadow pages on all vcpus.
Vcpu will create a new shadow page table according to current kvm's
generation-number. It ensures the old pages are not used any more.
Then the obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are zapped by using lock-break technique

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-05 12:32:33 +03:00
Jan Kiszka
03b28f8133 KVM: x86: Account for failing enable_irq_window for NMI window request
With VMX, enable_irq_window can now return -EBUSY, in which case an
immediate exit shall be requested before entering the guest. Account for
this also in enable_nmi_window which uses enable_irq_window in absence
of vnmi support, e.g.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-05-02 22:17:38 -03:00
Chegu Vinod
cbf6435858 KVM: x86: Increase the "hard" max VCPU limit
KVM guests today use 8bit APIC ids allowing for 256 ID's. Reserving one
ID for Broadcast interrupts should leave 255 ID's. In case of KVM there
is no need for reserving another ID for IO-APIC so the hard max limit for
VCPUS can be increased from 254 to 255. (This was confirmed by Gleb Natapov
http://article.gmane.org/gmane.comp.emulators.kvm.devel/99713  )

Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-28 13:17:48 +03:00
Gleb Natapov
064d1afaa5 Merge git://github.com/agraf/linux-2.6.git kvm-ppc-next into queue 2013-04-28 12:50:07 +03:00
Jan Kiszka
730dca42c1 KVM: x86: Rework request for immediate exit
The VMX implementation of enable_irq_window raised
KVM_REQ_IMMEDIATE_EXIT after we checked it in vcpu_enter_guest. This
caused infinite loops on vmentry. Fix it by letting enable_irq_window
signal the need for an immediate exit via its return value and drop
KVM_REQ_IMMEDIATE_EXIT.

This issue only affects nested VMX scenarios.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-28 12:44:18 +03:00
Alexander Graf
8175e5b79c KVM: Add KVM_IRQCHIP_NUM_PINS in addition to KVM_IOAPIC_NUM_PINS
The concept of routing interrupt lines to an irqchip is nothing
that is IOAPIC specific. Every irqchip has a maximum number of pins
that can be linked to irq lines.

So let's add a new define that allows us to reuse generic code for
non-IOAPIC platforms.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
2013-04-26 20:27:13 +02:00
Jan Kiszka
384bb78327 KVM: nVMX: Validate EFER values for VM_ENTRY/EXIT_LOAD_IA32_EFER
As we may emulate the loading of EFER on VM-entry and VM-exit, implement
the checks that VMX performs on the guest and host values on vmlaunch/
vmresume. Factor out kvm_valid_efer for this purpose which checks for
set reserved bits.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 12:53:42 +03:00
Yang Zhang
a20ed54d6e KVM: VMX: Add the deliver posted interrupt algorithm
Only deliver the posted interrupt when target vcpu is running
and there is no previous interrupt pending in pir.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:40 -03:00
Yang Zhang
a547c6db4d KVM: VMX: Enable acknowledge interupt on vmexit
The "acknowledge interrupt on exit" feature controls processor behavior
for external interrupt acknowledgement. When this control is set, the
processor acknowledges the interrupt controller to acquire the
interrupt vector on VM exit.

After enabling this feature, an interrupt which arrived when target cpu is
running in vmx non-root mode will be handled by vmx handler instead of handler
in idt. Currently, vmx handler only fakes an interrupt stack and jump to idt
table to let real handler to handle it. Further, we will recognize the interrupt
and only delivery the interrupt which not belong to current vcpu through idt table.
The interrupt which belonged to current vcpu will be handled inside vmx handler.
This will reduce the interrupt handle cost of KVM.

Also, interrupt enable logic is changed if this feature is turnning on:
Before this patch, hypervior call local_irq_enable() to enable it directly.
Now IF bit is set on interrupt stack frame, and will be enabled on a return from
interrupt handler if exterrupt interrupt exists. If no external interrupt, still
call local_irq_enable() to enable it.

Refer to Intel SDM volum 3, chapter 33.2.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:39 -03:00
Gleb Natapov
991eebf9f8 KVM: VMX: do not try to reexecute failed instruction while emulating invalid guest state
During invalid guest state emulation vcpu cannot enter guest mode to try
to reexecute instruction that emulator failed to emulate, so emulation
will happen again and again.  Prevent that by telling the emulator that
instruction reexecution should not be attempted.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 09:44:17 +03:00
Geoff Levand
8b415dcd76 KVM: Move kvm_rebooting declaration out of x86
The variable kvm_rebooting is a common kvm variable, so move its
declaration from arch/x86/include/asm/kvm_host.h to
include/asm/kvm_host.h.

Fixes this sparse warning when building on arm64:

  virt/kvm/kvm_main.c⚠️ symbol 'kvm_rebooting' was not declared. Should it be static?

Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-08 13:02:09 +03:00
Geoff Levand
fc1b74925f KVM: Move vm_list kvm_lock declarations out of x86
The variables vm_list and kvm_lock are common to all architectures, so
move the declarations from arch/x86/include/asm/kvm_host.h to
include/linux/kvm_host.h.

Fixes sparse warnings like these when building for arm64:

  virt/kvm/kvm_main.c: warning: symbol 'kvm_lock' was not declared. Should it be static?
  virt/kvm/kvm_main.c: warning: symbol 'vm_list' was not declared. Should it be static?

Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-08 13:02:00 +03:00
Paolo Bonzini
afd80d85ae pmu: prepare for migration support
In order to migrate the PMU state correctly, we need to restore the
values of MSR_CORE_PERF_GLOBAL_STATUS (a read-only register) and
MSR_CORE_PERF_GLOBAL_OVF_CTRL (which has side effects when written).
We also need to write the full 40-bit value of the performance counter,
which would only be possible with a v3 architectural PMU's full-width
counter MSRs.

To distinguish host-initiated writes from the guest's, pass the
full struct msr_data to kvm_pmu_set_msr.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-02 17:42:44 +03:00
Marcelo Tosatti
2ae33b3896 Merge remote-tracking branch 'upstream/master' into queue
Merge reason:

From: Alexander Graf <agraf@suse.de>

"Just recently this really important patch got pulled into Linus' tree for 3.9:

commit 1674400aae
Author: Anton Blanchard <anton <at> samba.org>
Date:   Tue Mar 12 01:51:51 2013 +0000

Without that commit, I can not boot my G5, thus I can't run automated tests on it against my queue.

Could you please merge kvm/next against linus/master, so that I can base my trees against that?"

* upstream/master: (653 commits)
  PCI: Use ROM images from firmware only if no other ROM source available
  sparc: remove unused "config BITS"
  sparc: delete "if !ULTRA_HAS_POPULATION_COUNT"
  KVM: Fix bounds checking in ioapic indirect register reads (CVE-2013-1798)
  KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions (CVE-2013-1797)
  KVM: x86: fix for buffer overflow in handling of MSR_KVM_SYSTEM_TIME (CVE-2013-1796)
  arm64: Kconfig.debug: Remove unused CONFIG_DEBUG_ERRORS
  arm64: Do not select GENERIC_HARDIRQS_NO_DEPRECATED
  inet: limit length of fragment queue hash table bucket lists
  qeth: Fix scatter-gather regression
  qeth: Fix invalid router settings handling
  qeth: delay feature trace
  sgy-cts1000: Remove __dev* attributes
  KVM: x86: fix deadlock in clock-in-progress request handling
  KVM: allow host header to be included even for !CONFIG_KVM
  hwmon: (lm75) Fix tcn75 prefix
  hwmon: (lm75.h) Update header inclusion
  MAINTAINERS: Remove Mark M. Hoffman
  xfs: ensure we capture IO errors correctly
  xfs: fix xfs_iomap_eof_prealloc_initial_size type
  ...

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-21 11:11:52 -03:00
Andy Honig
0b79459b48 KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions (CVE-2013-1797)
There is a potential use after free issue with the handling of
MSR_KVM_SYSTEM_TIME.  If the guest specifies a GPA in a movable or removable
memory such as frame buffers then KVM might continue to write to that
address even after it's removed via KVM_SET_USER_MEMORY_REGION.  KVM pins
the page in memory so it's unlikely to cause an issue, but if the user
space component re-purposes the memory previously used for the guest, then
the guest will be able to corrupt that memory.

Tested: Tested against kvmclock unit test

Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-19 14:17:35 -03:00
Takuya Yoshikawa
982b3394dd KVM: x86: Optimize mmio spte zapping when creating/moving memslot
When we create or move a memory slot, we need to zap mmio sptes.
Currently, zap_all() is used for this and this is causing two problems:
 - extra page faults after zapping mmu pages
 - long mmu_lock hold time during zapping mmu pages

For the latter, Marcelo reported a disastrous mmu_lock hold time during
hot-plug, which made the guest unresponsive for a long time.

This patch takes a simple way to fix these problems: do not zap mmu
pages unless they are marked mmio cached.  On our test box, this took
only 50us for the 4GB guest and we did not see ms of mmu_lock hold time
any more.

Note that we still need to do zap_all() for other cases.  So another
work is also needed: Xiao's work may be the one.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-14 10:21:21 +02:00
Takuya Yoshikawa
95b0430d1a KVM: MMU: Mark sp mmio cached when creating mmio spte
This will be used not to zap unrelated mmu pages when creating/moving
a memory slot later.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-14 10:21:10 +02:00
Jan Kiszka
66450a21f9 KVM: x86: Rework INIT and SIPI handling
A VCPU sending INIT or SIPI to some other VCPU races for setting the
remote VCPU's mp_state. When we were unlucky, KVM_MP_STATE_INIT_RECEIVED
was overwritten by kvm_emulate_halt and, thus, got lost.

This introduces APIC events for those two signals, keeping them in
kvm_apic until kvm_apic_accept_events is run over the target vcpu
context. kvm_apic_has_events reports to kvm_arch_vcpu_runnable if there
are pending events, thus if vcpu blocking should end.

The patch comes with the side effect of effectively obsoleting
KVM_MP_STATE_SIPI_RECEIVED. We still accept it from user space, but
immediately translate it to KVM_MP_STATE_INIT_RECEIVED + KVM_APIC_SIPI.
The vcpu itself will no longer enter the KVM_MP_STATE_SIPI_RECEIVED
state. That also means we no longer exit to user space after receiving a
SIPI event.

Furthermore, we already reset the VCPU on INIT, only fixing up the code
segment later on when SIPI arrives. Moreover, we fix INIT handling for
the BSP: it never enter wait-for-SIPI but directly starts over on INIT.

Tested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-13 16:08:10 +02:00
Jan Kiszka
57f252f229 KVM: x86: Drop unused return code from VCPU reset callback
Neither vmx nor svm nor the common part may generate an error on
kvm_vcpu_reset. So drop the return code.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-12 13:25:56 +02:00
Yang Zhang
c7c9c56ca2 x86, apicv: add virtual interrupt delivery support
Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
manually, which is fully taken care of by the hardware. This needs
some special awareness into existing interrupr injection path:

- for pending interrupt, instead of direct injection, we may need
  update architecture specific indicators before resuming to guest.

- A pending interrupt, which is masked by ISR, should be also
  considered in above update action, since hardware will decide
  when to inject it at right time. Current has_interrupt and
  get_interrupt only returns a valid vector from injection p.o.v.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-29 10:48:19 +02:00
Yang Zhang
8d14695f95 x86, apicv: add virtual x2apic support
basically to benefit from apicv, we need to enable virtualized x2apic mode.
Currently, we only enable it when guest is really using x2apic.

Also, clear MSR bitmap for corresponding x2apic MSRs when guest enabled x2apic:
0x800 - 0x8ff: no read intercept for apicv register virtualization,
               except APIC ID and TMCCT which need software's assistance to
               get right value.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-29 10:48:06 +02:00
Xiao Guangrong
93c05d3ef2 KVM: x86: improve reexecute_instruction
The current reexecute_instruction can not well detect the failed instruction
emulation. It allows guest to retry all the instructions except it accesses
on error pfn

For example, some cases are nested-write-protect - if the page we want to
write is used as PDE but it chains to itself. Under this case, we should
stop the emulation and report the case to userspace

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-21 22:58:33 -02:00
Takuya Yoshikawa
e12091ce7b KVM: Remove unused slot_bitmap from kvm_mmu_page
Not needed any more.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-14 11:13:58 +02:00
Alex Williamson
0f888f5acd KVM: Increase user memory slots on x86 to 125
With the 3 private slots, this gives us a nice round 128 slots total.
The primary motivation for this is to support more assigned devices.
Each assigned device can theoretically use up to 8 slots (6 MMIO BARs,
1 ROM BAR, 1 spare for a split MSI-X table mapping) though it's far
more typical for a device to use 3-4 slots.  If we assume a typical VM
uses a dozen slots for non-assigned devices purposes, we should always
be able to support 14 worst case assigned devices or 28 to 37 typical
devices.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:25:25 -02:00
Alex Williamson
0743247fbf KVM: Make KVM_PRIVATE_MEM_SLOTS optional
Seems like everyone copied x86 and defined 4 private memory slots
that never actually get used.  Even x86 only uses 3 of the 4.  These
aren't exposed so there's no need to add padding.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:21:58 -02:00
Alex Williamson
bbacc0c111 KVM: Rename KVM_MEMORY_SLOTS -> KVM_USER_MEM_SLOTS
It's easy to confuse KVM_MEMORY_SLOTS and KVM_MEM_SLOTS_NUM.  One is
the user accessible slots and the other is user + private.  Make this
more obvious.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:21:57 -02:00
Will Auld
ba904635d4 KVM: x86: Emulate IA32_TSC_ADJUST MSR
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported

Basic design is to emulate the MSR by allowing reads and writes to a guest
vcpu specific location to store the value of the emulated MSR while adding
the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will
be included in all reads to the TSC MSR whether through rdmsr or rdtsc. This
is of course as long as the "use TSC counter offsetting" VM-execution control
is enabled as well as the IA32_TSC_ADJUST control.

However, because hardware will only return the TSC + IA32_TSC_ADJUST +
vmsc tsc_offset for a guest process when it does and rdtsc (with the correct
settings) the value of our virtualized IA32_TSC_ADJUST must be stored in one
of these three locations. The argument against storing it in the actual MSR
is performance. This is likely to be seldom used while the save/restore is
required on every transition. IA32_TSC_ADJUST was created as a way to solve
some issues with writing TSC itself so that is not an option either.

The remaining option, defined above as our solution has the problem of
returning incorrect vmcs tsc_offset values (unless we intercept and fix, not
done here) as mentioned above. However, more problematic is that storing the
data in vmcs tsc_offset will have a different semantic effect on the system
than does using the actual MSR. This is illustrated in the following example:

The hypervisor set the IA32_TSC_ADJUST, then the guest sets it and a guest
process performs a rdtsc. In this case the guest process will get
TSC + IA32_TSC_ADJUST_hyperviser + vmsc tsc_offset including
IA32_TSC_ADJUST_guest. While the total system semantics changed the semantics
as seen by the guest do not and hence this will not cause a problem.

Signed-off-by: Will Auld <will.auld@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-30 18:29:30 -02:00
Will Auld
8fe8ab46be KVM: x86: Add code to track call origin for msr assignment
In order to track who initiated the call (host or guest) to modify an msr
value I have changed function call parameters along the call path. The
specific change is to add a struct pointer parameter that points to (index,
data, caller) information rather than having this information passed as
individual parameters.

The initial use for this capability is for updating the IA32_TSC_ADJUST msr
while setting the tsc value. It is anticipated that this capability is
useful for other tasks.

Signed-off-by: Will Auld <will.auld@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-30 18:26:12 -02:00
Marcelo Tosatti
b48aa97e38 KVM: x86: require matched TSC offsets for master clock
With master clock, a pvclock clock read calculates:

ret = system_timestamp + [ (rdtsc + tsc_offset) - tsc_timestamp ]

Where 'rdtsc' is the host TSC.

system_timestamp and tsc_timestamp are unique, one tuple
per VM: the "master clock".

Given a host with synchronized TSCs, its obvious that
guest TSC must be matched for the above to guarantee monotonicity.

Allow master clock usage only if guest TSCs are synchronized.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:15 -02:00
Marcelo Tosatti
d828199e84 KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
KVM added a global variable to guarantee monotonicity in the guest.
One of the reasons for that is that the time between

	1. ktime_get_ts(&timespec);
	2. rdtscll(tsc);

Is variable. That is, given a host with stable TSC, suppose that
two VCPUs read the same time via ktime_get_ts() above.

The time required to execute 2. is not the same on those two instances
executing in different VCPUS (cache misses, interrupts...).

If the TSC value that is used by the host to interpolate when
calculating the monotonic time is the same value used to calculate
the tsc_timestamp value stored in the pvclock data structure, and
a single <system_timestamp, tsc_timestamp> tuple is visible to all
vcpus simultaneously, this problem disappears. See comment on top
of pvclock_update_vm_gtod_copy for details.

Monotonicity is then guaranteed by synchronicity of the host TSCs
and guest TSCs.

Set TSC stable pvclock flag in that case, allowing the guest to read
clock from userspace.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:13 -02:00
Marcelo Tosatti
886b470cb1 KVM: x86: pass host_tsc to read_l1_tsc
Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:11 -02:00
Linus Torvalds
ecefbd94b8 KVM updates for the 3.7 merge window
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQbY/2AAoJEI7yEDeUysxlymQQAIv5svpAI/FUe3FhvBi3IW2h
 WWMIpbdhHyocaINT18qNp8prO0iwoaBfgsnU8zuB34MrbdUgiwSHgM6T4Ff4NGa+
 R4u+gpyKYwxNQYKeJyj04luXra/krxwHL1u9OwN7o44JuQXAmzrw2tZ9ad1ArvL3
 eoZ6kGsPcdHPZMZWw2jN5xzBsRtqybm0GPPQh1qPXdn8UlPPd1X7owvbaud2y4+e
 StVIpGY6wrsO36f7UcA4Gm1EP/1E6Lm5KMXJyHgM9WBRkEfp92jTY5+XKv91vK8Z
 VKUd58QMdZE5NCNBkAR9U5N9aH0oSXnFU/g8hgiwGvrhS3IsSkKUePE6sVyMVTIO
 VptKRYe0AdmD/g25p6ApJsguV7ITlgoCPaE4rMmRcW9/bw8+iY098r7tO7w11H8M
 TyFOXihc3B+rlH8WdzOblwxHMC4yRuiPIktaA3WwbX7eA7Xv/ZRtdidifXKtgsVE
 rtubVqwGyYcHoX1Y+JiByIW1NN0pYncJhPEdc8KbRe2wKs3amA9rio1mUpBYYBPO
 B0ygcITftyXbhcTtssgcwBDGXB0AAGqI7wqdtJhFeIrKwHXD7fNeAGRwO8oKxmlj
 0aPwo9fDtpI+e6BFTohEgjZBocRvXXNWLnDSFB0E7xDR31bACck2FG5FAp1DxdS7
 lb/nbAsXf9UJLgGir4I1
 =kN6V
 -----END PGP SIGNATURE-----

Merge tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Avi Kivity:
 "Highlights of the changes for this release include support for vfio
  level triggered interrupts, improved big real mode support on older
  Intels, a streamlines guest page table walker, guest APIC speedups,
  PIO optimizations, better overcommit handling, and read-only memory."

* tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (138 commits)
  KVM: s390: Fix vcpu_load handling in interrupt code
  KVM: x86: Fix guest debug across vcpu INIT reset
  KVM: Add resampling irqfds for level triggered interrupts
  KVM: optimize apic interrupt delivery
  KVM: MMU: Eliminate pointless temporary 'ac'
  KVM: MMU: Avoid access/dirty update loop if all is well
  KVM: MMU: Eliminate eperm temporary
  KVM: MMU: Optimize is_last_gpte()
  KVM: MMU: Simplify walk_addr_generic() loop
  KVM: MMU: Optimize pte permission checks
  KVM: MMU: Update accessed and dirty bits after guest pagetable walk
  KVM: MMU: Move gpte_access() out of paging_tmpl.h
  KVM: MMU: Optimize gpte_access() slightly
  KVM: MMU: Push clean gpte write protection out of gpte_access()
  KVM: clarify kvmclock documentation
  KVM: make processes waiting on vcpu mutex killable
  KVM: SVM: Make use of asm.h
  KVM: VMX: Make use of asm.h
  KVM: VMX: Make lto-friendly
  KVM: x86: lapic: Clean up find_highest_vector() and count_vectors()
  ...

Conflicts:
	arch/s390/include/asm/processor.h
	arch/x86/kvm/i8259.c
2012-10-04 09:30:33 -07:00
Jan Kiszka
c863901075 KVM: x86: Fix guest debug across vcpu INIT reset
If we reset a vcpu on INIT, we so far overwrote dr7 as provided by
KVM_SET_GUEST_DEBUG, and we also cleared switch_db_regs unconditionally.

Fix this by saving the dr7 used for guest debugging and calculating the
effective register value as well as switch_db_regs on any potential
change. This will change to focus of the set_guest_debug vendor op to
update_dp_bp_intercept.

Found while trying to stop on start_secondary.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-23 15:00:07 +02:00
Xiao Guangrong
26bf264e87 KVM: x86: Export svm/vmx exit code and vector code to userspace
Exporting KVM exit information to userspace to be consumed by perf.

Signed-off-by: Dong Hao <haodong@linux.vnet.ibm.com>
[ Dong Hao <haodong@linux.vnet.ibm.com>: rebase it on acme's git tree ]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Runzhen Wang <runzhen@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1347870675-31495-2-git-send-email-haodong@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2012-09-21 12:48:09 -03:00
Gleb Natapov
1e08ec4a13 KVM: optimize apic interrupt delivery
Most interrupt are delivered to only one vcpu. Use pre-build tables to
find interrupt destination instead of looping through all vcpus. In case
of logical mode loop only through vcpus in a logical cluster irq is sent
to.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 15:05:26 +03:00
Avi Kivity
6fd01b711b KVM: MMU: Optimize is_last_gpte()
Instead of branchy code depending on level, gpte.ps, and mmu configuration,
prepare everything in a bitmap during mode changes and look it up during
runtime.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:09 +03:00
Avi Kivity
97d64b7881 KVM: MMU: Optimize pte permission checks
walk_addr_generic() permission checks are a maze of branchy code, which is
performed four times per lookup.  It depends on the type of access, efer.nxe,
cr0.wp, cr4.smep, and in the near future, cr4.smap.

Optimize this away by precalculating all variants and storing them in a
bitmap.  The bitmap is recalculated when rarely-changing variables change
(cr0, cr4) and is indexed by the often-changing variables (page fault error
code, pte access permissions).

The permission check is moved to the end of the loop, otherwise an SMEP
fault could be reported as a false positive, when PDE.U=1 but PTE.U=0.
Noted by Xiao Guangrong.

The result is short, branch-free code.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:08 +03:00
Gleb Natapov
716d51abff KVM: Provide userspace IO exit completion callback
Current code assumes that IO exit was due to instruction emulation
and handles execution back to emulator directly. This patch adds new
userspace IO exit completion callback that can be set by any other code
that caused IO exit to userspace.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-06 18:06:37 +03:00
Marcelo Tosatti
51d59c6b42 KVM: x86: fix pvclock guest stopped flag reporting
kvm_guest_time_update unconditionally clears hv_clock.flags field,
so the notification never reaches the guest.

Fix it by allowing PVCLOCK_GUEST_STOPPED to passthrough.

Reviewed-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-13 16:10:45 -03:00
Takuya Yoshikawa
d89cc617b9 KVM: Push rmap into kvm_arch_memory_slot
Two reasons:
 - x86 can integrate rmap and rmap_pde and remove heuristics in
   __gfn_to_rmap().
 - Some architectures do not need rmap.

Since rmap is one of the most memory consuming stuff in KVM, ppc'd
better restrict the allocation to Book3S HV.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 12:47:30 +03:00
Avi Kivity
e9bda6f6f9 Merge branch 'queue' into next
Merge patches queued during the run-up to the merge window.

* queue: (25 commits)
  KVM: Choose better candidate for directed yield
  KVM: Note down when cpu relax intercepted or pause loop exited
  KVM: Add config to support ple or cpu relax optimzation
  KVM: switch to symbolic name for irq_states size
  KVM: x86: Fix typos in pmu.c
  KVM: x86: Fix typos in lapic.c
  KVM: x86: Fix typos in cpuid.c
  KVM: x86: Fix typos in emulate.c
  KVM: x86: Fix typos in x86.c
  KVM: SVM: Fix typos
  KVM: VMX: Fix typos
  KVM: remove the unused parameter of gfn_to_pfn_memslot
  KVM: remove is_error_hpa
  KVM: make bad_pfn static to kvm_main.c
  KVM: using get_fault_pfn to get the fault pfn
  KVM: MMU: track the refcount when unmap the page
  KVM: x86: remove unnecessary mark_page_dirty
  KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range()
  KVM: MMU: Push trace_kvm_age_page() into kvm_age_rmapp()
  KVM: MMU: Add memslot parameter to hva handlers
  ...

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-26 11:54:21 +03:00
Linus Torvalds
5fecc9d8f5 KVM updates for the 3.6 merge window
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQDRDNAAoJEI7yEDeUysxlkl8P/3C2AHx2webOU8sVzhfU6ONZ
 ZoGevwBjyZIeJEmiWVpFTTEew1l0PXtpyOocXGNUXIddVnhXTQOKr/Scj4uFbmx8
 ROqgK8NSX9+xOGrBPCoN7SlJkmp+m6uYtwYkl2SGnsEVLWMKkc7J7oqmszCcTQvN
 UXMf7G47/Ul2NUSBdv4Yvizhl4kpvWxluiweDw3E/hIQKN0uyP7CY58qcAztw8nG
 csZBAnnuPFwIAWxHXW3eBBv4UP138HbNDqJ/dujjocM6GnOxmXJmcZ6b57gh+Y64
 3+w9IR4qrRWnsErb/I8inKLJ1Jdcf7yV2FmxYqR4pIXay2Yzo1BsvFd6EB+JavUv
 pJpixrFiDDFoQyXlh4tGpsjpqdXNMLqyG4YpqzSZ46C8naVv9gKE7SXqlXnjyDlb
 Llx3hb9Fop8O5ykYEGHi+gIISAK5eETiQl4yw9RUBDpxydH4qJtqGIbLiDy8y9wi
 Xyi8PBlNl+biJFsK805lxURqTp/SJTC3+Zb7A7CzYEQm5xZw3W/CKZx1ZYBfpaa/
 pWaP6tB7JwgLIVXi4HQayLWqMVwH0soZIn9yazpOEFv6qO8d5QH5RAxAW2VXE3n5
 JDlrajar/lGIdiBVWfwTJLb86gv3QDZtIWoR9mZuLKeKWE/6PRLe7HQpG1pJovsm
 2AsN5bS0BWq+aqPpZHa5
 =pECD
 -----END PGP SIGNATURE-----

Merge tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Avi Kivity:
 "Highlights include
  - full big real mode emulation on pre-Westmere Intel hosts (can be
    disabled with emulate_invalid_guest_state=0)
  - relatively small ppc and s390 updates
  - PCID/INVPCID support in guests
  - EOI avoidance; 3.6 guests should perform better on 3.6 hosts on
    interrupt intensive workloads)
  - Lockless write faults during live migration
  - EPT accessed/dirty bits support for new Intel processors"

Fix up conflicts in:
 - Documentation/virtual/kvm/api.txt:

   Stupid subchapter numbering, added next to each other.

 - arch/powerpc/kvm/booke_interrupts.S:

   PPC asm changes clashing with the KVM fixes

 - arch/s390/include/asm/sigp.h, arch/s390/kvm/sigp.c:

   Duplicated commits through the kvm tree and the s390 tree, with
   subsequent edits in the KVM tree.

* tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
  KVM: fix race with level interrupts
  x86, hyper: fix build with !CONFIG_KVM_GUEST
  Revert "apic: fix kvm build on UP without IOAPIC"
  KVM guest: switch to apic_set_eoi_write, apic_write
  apic: add apic_set_eoi_write for PV use
  KVM: VMX: Implement PCID/INVPCID for guests with EPT
  KVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check
  KVM: PPC: Critical interrupt emulation support
  KVM: PPC: e500mc: Fix tlbilx emulation for 64-bit guests
  KVM: PPC64: booke: Set interrupt computation mode for 64-bit host
  KVM: PPC: bookehv: Add ESR flag to Data Storage Interrupt
  KVM: PPC: bookehv64: Add support for std/ld emulation.
  booke: Added crit/mc exception handler for e500v2
  booke/bookehv: Add host crit-watchdog exception support
  KVM: MMU: document mmu-lock and fast page fault
  KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint
  KVM: MMU: trace fast page fault
  KVM: MMU: fast path of handling guest page fault
  KVM: MMU: introduce SPTE_MMU_WRITEABLE bit
  KVM: MMU: fold tlb flush judgement into mmu_spte_update
  ...
2012-07-24 12:01:20 -07:00
Michael S. Tsirkin
1a577b7247 KVM: fix race with level interrupts
When more than 1 source id is in use for the same GSI, we have the
following race related to handling irq_states race:

CPU 0 clears bit 0. CPU 0 read irq_state as 0. CPU 1 sets level to 1.
CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0).
Now ioapic thinks the level is 0 but irq_state is not 0.

Fix by performing all irq_states bitmap handling under pic/ioapic lock.
This also removes the need for atomics with irq_states handling.

Reported-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-20 16:12:00 -03:00
Takuya Yoshikawa
77d11309b3 KVM: Separate rmap_pde from kvm_lpage_info->write_count
This makes it possible to loop over rmap_pde arrays in the same way as
we do over rmap so that we can optimize kvm_handle_hva_range() easily in
the following patch.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-18 16:55:04 -03:00
Takuya Yoshikawa
b3ae209697 KVM: Introduce kvm_unmap_hva_range() for kvm_mmu_notifier_invalidate_range_start()
When we tested KVM under memory pressure, with THP enabled on the host,
we noticed that MMU notifier took a long time to invalidate huge pages.

Since the invalidation was done with mmu_lock held, it not only wasted
the CPU but also made the host harder to respond.

This patch mitigates this by using kvm_handle_hva_range().

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Alexander Graf <agraf@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-18 16:55:04 -03:00
Mao, Junjie
ad756a1603 KVM: VMX: Implement PCID/INVPCID for guests with EPT
This patch handles PCID/INVPCID for guests.

Process-context identifiers (PCIDs) are a facility by which a logical processor
may cache information for multiple linear-address spaces so that the processor
may retain cached information when software switches to a different linear
address space. Refer to section 4.10.1 in IA32 Intel Software Developer's Manual
Volume 3A for details.

For guests with EPT, the PCID feature is enabled and INVPCID behaves as running
natively.
For guests without EPT, the PCID feature is disabled and INVPCID triggers #UD.

Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-12 13:07:34 +03:00
Robert Richter
15c7ad51ad perf/x86: Rename Intel specific macros
There are macros that are Intel specific and not x86 generic. Rename
them into INTEL_*.

This patch removes X86_PMC_IDX_GENERIC and does:

 $ sed -i -e 's/X86_PMC_MAX_/INTEL_PMC_MAX_/g'           \
         arch/x86/include/asm/kvm_host.h                 \
         arch/x86/include/asm/perf_event.h               \
         arch/x86/kernel/cpu/perf_event.c                \
         arch/x86/kernel/cpu/perf_event_p4.c             \
         arch/x86/kvm/pmu.c
 $ sed -i -e 's/X86_PMC_IDX_FIXED/INTEL_PMC_IDX_FIXED/g' \
         arch/x86/include/asm/perf_event.h               \
         arch/x86/kernel/cpu/perf_event.c                \
         arch/x86/kernel/cpu/perf_event_intel.c          \
         arch/x86/kernel/cpu/perf_event_intel_ds.c       \
         arch/x86/kvm/pmu.c
 $ sed -i -e 's/X86_PMC_MSK_/INTEL_PMC_MSK_/g'           \
         arch/x86/include/asm/perf_event.h               \
         arch/x86/kernel/cpu/perf_event.c

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340217996-2254-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:19:39 +02:00
Michael S. Tsirkin
ae7a2a3fb6 KVM: host side for eoi optimization
Implementation of PV EOI using shared memory.
This reduces the number of exits an interrupt
causes as much as by half.

The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
We set it before injecting an interrupt and clear
before injecting a nested one. Guest tests it using
a test and clear operation - this is necessary
so that host can detect interrupt nesting -
and if set, it can skip the EOI MSR.

There's a new MSR to set the address of said register
in guest memory. Otherwise not much changed:
- Guest EOI is not required
- Register is tested & ISR is automatically cleared on exit

For testing results see description of previous patch
'kvm_para: guest side for eoi avoidance'.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-25 12:40:55 +03:00
Linus Torvalds
07acfc2a93 Merge branch 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM changes from Avi Kivity:
 "Changes include additional instruction emulation, page-crossing MMIO,
  faster dirty logging, preventing the watchdog from killing a stopped
  guest, module autoload, a new MSI ABI, and some minor optimizations
  and fixes.  Outside x86 we have a small s390 and a very large ppc
  update.

  Regarding the new (for kvm) rebaseless workflow, some of the patches
  that were merged before we switch trees had to be rebased, while
  others are true pulls.  In either case the signoffs should be correct
  now."

Fix up trivial conflicts in Documentation/feature-removal-schedule.txt
arch/powerpc/kvm/book3s_segment.S and arch/x86/include/asm/kvm_para.h.

I suspect the kvm_para.h resolution ends up doing the "do I have cpuid"
check effectively twice (it was done differently in two different
commits), but better safe than sorry ;)

* 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (125 commits)
  KVM: make asm-generic/kvm_para.h have an ifdef __KERNEL__ block
  KVM: s390: onereg for timer related registers
  KVM: s390: epoch difference and TOD programmable field
  KVM: s390: KVM_GET/SET_ONEREG for s390
  KVM: s390: add capability indicating COW support
  KVM: Fix mmu_reload() clash with nested vmx event injection
  KVM: MMU: Don't use RCU for lockless shadow walking
  KVM: VMX: Optimize %ds, %es reload
  KVM: VMX: Fix %ds/%es clobber
  KVM: x86 emulator: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte()
  KVM: VMX: unlike vmcs on fail path
  KVM: PPC: Emulator: clean up SPR reads and writes
  KVM: PPC: Emulator: clean up instruction parsing
  kvm/powerpc: Add new ioctl to retreive server MMU infos
  kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM
  KVM: PPC: bookehv: Fix r8/r13 storing in level exception handler
  KVM: PPC: Book3S: Enable IRQs during exit handling
  KVM: PPC: Fix PR KVM on POWER7 bare metal
  KVM: PPC: Fix stbux emulation
  KVM: PPC: bookehv: Use lwz/stw instead of PPC_LL/PPC_STL for 32-bit fields
  ...
2012-05-24 16:17:30 -07:00
Avi Kivity
c142786c62 KVM: MMU: Don't use RCU for lockless shadow walking
Using RCU for lockless shadow walking can increase the amount of memory
in use by the system, since RCU grace periods are unpredictable.  We also
have an unconditional write to a shared variable (reader_counter), which
isn't good for scaling.

Replace that with a scheme similar to x86's get_user_pages_fast(): disable
interrupts during lockless shadow walk to force the freer
(kvm_mmu_commit_zap_page()) to wait for the TLB flush IPI to find the
processor with interrupts enabled.

We also add a new vcpu->mode, READING_SHADOW_PAGE_TABLES, to prevent
kvm_flush_remote_tlbs() from avoiding the IPI.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-05-16 16:08:28 -03:00
Gleb Natapov
4138377142 KVM: Introduce bitmask for apic attention reasons
The patch introduces a bitmap that will hold reasons apic should be
checked during vmexit. This is in a preparation for vp eoi patch
that will add one more check on vmexit. With the bitmap we can do
if(apic_attention) to check everything simultaneously which will
add zero overhead on the fast path.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-24 16:36:18 +03:00
H. Peter Anvin
3ee89722cf x86, extable: Remove open-coded exception table entries in arch/x86/include/asm/kvm_host.h
Remove open-coded exception table entries in arch/x86/include/asm/kvm_host.h,
and replace them with _ASM_EXTABLE() macros; this will allow us to
change the format and type of the exception table entries.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Link: http://lkml.kernel.org/r/CA%2B55aFyijf43qSu3N9nWHEBwaGbb7T2Oq9A=9EyR=Jtyqfq_cQ@mail.gmail.com
2012-04-20 13:51:40 -07:00
Takuya Yoshikawa
5dc99b2380 KVM: Avoid checking huge page mappings in get_dirty_log()
Dropped such mappings when we enabled dirty logging and we will never
create new ones until we stop the logging.

For this we introduce a new function which can be used to write protect
a range of PT level pages: although we do not need to care about a range
of pages at this point, the following patch will need this feature to
optimize the write protection of many pages.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 12:49:58 +03:00
Kevin Wolf
7f3d35fddd KVM: x86 emulator: Fix task switch privilege checks
Currently, all task switches check privileges against the DPL of the
TSS. This is only correct for jmp/call to a TSS. If a task gate is used,
the DPL of this take gate is used for the check instead. Exceptions,
external interrupts and iret shouldn't perform any check.

[avi: kill kvm-kmod remnants]

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:26 +02:00
Takuya Yoshikawa
db3fe4eb45 KVM: Introduce kvm_memory_slot::arch and move lpage_info into it
Some members of kvm_memory_slot are not used by every architecture.

This patch is the first step to make this difference clear by
introducing kvm_memory_slot::arch;  lpage_info is moved into it.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:22 +02:00
Zachary Amsden
e26101b116 KVM: Track TSC synchronization in generations
This allows us to track the original nanosecond and counter values
at each phase of TSC writing by the guest.  This gets us perfect
offset matching for stable TSC systems, and perfect software
computed TSC matching for machines with unstable TSC.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:09 +02:00
Zachary Amsden
0dd6a6edb0 KVM: Dont mark TSC unstable due to S4 suspend
During a host suspend, TSC may go backwards, which KVM interprets
as an unstable TSC.  Technically, KVM should not be marking the
TSC unstable, which causes the TSC clocksource to go bad, but we
need to be adjusting the TSC offsets in such a case.

Dealing with this issue is a little tricky as the only place we
can reliably do it is before much of the timekeeping infrastructure
is up and running.  On top of this, we are not in a KVM thread
context, so we may not be able to safely access VCPU fields.
Instead, we compute our best known hardware offset at power-up and
stash it to be applied to all VCPUs when they actually start running.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:08 +02:00
Marcelo Tosatti
f1e2b26003 KVM: Allow adjust_tsc_offset to be in host or guest cycles
Redefine the API to take a parameter indicating whether an
adjustment is in host or guest cycles.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:07 +02:00
Zachary Amsden
6f526ec538 KVM: Add last_host_tsc tracking back to KVM
The variable last_host_tsc was removed from upstream code.  I am adding
it back for two reasons.  First, it is unnecessary to use guest TSC
computation to conclude information about the host TSC.  The guest may
set the TSC backwards (this case handled by the previous patch), but
the computation of guest TSC (and fetching an MSR) is significanlty more
work and complexity than simply reading the hardware counter.  In addition,
we don't actually need the guest TSC for any part of the computation,
by always recomputing the offset, we can eliminate the need to deal with
the current offset and any scaling factors that may apply.

The second reason is that later on, we are going to be using the host
TSC value to restore TSC offsets after a host S4 suspend, so we need to
be reading the host values, not the guest values here.

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:06 +02:00
Zachary Amsden
5d3cb0f6a8 KVM: Improve TSC offset matching
There are a few improvements that can be made to the TSC offset
matching code.  First, we don't need to call the 128-bit multiply
(especially on a constant number), the code works much nicer to
do computation in nanosecond units.

Second, the way everything is setup with software TSC rate scaling,
we currently have per-cpu rates.  Obviously this isn't too desirable
to use in practice, but if for some reason we do change the rate of
all VCPUs at runtime, then reset the TSCs, we will only want to
match offsets for VCPUs running at the same rate.

Finally, for the case where we have an unstable host TSC, but
rate scaling is being done in hardware, we should call the platform
code to compute the TSC offset, so the math is reorganized to recompute
the base instead, then transform the base into an offset using the
existing API.

[avi: fix 64-bit division on i386]

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: Fix 64-bit division in kvm_write_tsc()

Breaks i386 build.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:03 +02:00
Zachary Amsden
cc578287e3 KVM: Infrastructure for software and hardware based TSC rate scaling
This requires some restructuring; rather than use 'virtual_tsc_khz'
to indicate whether hardware rate scaling is in effect, we consider
each VCPU to always have a virtual TSC rate.  Instead, there is new
logic above the vendor-specific hardware scaling that decides whether
it is even necessary to use and updates all rate variables used by
common code.  This means we can simply query the virtual rate at
any point, which is needed for software rate scaling.

There is also now a threshold added to the TSC rate scaling; minor
differences and variations of measured TSC rate can accidentally
provoke rate scaling to be used when it is not needed.  Instead,
we have a tolerance variable called tsc_tolerance_ppm, which is
the maximum variation from user requested rate at which scaling
will be used.  The default is 250ppm, which is the half the
threshold for NTP adjustment, allowing for some hardware variation.

In the event that hardware rate scaling is not available, we can
kludge a bit by forcing TSC catchup to turn on when a faster than
hardware speed has been requested, but there is nothing available
yet for the reverse case; this requires a trap and emulate software
implementation for RDTSC, which is still forthcoming.

[avi: fix 64-bit division on i386]

Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:09:35 +02:00
Marcelo Tosatti
a59cb29e4d KVM: x86: increase recommended max vcpus to 160
Increase recommended max vcpus from 64 to 160 (tested internally
at Red Hat).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:57:34 +02:00
Takuya Yoshikawa
3ea8b75e47 KVM: MMU: Remove unused kvm_pte_chain
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:42 +02:00
Boris Ostrovsky
2b036c6b86 KVM: SVM: Add support for AMD's OSVW feature in guests
In some cases guests should not provide workarounds for errata even when the
physical processor is affected. For example, because of erratum 400 on family
10h processors a Linux guest will read an MSR (resulting in VMEXIT) before
going to idle in order to avoid getting stuck in a non-C0 state. This is not
necessary: HLT and IO instructions are intercepted and therefore there is no
reason for erratum 400 workaround in the guest.

This patch allows us to present a guest with certain errata as fixed,
regardless of the state of actual hardware.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:21 +02:00
Avi Kivity
022cd0e840 KVM: Add generic RDPMC support
Add a helper function that emulates the RDPMC instruction operation.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:24:35 +02:00
Gleb Natapov
f5132b0138 KVM: Expose a version 2 architectural PMU to a guests
Use perf_events to emulate an architectural PMU, version 2.

Based on PMU version 1 emulation by Avi Kivity.

[avi: adjust for cpuid.c]
[jan: fix anonymous field initialization for older gcc]

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:24:29 +02:00
Xiao Guangrong
e459e3228d KVM: MMU: move the relevant mmu code to mmu.c
Move the mmu code in kvm_arch_vcpu_init() to kvm_mmu_create()

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:09 +02:00
Xiao Guangrong
93a5cef07d KVM: introduce KVM_MEM_SLOTS_NUM macro
Introduce KVM_MEM_SLOTS_NUM macro to instead of
KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:34 +02:00
Takuya Yoshikawa
95d4c16ce7 KVM: Optimize dirty logging by rmap_write_protect()
Currently, write protecting a slot needs to walk all the shadow pages
and checks ones which have a pte mapping a page in it.

The walk is overly heavy when dirty pages in that slot are not so many
and checking the shadow pages would result in unwanted cache pollution.

To mitigate this problem, we use rmap_write_protect() and check only
the sptes which can be reached from gfns marked in the dirty bitmap
when the number of dirty pages are less than that of shadow pages.

This criterion is reasonable in its meaning and worked well in our test:
write protection became some times faster than before when the ratio of
dirty pages are low and was not worse even when the ratio was near the
criterion.

Note that the locking for this write protection becomes fine grained.
The reason why this is safe is descripted in the comments.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:20 +02:00
Chris Wright
fb92045843 KVM: MMU: remove KVM host pv mmu support
The host side pv mmu support has been marked for feature removal in
January 2011.  It's not in use, is slower than shadow or hardware
assisted paging, and a maintenance burden.  It's November 2011, time to
remove it.

Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:10 +02:00
Xiao Guangrong
a30f47cb15 KVM: MMU: improve write flooding detected
Detecting write-flooding does not work well, when we handle page written, if
the last speculative spte is not accessed, we treat the page is
write-flooding, however, we can speculative spte on many path, such as pte
prefetch, page synced, that means the last speculative spte may be not point
to the written page and the written page can be accessed via other sptes, so
depends on the Accessed bit of the last speculative spte is not enough

Instead of detected page accessed, we can detect whether the spte is accessed
after it is written, if the spte is not accessed but it is written frequently,
we treat is not a page table or it not used for a long time

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:02 +02:00
Xiao Guangrong
f57f2ef58f KVM: MMU: fast prefetch spte on invlpg path
Fast prefetch spte for the unsync shadow page on invlpg path

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:56 +02:00
Xiao Guangrong
d01f8d5e02 KVM: MMU: do not mark accessed bit on pte write path
In current code, the accessed bit is always set when page fault occurred,
do not need to set it on pte write path

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:53 +02:00
Xiao Guangrong
1cb3f3ae5a KVM: x86: retry non-page-table writing instructions
If the emulation is caused by #PF and it is non-page_table writing instruction,
it means the VM-EXIT is caused by shadow page protected, we can zap the shadow
page and retry this instruction directly

The idea is from Avi

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:50 +02:00
Liu, Jinsong
a3e06bbe84 KVM: emulate lapic tsc deadline timer for guest
This patch emulate lapic tsc deadline timer for guest:
Enumerate tsc deadline timer capability by CPUID;
Enable tsc deadline timer mode by lapic MMIO;
Start tsc deadline timer by WRMSR;

[jan: use do_div()]
[avi: fix for !irqchip_in_kernel()]
[marcelo: another fix for !irqchip_in_kernel()]

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-10-05 15:34:56 +02:00
Avi Kivity
7460fb4a34 KVM: Fix simultaneous NMIs
If simultaneous NMIs happen, we're supposed to queue the second
and next (collapsing them), but currently we sometimes collapse
the second into the first.

Fix by using a counter for pending NMIs instead of a bool; since
the counter limit depends on whether the processor is currently
in an NMI handler, which can only be checked in vcpu context
(via the NMI mask), we add a new KVM_REQ_NMI to request recalculation
of the counter.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:59 +03:00
Nadav Har'El
d5c1785d2f KVM: L1 TSC handling
KVM assumed in several places that reading the TSC MSR returns the value for
L1. This is incorrect, because when L2 is running, the correct TSC read exit
emulation is to return L2's value.

We therefore add a new x86_ops function, read_l1_tsc, to use in places that
specifically need to read the L1 TSC, NOT the TSC of the current level of
guest.

Note that one change, of one line in kvm_arch_vcpu_load, is made redundant
by a different patch sent by Zachary Amsden (and not yet applied):
kvm_arch_vcpu_load() should not read the guest TSC, and if it didn't, of
course we didn't have to change the call of kvm_get_msr() to read_l1_tsc().

[avi: moved callback to kvm_x86_ops tsc block]

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Acked-by: Zachary Amsdem <zamsden@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:02 +03:00
Avi Kivity
e4e517b4be KVM: MMU: Do not unconditionally read PDPTE from guest memory
Architecturally, PDPTEs are cached in the PDPTRs when CR3 is reloaded.
On SVM, it is not possible to implement this, but on VMX this is possible
and was indeed implemented until nested SVM changed this to unconditionally
read PDPTEs dynamically.  This has noticable impact when running PAE guests.

Fix by changing the MMU to read PDPTRs from the cache, falling back to
reading from memory for the nested MMU.

Signed-off-by: Avi Kivity <avi@redhat.com>
Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:18:01 +03:00
Stefan Hajnoczi
0d460ffc09 KVM: Use __print_symbolic() for vmexit tracepoints
The vmexit tracepoints format the exit_reason to make it human-readable.
Since the exit_reason depends on the instruction set (vmx or svm),
formatting is handled with ftrace_print_symbols_seq() by referring to
the appropriate exit reason table.

However, the ftrace_print_symbols_seq() function is not meant to be used
directly in tracepoints since it does not export the formatting table
which userspace tools like trace-cmd and perf use to format traces.

In practice perf dies when formatting vmexit-related events and
trace-cmd falls back to printing the numeric value (with extra
formatting code in the kvm plugin to paper over this limitation).  Other
userspace consumers of vmexit-related tracepoints would be in similar
trouble.

To avoid significant changes to the kvm_exit tracepoint, this patch
moves the vmx and svm exit reason tables into arch/x86/kvm/trace.h and
selects the right table with __print_symbolic() depending on the
instruction set.  Note that __print_symbolic() is designed for exporting
the formatting table to userspace and allows trace-cmd and perf to work.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:59 +03:00
Sasha Levin
8c3ba334f8 KVM: x86: Raise the hard VCPU count limit
The patch raises the hard limit of VCPU count to 254.

This will allow developers to easily work on scalability
and will allow users to test high VCPU setups easily without
patching the kernel.

To prevent possible issues with current setups, KVM_CAP_NR_VCPUS
now returns the recommended VCPU limit (which is still 64) - this
should be a safe value for everybody, while a new KVM_CAP_MAX_VCPUS
returns the hard limit which is now 254.

Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Suggested-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:17:57 +03:00
Xiao Guangrong
c2a2ac2b56 KVM: MMU: lockless walking shadow page table
Use rcu to protect shadow pages table to be freed, so we can safely walk it,
it should run fastly and is needed by mmio page fault

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:38 +03:00
Xiao Guangrong
c37079586f KVM: MMU: remove bypass_guest_pf
The idea is from Avi:
| Maybe it's time to kill off bypass_guest_pf=1.  It's not as effective as
| it used to be, since unsync pages always use shadow_trap_nonpresent_pte,
| and since we convert between the two nonpresent_ptes during sync and unsync.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:33 +03:00
Xiao Guangrong
bebb106a5a KVM: MMU: cache mmio info on page fault path
If the page fault is caused by mmio, we can cache the mmio info, later, we do
not need to walk guest page table and quickly know it is a mmio fault while we
emulate the mmio instruction

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:26 +03:00
Glauber Costa
c9aaa8957f KVM: Steal time implementation
To implement steal time, we need the hypervisor to pass the guest
information about how much time was spent running other processes
outside the VM, while the vcpu had meaningful work to do - halt
time does not count.

This information is acquired through the run_delay field of
delayacct/schedstats infrastructure, that counts time spent in a
runqueue but not running.

Steal time is a per-cpu information, so the traditional MSR-based
infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the
memory area address containing information about steal time

This patch contains the hypervisor part of the steal time infrasructure,
and can be backported independently of the guest portion.

[avi, yongjie: export delayacct_on, to avoid build failures in some configs]

Signed-off-by: Glauber Costa <glommer@redhat.com>
Tested-by: Eric B Munson <emunson@mgebm.net>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Yongjie Ren <yongjie.ren@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-14 12:59:14 +03:00
Avi Kivity
411c588dfb KVM: MMU: Adjust shadow paging to work when SMEP=1 and CR0.WP=0
When CR0.WP=0, we sometimes map user pages as kernel pages (to allow
the kernel to write to them).  Unfortunately this also allows the kernel
to fetch from these pages, even if CR4.SMEP is set.

Adjust for this by also setting NX on the spte in these circumstances.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 13:16:26 +03:00
Yang, Wei
d9c3476d8a KVM: Remove RDWRGSFS bit from CR4_RESERVED_BITS
This patch removes RDWRGSFS bit from CR4_RESERVED_BITS.

Signed-off-by: Yang, Wei <wei.y.yang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 13:16:22 +03:00
Yang, Wei Y
8d9c975fc5 KVM: Remove SMEP bit from CR4_RESERVED_BITS
This patch removes SMEP bit from CR4_RESERVED_BITS.

Signed-off-by: Yang, Wei <wei.y.yang@intel.com>
Signed-off-by: Shan, Haitao <haitao.shan@intel.com>
Signed-off-by: Li, Xin <xin.li@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 13:16:12 +03:00
Nadav Har'El
5e1746d620 KVM: nVMX: Allow setting the VMXE bit in CR4
This patch allows the guest to enable the VMXE bit in CR4, which is a
prerequisite to running VMXON.

Whether to allow setting the VMXE bit now depends on the architecture (svm
or vmx), so its checking has moved to kvm_x86_ops->set_cr4(). This function
now returns an int: If kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4()
will also return 1, and this will cause kvm_set_cr4() will throw a #GP.

Turning on the VMXE bit is allowed only when the nested VMX feature is
enabled, and turning it off is forbidden after a vmxon.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:10 +03:00
Xiao Guangrong
67052b3508 KVM: MMU: remove the arithmetic of parent pte rmap
Parent pte rmap and page rmap are very similar, so use the same arithmetic
for them

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:07 +03:00
Xiao Guangrong
53c07b1878 KVM: MMU: abstract the operation of rmap
Abstract the operation of rmap to spte_list, then we can use it for the
reverse mapping of parent pte in the later patch

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:06 +03:00
Xiao Guangrong
332b207d65 KVM: MMU: optimize pte write path if don't have protected sp
Simply return from kvm_mmu_pte_write path if no shadow page is
write-protected, then we can avoid to walk all shadow pages and hold
mmu-lock

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:02 +03:00
Avi Kivity
5e520e6278 KVM: VMX: Move VMREAD cleanup to exception handler
We clean up a failed VMREAD by clearing the output register.  Do
it in the exception handler instead of unconditionally.  This is
worthwhile since there are more than a hundred call sites.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:00 +03:00
Avi Kivity
2fb92db1ec KVM: VMX: Cache vmcs segment fields
Since the emulator now checks segment limits and access rights, it
generates a lot more accesses to the vmcs segment fields.  Undo some
of the performance hit by cacheing those fields in a read-only cache
(the entire cache is invalidated on any write, or on guest exit).

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:47:45 -04:00
Gleb Natapov
8d7d810255 KVM: mmio_fault_cr2 is not used
Remove unused variable mmio_fault_cr2.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-22 08:39:49 -04:00