L1TF - L1 Terminal Fault
========================

L1 Terminal Fault is a hardware vulnerability which allows unprivileged
speculative access to data which is available in the Level 1 Data Cache
when the page table entry controlling the virtual address, which is used
for the access, has the Present bit cleared or other reserved bits set.

Affected processors
-------------------

This vulnerability affects a wide range of Intel processors. The
vulnerability is not present on:

   - Processors from AMD, Centaur and other non-Intel vendors

   - Older processor models, where the CPU family is < 6

   - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
     Penwell, Pineview, Silvermont, Airmont, Merrifield)

   - The Intel XEON PHI family

   - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
     IA32_ARCH_CAPABILITIES MSR. If the bit is set, the CPU is not affected
     by the Meltdown vulnerability either. These CPUs should become
     available by end of 2018.

Whether a processor is affected or not can be read out from the L1TF
vulnerability file in sysfs. See :ref:`l1tf_sys_info`.

Related CVEs
------------

The following CVE entries are related to the L1TF vulnerability:

   =============  =================  ==============================
   CVE-2018-3615  L1 Terminal Fault  SGX related aspects
   CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
   CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
   =============  =================  ==============================

Problem
-------

If an instruction accesses a virtual address for which the relevant page
table entry (PTE) has the Present bit cleared or other reserved bits set,
then speculative execution ignores the invalid PTE and loads the referenced
data if it is present in the Level 1 Data Cache, as if the page referenced
by the address bits in the PTE was still present and accessible.

While this is a purely speculative mechanism and the instruction will raise
a page fault when it is eventually retired, the mere act of loading the
data and making it available to other speculative instructions opens up the
opportunity for side channel attacks by unprivileged malicious code,
similar to the Meltdown attack.

While Meltdown breaks the user space to kernel space protection, L1TF
allows an attack on any physical memory address in the system and the
attack works across all protection domains. It allows an attack on SGX and
also works from inside virtual machines because the speculation bypasses
the extended page table (EPT) protection mechanism.

Attack scenarios
----------------

1. Malicious user space
^^^^^^^^^^^^^^^^^^^^^^^

   Operating Systems store arbitrary information in the address bits of a
   PTE which is marked non present. This allows a malicious user space
   application to attack the physical memory to which these PTEs resolve.
   In some cases user space can maliciously influence the information
   encoded in the address bits of the PTE, thus making attacks more
   deterministic and more practical.

   The Linux kernel contains a mitigation for this attack vector, PTE
   inversion, which is permanently enabled and has no performance
   impact. The kernel ensures that the address bits of PTEs which are not
   marked present never point to cacheable physical memory space.

   A system with an up to date kernel is protected against attacks from
   malicious user space applications.

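The idea behind PTE inversion can be illustrated with a small sketch. This is
not the kernel's implementation: the physical address width, the page size and
the helper name are assumptions chosen for illustration, and the sketch further
assumes that installed memory occupies only the lower half of the addressable
range, so that an inverted address lands where no cacheable memory exists.

```python
# Illustrative constants (assumptions, not taken from the kernel sources).
PHYS_ADDR_BITS = 46                      # assumed physical address width
PAGE_SHIFT = 12                          # 4 KiB pages
PRESENT = 1 << 0                         # Present bit of an x86 PTE
# Mask covering the address bits of the PTE (bits 12 .. PHYS_ADDR_BITS-1).
ADDR_MASK = ((1 << PHYS_ADDR_BITS) - 1) & ~((1 << PAGE_SHIFT) - 1)


def invert_if_not_present(pte):
    """Flip the address bits of a PTE that is not marked present.

    Present PTEs are returned unchanged. Applying the function twice to a
    non-present PTE restores the original value.
    """
    if pte & PRESENT:
        return pte
    return pte ^ ADDR_MASK


# A non-present PTE whose address bits hold some low value, for example
# swap metadata that would otherwise look like a low physical address.
stale = 0x1234000
inverted = invert_if_not_present(stale)
top_bit_set = bool(inverted & (1 << (PHYS_ADDR_BITS - 1)))
```

In this sketch the inverted entry always has its topmost address bit set, so a
speculative access through it resolves to the upper half of the address range.
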
2. Malicious guest in a virtual machine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   The fact that L1TF breaks all domain protections allows malicious guest
   OSes, which can control the PTEs directly, and malicious guest user
   space applications, which run on an unprotected guest kernel lacking the
   PTE inversion mitigation for L1TF, to attack physical host memory.

   A special aspect of L1TF in the context of virtualization is symmetric
   multi threading (SMT). The Intel implementation of SMT is called
   HyperThreading. The fact that Hyperthreads on the affected processors
   share the L1 Data Cache (L1D) is important for this. As the flaw only
   allows attacks on data which is present in L1D, a malicious guest running
   on one Hyperthread can attack the data which is brought into the L1D by
   the context which runs on the sibling Hyperthread of the same physical
   core. This context can be host OS, host user space or a different guest.

   If the processor does not support Extended Page Tables, the attack is
   only possible when the hypervisor does not sanitize the content of the
   effective (shadow) page tables.

   While solutions exist to mitigate these attack vectors fully, these
   mitigations are not enabled by default in the Linux kernel because they
   can affect performance significantly. The kernel provides several
   mechanisms which can be utilized to address the problem depending on the
   deployment scenario. The mitigations, their protection scope and impact
   are described in the next sections.

   The default mitigations and the rationale for choosing them are explained
   at the end of this document. See :ref:`default_mitigations`.

.. _l1tf_sys_info:

L1TF system information
-----------------------

The Linux kernel provides a sysfs interface to enumerate the current L1TF
status of the system: whether the system is vulnerable, and which
mitigations are active. The relevant sysfs file is:

/sys/devices/system/cpu/vulnerabilities/l1tf

The possible values in this file are:

  ===========================   ===============================
  'Not affected'                The processor is not vulnerable
  'Mitigation: PTE Inversion'   The host protection is active
  ===========================   ===============================

If KVM/VMX is enabled and the processor is vulnerable then the following
information is appended to the 'Mitigation: PTE Inversion' part:

  - SMT status:

    =====================  ================
    'VMX: SMT vulnerable'  SMT is enabled
    'VMX: SMT disabled'    SMT is disabled
    =====================  ================

  - L1D Flush mode:

    ================================  ====================================
    'L1D vulnerable'                  L1D flushing is disabled

    'L1D conditional cache flushes'   L1D flush is conditionally enabled

    'L1D cache flushes'               L1D flush is unconditionally enabled
    ================================  ====================================

The resulting grade of protection is discussed in the following sections.

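As a rough illustration, the status string can be summarized with a few
substring checks. The sketch below keys on the wording in the tables above; the
function name, the returned dictionary layout and the exact separators in the
sample strings are assumptions, so treat it as a starting point rather than a
description of the kernel's output format.

```python
def parse_l1tf_status(text):
    """Summarize the contents of the l1tf vulnerability file.

    Matching is done on substrings so that the order and punctuation of
    the appended VMX information do not matter.
    """
    s = text.strip()
    status = {
        "affected": s != "Not affected",
        "pte_inversion": "PTE Inversion" in s,
        "smt": None,         # "enabled", "disabled" or None if not reported
        "l1d_flush": None,   # "cond", "always", "disabled" or None
    }
    if "SMT vulnerable" in s:
        status["smt"] = "enabled"
    elif "SMT disabled" in s:
        status["smt"] = "disabled"

    # Drop the SMT part so its "vulnerable" is not mistaken for the L1D state.
    rest = s.replace("SMT vulnerable", "")
    if "conditional cache flushes" in rest:
        status["l1d_flush"] = "cond"
    elif "cache flushes" in rest:
        status["l1d_flush"] = "always"
    elif "vulnerable" in rest:
        status["l1d_flush"] = "disabled"
    return status


# Hypothetical sample string, assembled from the values documented above.
sample = "Mitigation: PTE Inversion; VMX: SMT vulnerable, L1D conditional cache flushes"
summary = parse_l1tf_status(sample)
```

On a live system the input would be the text read from
/sys/devices/system/cpu/vulnerabilities/l1tf.
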
Host mitigation mechanism
-------------------------

The kernel is unconditionally protected against L1TF attacks from malicious
user space running on the host.

Guest mitigation mechanisms
---------------------------

.. _l1d_flush:

1. L1D flush on VMENTER
^^^^^^^^^^^^^^^^^^^^^^^

   To make sure that a guest cannot attack data which is present in the L1D
   the hypervisor flushes the L1D before entering the guest.

   Flushing the L1D evicts not only the data which should not be accessed
   by a potentially malicious guest, it also flushes the guest
   data. Flushing the L1D has a performance impact as the processor has to
   bring the flushed guest data back into the L1D. Depending on the
   frequency of VMEXIT/VMENTER and the type of computations in the guest,
   performance degradation in the range of 1% to 50% has been observed. For
   scenarios where guest VMEXIT/VMENTER are rare the performance impact is
   minimal. Virtio and mechanisms like posted interrupts are designed to
   confine the VMEXITs to a bare minimum, but specific configurations and
   application scenarios might still suffer from a high VMEXIT rate.

   The kernel provides two L1D flush modes:

   - conditional ('cond')
   - unconditional ('always')

   The conditional mode avoids L1D flushing after VMEXITs which execute
   only audited code paths before the corresponding VMENTER. These code
   paths have been verified not to expose secrets or other interesting
   data to an attacker, but they can leak information about the
   address space layout of the hypervisor.

   Unconditional mode flushes L1D on all VMENTER invocations and provides
   maximum protection. It has a higher overhead than the conditional
   mode. The overhead cannot be quantified in general as it depends on the
   workload scenario and the resulting number of VMEXITs.

   The general recommendation is to enable L1D flush on VMENTER. The kernel
   defaults to conditional mode on affected processors.

   **Note** that L1D flush does not prevent the SMT problem because the
   sibling thread will also bring back its data into the L1D which makes it
   attackable again.

   L1D flush can be controlled by the administrator via the kernel command
   line and sysfs control files. See :ref:`mitigation_control_command_line`
   and :ref:`mitigation_control_kvm`.

.. _guest_confinement:

2. Guest VCPU confinement to dedicated physical cores
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   To address the SMT problem, it is possible to make a guest or a group of
   guests affine to one or more physical cores. The proper mechanism for
   that is to utilize exclusive cpusets to ensure that no other guest or
   host tasks can run on these cores.

   If only a single guest or related guests run on sibling SMT threads on
   the same physical core then they can only attack their own memory and
   restricted parts of the host memory.

   Host memory is attackable when one of the sibling SMT threads runs in
   host OS (hypervisor) context and the other in guest context. The amount
   of valuable information from the host OS context depends on the context
   which the host OS executes, i.e. interrupts, soft interrupts and kernel
   threads. The amount of valuable data from these contexts cannot be
   declared as non-interesting for an attacker without deep inspection of
   the code.

   **Note** that assigning guests to a fixed set of physical cores affects
   the ability of the scheduler to do load balancing and might have
   negative effects on CPU utilization depending on the hosting
   scenario. Disabling SMT might be a viable alternative for particular
   scenarios.

   For further information about confining guests to a single or to a group
   of cores consult the cpusets documentation:

   https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v1/cpusets.rst

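When planning such a confinement it helps to check that a guest's CPU set
covers whole physical cores. The sketch below assumes the sibling information
comes from the per-CPU topology/thread_siblings_list files in sysfs, which this
document does not name; the function names and the sample topology are
hypothetical.

```python
def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,4" or "0-3,8" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def siblings_outside(guest_cpus, sibling_lists):
    """Return SMT siblings of the guest's CPUs that are not in the guest set.

    `sibling_lists` maps a logical CPU number to the text of its sibling
    list. A non-empty result means some physical core is shared with
    another context.
    """
    guest = set(guest_cpus)
    shared = set()
    for cpu in guest:
        shared |= set(parse_cpu_list(sibling_lists[cpu])) - guest
    return sorted(shared)


# Hypothetical 4-core, 8-thread topology: CPU n and CPU n+4 share a core.
topology = {n: f"{n % 4},{n % 4 + 4}" for n in range(8)}
shared = siblings_outside([0, 4, 1], topology)
```

Here CPU 5 is reported because it shares a core with CPU 1 but is not part of
the guest's set.
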
.. _interrupt_isolation:

3. Interrupt affinity
^^^^^^^^^^^^^^^^^^^^^

   Interrupts can be made affine to logical CPUs. This is not universally
   true because there are types of interrupts which are truly per CPU
   interrupts, e.g. the local timer interrupt. Aside from that, multi queue
   devices affine their interrupts to single CPUs or groups of CPUs per
   queue without allowing the administrator to control the affinities.

   Moving the interrupts which can be affinity controlled away from CPUs
   which run untrusted guests reduces the attack vector space.

   Whether the interrupts which are affine to CPUs that run untrusted
   guests provide interesting data for an attacker depends on the system
   configuration and the scenarios which run on the system. While for some
   of the interrupts it can be assumed that they won't expose interesting
   information beyond exposing hints about the host OS memory layout, there
   is no way to make general assumptions.

   Interrupt affinity can be controlled by the administrator via the
   /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
   available at:

   https://www.kernel.org/doc/Documentation/core-api/irq/irq-affinity.rst

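To see which interrupts can still fire on guest CPUs, the affinity masks can be
decoded and compared with the guest's CPU set. The sketch below assumes the
smp_affinity format of hexadecimal digits, optionally split into
comma-separated 32-bit groups; the function names and the sample masks are
hypothetical.

```python
def affinity_mask_to_cpus(text):
    """Decode an smp_affinity style hex mask into a list of CPU numbers."""
    mask = int(text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def irqs_touching(guest_cpus, masks):
    """Return the IRQ numbers whose affinity mask includes a guest CPU.

    `masks` maps an IRQ number to the text of its smp_affinity file.
    """
    guest = set(guest_cpus)
    return sorted(irq for irq, text in masks.items()
                  if guest & set(affinity_mask_to_cpus(text)))


# Hypothetical masks: IRQ 24 may run on CPUs 0-3, IRQ 25 only on CPU 6.
sample_masks = {24: "0f", 25: "40"}
on_guest_cpus = irqs_touching([2, 3], sample_masks)
```

In this example only IRQ 24 overlaps with the guest CPUs 2 and 3, so it would
be a candidate for moving elsewhere if its affinity is controllable.
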
.. _smt_control:

4. SMT control
^^^^^^^^^^^^^^

   To prevent the SMT issues of L1TF it might be necessary to disable SMT
   completely. Disabling SMT can have a significant performance impact, but
   the impact depends on the hosting scenario and the type of workloads.
   The impact of disabling SMT also needs to be weighed against the impact
   of other mitigation solutions like confining guests to dedicated cores.

   The kernel provides a sysfs interface to retrieve the status of SMT and
   to control it. It also provides a kernel command line interface to
   control SMT.

   The kernel command line interface consists of the following options:

     =========== ===========================================================
     nosmt       Affects the bring up of the secondary CPUs during boot. The
                 kernel tries to bring all present CPUs online during the
                 boot process. "nosmt" makes sure that from each physical
                 core only one - the so called primary (hyper) thread is
                 activated. Due to a design flaw of Intel processors related
                 to Machine Check Exceptions the non primary siblings have
                 to be brought up at least partially and are then shut down
                 again.  "nosmt" can be undone via the sysfs interface.

     nosmt=force Has the same effect as "nosmt" but it does not allow
                 undoing the SMT disable via the sysfs interface.
     =========== ===========================================================

   The sysfs interface provides two files:

   - /sys/devices/system/cpu/smt/control
   - /sys/devices/system/cpu/smt/active

   /sys/devices/system/cpu/smt/control:

     This file allows reading out the SMT control state and provides the
     ability to disable or (re)enable SMT. The possible states are:

        ==============  ===================================================
        on              SMT is supported by the CPU and enabled. All
                        logical CPUs can be onlined and offlined without
                        restrictions.

        off             SMT is supported by the CPU and disabled. Only
                        the so called primary SMT threads can be onlined
                        and offlined without restrictions. An attempt to
                        online a non-primary sibling is rejected.

        forceoff        Same as 'off' but the state cannot be controlled.
                        Attempts to write to the control file are rejected.

        notsupported    The processor does not support SMT. It's therefore
                        not affected by the SMT implications of L1TF.
                        Attempts to write to the control file are rejected.
        ==============  ===================================================

     The possible states which can be written into this file to control SMT
     state are:

     - on
     - off
     - forceoff

   /sys/devices/system/cpu/smt/active:

     This file reports whether SMT is enabled and active, i.e. if on any
     physical core two or more sibling threads are online.

   SMT control is also possible at boot time via the l1tf kernel command
   line parameter in combination with L1D flush control. See
   :ref:`mitigation_control_command_line`.

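The rules for the control file can be condensed into a small check. This is a
sketch of the state table above, not kernel code: the function names are
hypothetical, and a running kernel may report states that this document does
not list.

```python
from pathlib import Path

# States that this document lists as writable to smt/control.
WRITABLE_STATES = ("on", "off", "forceoff")


def smt_write_allowed(current, requested):
    """Return True if writing `requested` should be accepted.

    Per the table above, writes are rejected once the state is
    'forceoff' or when SMT is 'notsupported'.
    """
    if requested not in WRITABLE_STATES:
        return False
    return current in ("on", "off")


def read_smt_control(base="/sys/devices/system/cpu/smt"):
    """Read the current control state; `base` is overridable for testing."""
    return Path(base, "control").read_text().strip()


can_reenable = smt_write_allowed("off", "on")
locked = not smt_write_allowed("forceoff", "on")
```

A management script could call read_smt_control() first and only attempt a
write when smt_write_allowed() agrees.
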
5. Disabling EPT
^^^^^^^^^^^^^^^^

  Disabling EPT for virtual machines provides full mitigation for L1TF even
  with SMT enabled, because the effective page tables for guests are
  managed and sanitized by the hypervisor. However, disabling EPT has a
  significant performance impact, especially when the Meltdown mitigation
  KPTI is enabled.

  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

There is ongoing research and development for new mitigation mechanisms to
address the performance impact of disabling SMT or EPT.

.. _mitigation_control_command_line:

Mitigation control on the kernel command line
---------------------------------------------

The kernel command line allows controlling the L1TF mitigations at boot
time with the option "l1tf=". The valid arguments for this option are:

  ============  =============================================================
  full          Provides all available mitigations for the L1TF
                vulnerability. Disables SMT and enables all mitigations in
                the hypervisors, i.e. unconditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot.  Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  full,force    Same as 'full', but disables SMT and L1D flush runtime
                control. Implies the 'nosmt=force' command line option.
                (i.e. sysfs control of SMT is disabled.)

  flush         Leaves SMT enabled and enables the default hypervisor
                mitigation, i.e. conditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot.  Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  flush,nosmt   Disables SMT and enables the default hypervisor mitigation,
                i.e. conditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot.  Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  flush,nowarn  Same as 'flush', but hypervisors will not warn when a VM is
                started in a potentially insecure configuration.

  off           Disables hypervisor mitigations and doesn't emit any
                warnings.
                It also drops the swap size and available RAM limit
                restrictions on both hypervisor and bare metal.
  ============  =============================================================

The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.

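The option table can be restated as data, which makes the differences between
the arguments easier to compare. The field names and the helper below are
hypothetical; a value of None marks something the table does not state, and
the choice that the last "l1tf=" occurrence on the command line wins is an
assumption.

```python
# One entry per documented "l1tf=" argument.
#   smt_off:         SMT is disabled at boot
#   flush:           L1D flush mode enabled in the hypervisor
#   runtime_control: sysfs control of SMT and L1D flush stays possible
#   warn:            hypervisors warn about insecure configurations
L1TF_OPTIONS = {
    "full":         {"smt_off": True,  "flush": "always", "runtime_control": True,  "warn": True},
    "full,force":   {"smt_off": True,  "flush": "always", "runtime_control": False, "warn": True},
    "flush":        {"smt_off": False, "flush": "cond",   "runtime_control": True,  "warn": True},
    "flush,nosmt":  {"smt_off": True,  "flush": "cond",   "runtime_control": True,  "warn": True},
    "flush,nowarn": {"smt_off": False, "flush": "cond",   "runtime_control": True,  "warn": False},
    # The table says nothing about SMT or runtime control for 'off'.
    "off":          {"smt_off": None,  "flush": "never",  "runtime_control": None,  "warn": False},
}


def l1tf_option(cmdline):
    """Extract the l1tf= argument from a kernel command line string."""
    value = "flush"  # documented default
    for token in cmdline.split():
        if token.startswith("l1tf="):
            value = token[len("l1tf="):]
    if value not in L1TF_OPTIONS:
        raise ValueError(f"unknown l1tf argument: {value!r}")
    return value


# Hypothetical command line for illustration.
chosen = l1tf_option("ro quiet l1tf=flush,nosmt")
effects = L1TF_OPTIONS[chosen]
```

On a running system the command line text is available in /proc/cmdline.
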
.. _mitigation_control_kvm:

Mitigation control for KVM - module parameter
---------------------------------------------

The KVM hypervisor mitigation mechanism, flushing the L1D cache when
entering a guest, can be controlled with a module parameter.

The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
following arguments:

  ============  ==============================================================
  always        L1D cache flush on every VMENTER.

  cond          Flush L1D on VMENTER only when the code between VMEXIT and
                VMENTER can leak host memory which is considered
                interesting for an attacker. This can still leak host
                memory which allows e.g. determining the host's address
                space layout.

  never         Disables the mitigation.
  ============  ==============================================================

The parameter can be provided on the kernel command line, as a module
parameter when loading the modules, and can be modified at runtime via the
sysfs file:

/sys/module/kvm_intel/parameters/vmentry_l1d_flush

The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
line, then 'always' is enforced, the kvm-intel.vmentry_l1d_flush
module parameter is ignored and writes to the sysfs file are rejected.

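How the module parameter and the "l1tf=" option combine can be sketched as a
small function. Only two rules come from this document: 'cond' is the default
and 'l1tf=full,force' enforces 'always'. That an explicit module parameter
takes precedence in all other cases is an assumption made for this sketch, as
is the function name.

```python
FLUSH_MODES = ("always", "cond", "never")


def effective_l1d_flush(l1tf_opt="flush", module_param=None):
    """Return the L1D flush mode that results from both settings.

    `module_param` is the kvm-intel.vmentry_l1d_flush value, or None if
    the administrator did not set it.
    """
    if module_param is not None and module_param not in FLUSH_MODES:
        raise ValueError(f"invalid flush mode: {module_param!r}")
    if l1tf_opt == "full,force":
        return "always"               # module parameter is ignored
    if module_param is not None:
        return module_param           # assumed to override the l1tf= choice
    # Fall back to what the l1tf= argument selects.
    return {"full": "always", "off": "never"}.get(l1tf_opt, "cond")


forced = effective_l1d_flush("full,force", module_param="never")
default = effective_l1d_flush()
```

With 'full,force' the result stays 'always' even though 'never' was requested,
which matches the enforcement described above.
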
.. _mitigation_selection:

Mitigation selection guide
--------------------------

1. No virtualization in use
^^^^^^^^^^^^^^^^^^^^^^^^^^^

   The system is protected by the kernel unconditionally and no further
   action is required.

2. Virtualization with trusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   If the guest comes from a trusted source and the guest OS kernel is
   guaranteed to have the L1TF mitigations in place, the system is fully
   protected against L1TF and no further action is required.

   To avoid the overhead of the default L1D flushing on VMENTER the
   administrator can disable the flushing via the kernel command line and
   sysfs control files. See :ref:`mitigation_control_command_line` and
   :ref:`mitigation_control_kvm`.

3. Virtualization with untrusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3.1. SMT not supported or disabled
""""""""""""""""""""""""""""""""""

  If SMT is not supported by the processor or disabled in the BIOS or by
  the kernel, it's only required to enforce L1D flushing on VMENTER.

  Conditional L1D flushing is the default behaviour and can be tuned. See
  :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

3.2. EPT not supported or disabled
""""""""""""""""""""""""""""""""""

  If EPT is not supported by the processor or disabled in the hypervisor,
  the system is fully protected. SMT can stay enabled and L1D flushing on
  VMENTER is not required.

  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

3.3. SMT and EPT supported and active
"""""""""""""""""""""""""""""""""""""

  If SMT and EPT are supported and active then various degrees of
  mitigations can be employed:

  - L1D flushing on VMENTER:

    L1D flushing on VMENTER is the minimal protection requirement, but it
    is only potent in combination with other mitigation methods.

    Conditional L1D flushing is the default behaviour and can be tuned. See
    :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

  - Guest confinement:

    Confinement of guests to a single or a group of physical cores which
    are not running any other processes can reduce the attack surface
    significantly, but interrupts, soft interrupts and kernel threads can
    still expose valuable data to a potential attacker. See
    :ref:`guest_confinement`.

  - Interrupt isolation:

    Isolating the guest CPUs from interrupts can reduce the attack surface
    further, but still allows a malicious guest to explore a limited amount
    of host physical memory. This can at least be used to gain knowledge
    about the host address space layout. The interrupts which have a fixed
    affinity to the CPUs which run the untrusted guests can, depending on
    the scenario, still trigger soft interrupts and schedule kernel threads
    which might expose valuable information. See
    :ref:`interrupt_isolation`.

The above three mitigation methods combined can provide protection to a
certain degree, but the risk of the remaining attack surface has to be
carefully analyzed. For full protection the following methods are
available:

  - Disabling SMT:

    Disabling SMT and enforcing the L1D flushing provides the maximum
    amount of protection. This mitigation does not depend on any of the
    above mitigation methods.

    SMT control and L1D flushing can be tuned by the command line
    parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
    time with the matching sysfs control files. See :ref:`smt_control`,
    :ref:`mitigation_control_command_line` and
    :ref:`mitigation_control_kvm`.

  - Disabling EPT:

    Disabling EPT provides the maximum amount of protection as well. It
    does not depend on any of the above mitigation methods. SMT can stay
    enabled and L1D flushing is not required, but the performance impact is
    significant.

    EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
    parameter.

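The selection guide up to this point can be condensed into a decision
function. This is only a restatement of sections 1 to 3.3 under simplifying
assumptions: the function name, its parameters and the returned strings are
hypothetical, and "trusted" is taken to include a guest kernel that carries the
L1TF mitigations.

```python
def l1tf_guidance(virtualization, untrusted_guests=False,
                  smt_active=False, ept_enabled=True):
    """Map a deployment scenario to the recommendation in this guide."""
    if not virtualization:
        # Section 1: the kernel protects the host unconditionally.
        return "no action required"
    if not untrusted_guests:
        # Section 2: protected; the default L1D flush may be turned off.
        return "no action required, L1D flush optional"
    if not ept_enabled:
        # Section 3.2: SMT can stay enabled, no L1D flush needed.
        return "fully protected without EPT"
    if not smt_active:
        # Section 3.1: only the flush on VMENTER is required.
        return "enforce L1D flush on VMENTER"
    # Section 3.3: full protection needs one of the two heavier options.
    return "disable SMT and enforce L1D flush, or disable EPT"


cloud_host = l1tf_guidance(True, untrusted_guests=True,
                           smt_active=True, ept_enabled=True)
```

Partial measures such as guest confinement and interrupt isolation are left out
of the sketch because the guide treats them as risk reduction rather than full
protection.
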
3.4. Nested virtual machines
""""""""""""""""""""""""""""

When nested virtualization is in use, three operating systems are involved:
the bare metal hypervisor, the nested hypervisor and the nested virtual
machine.  VMENTER operations from the nested hypervisor into the nested
guest will always be processed by the bare metal hypervisor. If KVM is the
bare metal hypervisor it will:

 - Flush the L1D cache on every switch from the nested hypervisor to the
   nested virtual machine, so that the nested hypervisor's secrets are not
   exposed to the nested virtual machine;

 - Flush the L1D cache on every switch from the nested virtual machine to
   the nested hypervisor; this is a complex operation, and flushing the L1D
   cache prevents the bare metal hypervisor's secrets from being exposed to
   the nested virtual machine;

 - Instruct the nested hypervisor to not perform any L1D cache flush. This
   is an optimization to avoid double L1D flushing.

.. _default_mitigations:

Default mitigations
-------------------

  The kernel default mitigations for vulnerable processors are:

  - PTE inversion to protect against malicious user space. This is done
    unconditionally and cannot be controlled. The swap storage is limited
    to ~16TB.

  - L1D conditional flushing on VMENTER when EPT is enabled for
    a guest.

  The kernel does not by default enforce the disabling of SMT, which leaves
  SMT systems vulnerable when running untrusted guests with EPT enabled.

  The rationale for this choice is:

  - Force disabling SMT can break existing setups, especially with
    unattended updates.

  - If regular users run untrusted guests on their machine, then L1TF is
    just an add on to other malware which might be embedded in an untrusted
    guest, e.g. spam-bots or attacks on the local network.

    There is no technical way to prevent a user from running untrusted code
    on their machines blindly.

  - It's technically extremely unlikely and from today's knowledge even
    impossible that L1TF can be exploited via the most popular attack
    mechanisms like JavaScript because these mechanisms have no way to
    control PTEs. If this were possible and no other mitigation were
    available, then the default might be different.

  - The administrators of cloud and hosting setups have to carefully
    analyze the risk for their scenarios and make the appropriate
    mitigation choices, which might even vary across their deployed
    machines and also result in other changes of their overall setup.
    There is no way for the kernel to provide a sensible default for these
    kinds of scenarios.