Both i386 and x86_64 now copy the relocate_kernel() function into the control
page and execute it from there, using an open-coded function pointer.
Use a typedef for it instead.
[ bp: Put relocate_kernel_ptr ptr arithmetic on a single line. ]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-10-dwmw2@infradead.org
A few places in the kexec control code page make the assumption that the first
instruction of relocate_kernel is at the very start of the page.
To allow for Clang CFI information to be added to relocate_kernel(), as well
as the general principle of removing unwarranted assumptions, fix them to use
the external __relocate_kernel_start symbol that the linker adds. This means
using a separate addq and subq for calculating offsets, as the assembler can
no longer calculate the delta directly for itself and relocations aren't that
versatile. But those values can at least be used relative to a local label to
avoid absolute relocations.
Turn the jump from relocate_kernel() to identity_mapped() into a real indirect
'jmp *%rsi' too, while touching it. There was no real reason for it to be
a push+ret in the first place, and adding Clang CFI info will also give
objtool enough visibility to start complaining 'return with modified stack
frame' about it.
[ bp: Massage commit message. ]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-9-dwmw2@infradead.org
A recent commit caused the relocate_kernel() function to be invoked through
a function pointer, but it does not have CFI information. The resulting trap
occurs after the IDT and GDT have been invalidated, leading to a triple-fault
if CONFIG_CFI_CLANG is enabled.
Using SYM_TYPED_FUNC_START() to provide the CFI information looks like it will
require a prolonged battle with objtool. And is fairly pointless anyway, as
the actual signature comes from a __kcfi_typeid_… symbol emitted from the
C code based on the function prototype it thinks that relocate_kernel has,
rendering the check somewhat tautological.
The simple fix is just to mark machine_kexec() with __nocfi.
Fixes: eeebbde571 ("x86/kexec: Invoke copy of relocate_kernel() instead of the original")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-7-dwmw2@infradead.org
After commit
cb33ff9e06 ("x86/kexec: Move relocate_kernel to kernel .data section"),
kernels configured with an option that uses -ffunction-sections, such as
CONFIG_LTO_CLANG, crash when kexecing because the value of relocate_kernel
does not match the value of __relocate_kernel_start so incorrect code gets
copied via machine_kexec_prepare().
$ llvm-nm good-vmlinux &| rg relocate_kernel
ffffffff83280d41 T __relocate_kernel_end
ffffffff83280b00 T __relocate_kernel_start
ffffffff83280b00 T relocate_kernel
$ llvm-nm bad-vmlinux &| rg relocate_kernel
ffffffff83266100 D __relocate_kernel_end
ffffffff83266100 D __relocate_kernel_start
ffffffff8120b0d8 T relocate_kernel
When -ffunction-sections is enabled, TEXT_MAIN matches on
'.text.[0-9a-zA-Z_]*' to coalesce the function specific functions back
into .text during link time after they have been optimized. Due to the
placement of TEXT_TEXT before KEXEC_RELOCATE_KERNEL in the x86 linker
script, the .text.relocate_kernel section ends up in .text instead of
.data.
Use a second dot in the relocate_kernel section name to avoid matching
on TEXT_MAIN, which matches a similar situation that happened in
commit
79cd2a1122 ("x86/retpoline,kprobes: Fix position of thunk sections with CONFIG_LTO_CLANG"),
which allows kexec to function properly.
While .data.relocate_kernel still ends up in the .data section via
DATA_MAIN -> DATA_DATA, ensure it is located with the
.text.relocate_kernel section as intended by performing the same
transformation.
Fixes: cb33ff9e06 ("x86/kexec: Move relocate_kernel to kernel .data section")
Fixes: 8dbec5c77b ("x86/kexec: Add data section to relocate_kernel")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-6-dwmw2@infradead.org
A ::preserve_context kimage can be invoked more than once, and the entry point
can be different every time. When the callee returns to the kernel, it leaves
the address of its entry point for next time on the stack.
That being the case, one might reasonably assume that the caller would
allocate space for it on the stack frame before actually performing the 'call'
into the callee.
Apparently not, though. Ever since the kjump code was first added in 2009, it
has set up a *new* stack at the top of the swap_page scratch page, then just
performed the 'call' without allocating any space for the re-entry address to
be returned. It then reads the re-entry point for next time from 0(%rsp) which
is actually the first qword of the page *after* the swap page, which might not
exist at all! And if the callee has written to that, then it will have
corrupted memory it doesn't own.
Correct this by pushing the entry point of the callee onto the stack before
calling it. The callee may then adjust it, or not, as it sees fit, and
subsequent invocations should work correctly either way.
Remove a stray push of zero to the *relocate_kernel* stack, which may have
been intended for this purpose, but which was actually just noise.
Also, loading the stack for the callee relied on the address of the swap page
being in %r10 without ever documenting that fact. Recent code changes made
that no longer true, so load it directly from the local kexec_pa_swap_page
variable instead.
Fixes: b3adabae8a ("x86/kexec: Drop page_list argument from relocate_kernel()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-5-dwmw2@infradead.org
The swap_pages function expects the swap page to be in %r10, but there
was no documentation to that effect. Once upon a time the setup code
used to load its value from a kernel virtual address and save it to an
address which is accessible in the identity-mapped page tables, and
*happened* to use %r10 to do so, with no comment that it was left there
on *purpose* instead of just being a scratch register. Once that was no
longer necessary, %r10 just holds whatever the kernel happened to leave
in it.
Now that the original value passed by the kernel is accessible via
%rip-relative addressing, load directly from there instead of using %r10
for it. But document the other parameters that the swap_pages function
*does* expect in registers.
Fixes: b3adabae8a ("x86/kexec: Drop page_list argument from relocate_kernel()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-4-dwmw2@infradead.org
The swap_pages() function will only actually *swap*, as its name implies, if
the preserve_context flag in the %r11 register is non-zero. On the way back
from a ::preserve_context kexec, ensure that the %r11 register is non-zero so
that the pages get swapped back.
Fixes: 9e5683e2d0 ("x86/kexec: Only swap pages for ::preserve_context mode")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-3-dwmw2@infradead.org
The kernel switches to a new set of page tables during kexec. The global
mappings (_PAGE_GLOBAL==1) can remain in the TLB after this switch. This
is generally not a problem because the new page tables use a different
portion of the virtual address space than the normal kernel mappings.
The critical exception to that generalisation (and the only mapping
which isn't an identity mapping) is the kexec control page itself —
which was ROX in the original kernel mapping, but should be RWX in the
new page tables. If there is a global TLB entry for that in its prior
read-only state, it definitely needs to be flushed before attempting to
write through that virtual mapping.
It would be possible to just avoid writing to the virtual address of the
page and defer all writes until they can be done through the identity
mapping. But there's no good reason to keep the old TLB entries around,
as they can cause nothing but trouble.
Clear the PGE bit in %cr4 early, before storing data in the control page.
Fixes: 5a82223e07 ("x86/kexec: Mark relocate_kernel page as ROX instead of RWX")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219592
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: "Ning, Hongyu" <hongyu.ning@linux.intel.com>
Co-developed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: "Ning, Hongyu" <hongyu.ning@linux.intel.com>
Link: https://lore.kernel.org/r/20250109140757.2841269-2-dwmw2@infradead.org
The early_ioremap interface can fail and return NULL in certain cases. To
prevent NULL-pointer dereference crashes, fixed issues in the acpi_extlog
and copy_early_mem interfaces, improving robustness when handling early
memory.
Link: https://lkml.kernel.org/r/20241212101004.1544070-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages
will be freed by semi RCU, that is:
- batch table freeing: asynchronous free by RCU
- single table freeing: IPI + synchronous free
In this way, the page table can be lockless traversed by disabling IRQ in
paths such as fast GUP. But this is not enough to free the empty PTE page
table pages in paths other that munmap and exit_mmap path, because IPI
cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().
In preparation for supporting empty PTE page table pages reclaimation, let
single table also be freed by RCU like batch table freeing. Then we can
also use pte_offset_map() etc to prevent PTE page from being freed.
Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free
the page table pages:
- The pt_rcu_head is unioned with pt_list and pmd_huge_pte.
- For pt_list, it is used to manage the PGD page in x86. Fortunately
tlb_remove_table() will not be used for free PGD pages, so it is safe
to use pt_rcu_head.
- For pmd_huge_pte, it is used for THPs, so it is safe.
After applying this patch, if CONFIG_PT_RECLAIM is enabled, the function
call of free_pte() is as follows:
free_pte
pte_free_tlb
__pte_free_tlb
___pte_free_tlb
paravirt_tlb_remove_table
tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
[no-free-memory slowpath:]
tlb_table_invalidate
tlb_remove_table_one
__tlb_remove_table_one [frees via RCU]
[fastpath:]
tlb_table_flush
tlb_remove_table_free [frees via RCU]
native_tlb_remove_table [CONFIG_PARAVIRT on native]
tlb_remove_table [see above]
Link: https://lkml.kernel.org/r/0287d442a973150b0e1019cc406e6322d148277a.1733305182.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hpet_rtc_dropped_irq() has been unused since
commit f52ef24be2 ("rtc/alpha: remove legacy rtc driver")
Remove it in rtc, and x86 hpet code.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241215022356.181625-1-linux@treblig.org
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
x86_sched_itmt_flags() returns SD_ASYM_PACKING if ITMT support is
enabled by the system. Without ITMT support being enabled, it returns 0
similar to current x86_die_flags() on non-Hybrid systems
(!X86_HYBRID_CPU and !X86_FEATURE_AMD_HETEROGENEOUS_CORES)
On Intel systems that enable ITMT support, either the MC domain
coincides with the PKG domain, or in case of multiple MC groups
within a PKG domain, either Sub-NUMA Cluster (SNC) is enabled or the
processor features Hybrid core layout (X86_HYBRID_CPU) which leads to
three distinct possibilities:
o If PKG and MC domains coincide, PKG domain is degenerated by
sd_parent_degenerate() when building sched domain topology.
o If SNC is enabled, PKG domain is never added since
"x86_has_numa_in_package" is set and the topology will instead contain
NODE and NUMA domains.
o On X86_HYBRID_CPU which contains multiple MC groups within the PKG,
the PKG domain requires x86_sched_itmt_flags().
Thus, on Intel systems that contains multiple MC groups within the PKG
and enables ITMT support, the PKG domain requires
x86_sched_itmt_flags(). In all other cases PKG domain is either never
added or is degenerated. Thus, returning x86_sched_itmt_flags()
unconditionally at PKG domain on Intel systems should not lead to any
functional changes.
On AMD systems with multiple LLCs (MC groups) within a PKG domain,
enabling ITMT support requires setting SD_ASYM_PACKING to the PKG domain
since the core rankings are assigned PKG-wide.
Core rankings on AMD processors is currently set by the amd-pstate
driver when Preferred Core feature is supported. A subset of systems that
support Preferred Core feature can be detected using
X86_FEATURE_AMD_HETEROGENEOUS_CORES however, this does not cover all the
systems that support Preferred Core ranking.
Detecting Preferred Core support on AMD systems requires inspecting CPPC
Highest Perf on all present CPUs and checking if it differs on at least
one CPU. Previous suggestion to use a synthetic feature to detect
Preferred Core support [1] was found to be non-trivial to implement
since BSP alone cannot detect if Preferred Core is supported and by the
time AP comes up, alternatives are patched and setting a X86_FEATURE_*
then is not possible.
Since x86 processors enabling ITMT support that consists multiple
non-NUMA MC groups within a PKG requires SD_ASYM_PACKING flag set at the
PKG domain, return x86_sched_itmt_flags unconditionally for the PKG
domain.
Since x86_die_flags() would have just returned x86_sched_itmt_flags()
after the change, remove the unnecessary wrapper and pass
x86_sched_itmt_flags() directly as the flags function.
Fixes: f3a0523918 ("cpufreq: amd-pstate: Enable amd-pstate preferred core support")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-6-kprateek.nayak@amd.com
x86_*_flags() wrappers were introduced with commit d3d37d850d
("x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU") to add
x86_sched_itmt_flags() in addition to the default domain flags for SMT
and MC domain.
commit 995998ebde ("x86/sched: Remove SD_ASYM_PACKING from the
SMT domain flags") removed the ITMT flags for SMT domain but not the
x86_smt_flags() wrappers which directly returns cpu_smt_flags().
Remove x86_smt_flags() and directly use cpu_smt_flags() to derive the
flags for SMT domain. No functional changes intended.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-5-kprateek.nayak@amd.com
"sched_itmt_enabled" was only introduced as a debug toggle for any funky
ITMT behavior. Move the sysctl controlled from
"/proc/sys/kernel/sched_itmt_enabled" to debugfs at
"/sys/kernel/debug/x86/sched_itmt_enabled" with a notable change that a
cat on the file will return "Y" or "N" instead of "1" or "0" to
indicate that feature is enabled or disabled respectively. Either "0" or
"N" (or any string that kstrtobool() interprets as false) can be written
to the file will disable the feature, and writing either "1" or "Y" (or
any string that kstrtobool() interprets as true) will enable it back
when the platform supports ITMT ranking.
Since ITMT is x86 specific (and PowerPC uses SD_ASYM_PACKING too), the
toggle was moved to "/sys/kernel/debug/x86/" as opposed to
"/sys/kernel/debug/sched/"
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-4-kprateek.nayak@amd.com
Use guard() for itmt_update_mutex which avoids the extra mutex_unlock()
in the bailout and return paths.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-3-kprateek.nayak@amd.com
In preparation to move "sysctl_sched_itmt_enabled" to debugfs, convert
the unsigned int to bool since debugfs readily exposes boolean fops
primitives (debugfs_read_file_bool, debugfs_write_file_bool) which can
streamline the conversion.
Since the current ctl_table initializes extra1 and extra2 to SYSCTL_ZERO
and SYSCTL_ONE respectively, the value of "sysctl_sched_itmt_enabled"
can only be 0 or 1 and this datatype conversion should not cause any
functional changes.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-2-kprateek.nayak@amd.com
Since task->comm is guaranteed to be NUL-terminated, we can print it
directly without the need to copy it into a separate buffer. This
simplifies the code and avoids unnecessary operations.
Link: https://lkml.kernel.org/r/20241219023452.69907-3-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "André Almeida" <andrealmeid@igalia.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Danilo Krummrich <dakr@redhat.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Switch to using hvhdk.h everywhere in the kernel. This header
includes all the new Hyper-V headers in include/hyperv, which form a
superset of the definitions found in hyperv-tlfs.h.
This makes it easier to add new Hyper-V interfaces without being
restricted to those in the TLFS doc (reflected in hyperv-tlfs.h).
To be more consistent with the original Hyper-V code, the names of
some definitions are changed slightly. Update those where needed.
Update comments in mshyperv.h files to point to include/hyperv for
adding new definitions.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://lore.kernel.org/r/1732577084-2122-5-git-send-email-nunodasneves@linux.microsoft.com
Link: https://lore.kernel.org/r/20250108222138.1623703-3-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Use the GUEST_TSC_FREQ MSR to discover the TSC frequency instead of
relying on kvm-clock based frequency calibration. Override both CPU and
TSC frequency calibration callbacks with securetsc_get_tsc_khz(). Since
the difference between CPU base and TSC frequency does not apply in this
case, the same callback is being used.
[ bp: Carve out from
https://lore.kernel.org/r/20250106124633.1418972-11-nikunj@amd.com ]
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250106124633.1418972-11-nikunj@amd.com
There are more than one SMN index/data pair available for software use.
The register offsets are different, but the protocol is the same.
Use defines for the SMN offset values and allow the index/data offsets
to be passed to the read/write helper function.
This eases code reuse with other SMN users in the kernel.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-14-yazen.ghannam@amd.com
Use guard(mutex) and convert PCI error codes to common ones.
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-12-yazen.ghannam@amd.com
SMN access was bolted into amd_nb mostly as convenience. This has
limitations though that require incurring tech debt to keep it working.
Move SMN access to the newly introduced AMD Node driver.
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> # pdx86
Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> # PMF, PMC
Link: https://lore.kernel.org/r/20241206161210.163701-11-yazen.ghannam@amd.com
Currently, the total AMD node count is determined by searching and counting
CPU/node devices using PCI IDs.
However, AMD node information is already available through topology
CPUID/MSRs. The recent topology rework has made this info easier to access.
Replace the node counting code with a simple product of topology info.
Every node/northbridge is expected to have a 'misc' device. Clear everything
out if a 'misc' device isn't found on a node.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250107222847.3300430-7-yazen.ghannam@amd.com
The "root" device search was introduced to support SMN access for Zen
systems. This device represents a PCIe root complex. It is not the
same as the "CPU/node" devices found at slots 0x18-0x1F.
There may be multiple PCIe root complexes within an AMD node. Such is
the case with server or High-end Desktop (HEDT) systems, etc. Therefore
it is not enough to assume "root <-> AMD node" is a 1-to-1 association.
Currently, this is handled by skipping "extra" root complexes during the
search. However, the hardware provides the PCI bus number of an AMD
node's root device.
Use the hardware info to get the root device's bus and drop the extra
search code and PCI IDs.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-7-yazen.ghannam@amd.com
Use the newly added helper function to look up a CPU/Node function to
find "function 4" devices.
Thus, avoid the need to regularly add new PCI IDs for basic discovery.
The unique PCI IDs are still useful in case of quirks or functional
changes. And they should be used only in such a manner.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-6-yazen.ghannam@amd.com
The "AMD Node" concept spans many families of systems and applies to
a number of subsystems and drivers.
Currently, the AMD Northbridge code is overloaded with AMD node
functionality. However, the node concept is broader than just
northbridges.
Start files to host common AMD node functions and definitions. Include
a helper to find an AMD node device function based on the convention
described in AMD documentation.
Anything that needs node functionality should include this rather than
amd_nb.h. The AMD_NB code will be reduced to only northbridge-specific
code needed for legacy systems.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-5-yazen.ghannam@amd.com
The check for early_is_amd_nb() is only useful for systems with GART or
the NB_CFG register.
Zen-based systems (both AMD and Hygon) have neither, so return early for
them.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-4-yazen.ghannam@amd.com
The code implicitly operates on AMD-based systems by matching on PCI
IDs. However, the use of these IDs is going away.
Add an explicit CPU vendor check instead of relying on PCI IDs.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-3-yazen.ghannam@amd.com
The x86 shadow stack support has its own set of registers. Those registers
are XSAVE-managed, but they are "supervisor state components" which means
that userspace can not touch them with XSAVE/XRSTOR. It also means that
they are not accessible from the existing ptrace ABI for XSAVE state.
Thus, there is a new ptrace get/set interface for it.
The regset code that ptrace uses provides an ->active() handler in
addition to the get/set ones. For shadow stack this ->active() handler
verifies that shadow stack is enabled via the ARCH_SHSTK_SHSTK bit in the
thread struct. The ->active() handler is checked from some call sites of
the regset get/set handlers, but not the ptrace ones. This was not
understood when shadow stack support was put in place.
As a result, both the set/get handlers can be called with
XFEATURE_CET_USER in its init state, which would cause get_xsave_addr() to
return NULL and trigger a WARN_ON(). The ssp_set() handler luckily has an
ssp_active() check to avoid surprising the kernel with shadow stack
behavior when the kernel is not ready for it (ARCH_SHSTK_SHSTK==0). That
check just happened to avoid the warning.
But the ->get() side wasn't so lucky. It can be called with shadow stacks
disabled, triggering the warning in practice, as reported by Christina
Schimpe:
WARNING: CPU: 5 PID: 1773 at arch/x86/kernel/fpu/regset.c:198 ssp_get+0x89/0xa0
[...]
Call Trace:
<TASK>
? show_regs+0x6e/0x80
? ssp_get+0x89/0xa0
? __warn+0x91/0x150
? ssp_get+0x89/0xa0
? report_bug+0x19d/0x1b0
? handle_bug+0x46/0x80
? exc_invalid_op+0x1d/0x80
? asm_exc_invalid_op+0x1f/0x30
? __pfx_ssp_get+0x10/0x10
? ssp_get+0x89/0xa0
? ssp_get+0x52/0xa0
__regset_get+0xad/0xf0
copy_regset_to_user+0x52/0xc0
ptrace_regset+0x119/0x140
ptrace_request+0x13c/0x850
? wait_task_inactive+0x142/0x1d0
? do_syscall_64+0x6d/0x90
arch_ptrace+0x102/0x300
[...]
Ensure that shadow stacks are active in a thread before looking them up
in the XSAVE buffer. Since ARCH_SHSTK_SHSTK and user_ssp[SHSTK_EN] are
set at the same time, the active check ensures that there will be
something to find in the XSAVE buffer.
[ dhansen: changelog/subject tweaks ]
Fixes: 2fab02b25a ("x86: Add PTRACE interface for shadow stack")
Reported-by: Christina Schimpe <christina.schimpe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Christina Schimpe <christina.schimpe@intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250107233056.235536-1-rick.p.edgecombe%40intel.com
Legacy AMD systems include an integrated Northbridge that is represented
by MCA bank 4. This is the only non-core MCA bank in legacy systems. The
Northbridge is physically shared by all the CPUs within an AMD "Node".
However, in practice the "shared" MCA bank can only by managed by a
single CPU within that AMD Node. This is known as the "Node Base Core"
(NBC). For example, only the NBC will be able to read the MCA bank 4
registers; they will be Read-as-Zero for other CPUs. Also, the MCA
Thresholding interrupt will only signal the NBC; the other CPUs will not
receive it. This is enforced by hardware, and it should not be managed by
software.
The current AMD Thresholding code attempts to deal with the "shared" MCA
bank by micromanaging the bank's sysfs kobjects. However, this does not
follow the intended kobject use cases. It is also fragile, and it has
caused bugs in the past.
Modern AMD systems do not need this shared MCA bank support, and it
should not be needed on legacy systems either.
Remove the shared threshold bank code. Also, move the threshold struct
definitions to mce/amd.c, since they are no longer needed in amd_nb.c.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-2-yazen.ghannam@amd.com
__static_call_update_early() has a check for early_boot_irqs_disabled, but
is used before early_boot_irqs_disabled is set up in start_kernel().
Xen PV has always special cased early_boot_irqs_disabled, but Xen PVH does
not and falls over the BUG when booting as dom0.
It is very suspect that early_boot_irqs_disabled starts as 0, becomes 1 for
a time, then becomes 0 again, but as this needs backporting to fix a
breakage in a security fix, dropping the BUG_ON() is the far safer option.
Fixes: 0ef8047b73 ("x86/static-call: provide a way to do very early static-call updates")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219620
Reported-by: Alex Zenla <alex@edera.dev>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Alex Zenla <alex@edera.dev>
Link: https://lore.kernel.org/r/20241221211046.6475-1-andrew.cooper3@citrix.com
The result of that function is in essence boolean, so simplify to return the
result of the relevant expression. It also makes it follow the convention used
by __verify_patch_section().
No functional changes.
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241018155151.702350-3-nik.borisov@suse.com
Instead of open-coding the check for size/data move it inside the
function and make it return a boolean indicating whether data was found
or not.
No functional changes.
[ bp: Write @ret in find_blobs_in_containers() only on success. ]
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241018155151.702350-2-nik.borisov@suse.com
Convert family/model mixed checks to VFM-based checks to make the code
more compact. Simplify.
[ bp: Drop the "what" from the commit message - it should be visible from
the diff alone. ]
Suggested-by: Sohil Mehta <sohil.mehta@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20241212140103.66964-6-qiuxu.zhuo@intel.com
Make those functions whose callers only care about success or failure return
a boolean value for better readability. Also, update the call sites
accordingly as the polarities of all the return values have been flipped.
No functional changes.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20241212140103.66964-4-qiuxu.zhuo@intel.com
The 'storm' variable points to this_cpu_ptr(&storm_desc). Access the
'stormy_bank_count' field through the 'storm' to avoid calling
this_cpu_*() on the same per-CPU variable twice.
This minor optimization reduces the text size by 16 bytes.
$ size threshold.o.*
text data bss dec hex filename
1395 1664 0 3059 bf3 threshold.o.old
1379 1664 0 3043 be3 threshold.o.new
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20241212140103.66964-3-qiuxu.zhuo@intel.com
Make several functions that return 0 or 1 return a boolean value for
better readability.
No functional changes are intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20241212140103.66964-2-qiuxu.zhuo@intel.com
If the machine has:
CPUID Fn8000_0021_EAX[30] (SRSO_USER_KERNEL_NO) -- If this bit is 1,
it indicates the CPU is not subject to the SRSO vulnerability across
user/kernel boundaries.
have it fall back to IBPB on VMEXIT only, in the case it is going to run
VMs:
Speculative Return Stack Overflow: Mitigation: IBPB on VMEXIT only
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20241202120416.6054-2-bp@kernel.org
An indirect branch instruction sets the CPU indirect branch tracker
(IBT) into WAIT_FOR_ENDBRANCH (WFE) state and WFE stays asserted
across the instruction boundary. When the decoder finds an
inappropriate instruction while WFE is set ENDBR, the CPU raises a #CP
fault.
For the "kernel IBT no ENDBR" selftest where #CPs are deliberately
triggered, the WFE state of the interrupted context needs to be
cleared to let execution continue. Otherwise when the CPU resumes
from the instruction that just caused the previous #CP, another
missing-ENDBRANCH #CP is raised and the CPU enters a dead loop.
This is not a problem with IDT because it doesn't preserve WFE and
IRET doesn't set WFE. But FRED provides space on the entry stack
(in an expanded CS area) to save and restore the WFE state, thus the
WFE state is no longer clobbered, so software must clear it.
Clear WFE to avoid dead looping in ibt_clear_fred_wfe() and the
!ibt_fatal code path when execution is allowed to continue.
Clobbering WFE in any other circumstance is a security-relevant bug.
[ dhansen: changelog rewording ]
Fixes: a5f6c2ace9 ("x86/shstk: Add user control-protection fault handler")
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241113175934.3897541-1-xin%40zytor.com
Pass ftrace_regs to the fgraph_ops::entryfunc(). If ftrace_regs is not
available, it passes a NULL instead. User callback function can access
some registers (including return address) via this ftrace_regs.
Note that the ftrace_regs can be NULL when the arch does NOT define:
HAVE_DYNAMIC_FTRACE_WITH_ARGS or HAVE_DYNAMIC_FTRACE_WITH_REGS.
More specifically, if HAVE_DYNAMIC_FTRACE_WITH_REGS is defined but
not the HAVE_DYNAMIC_FTRACE_WITH_ARGS, and the ftrace ops used to
register the function callback does not set FTRACE_OPS_FL_SAVE_REGS.
In this case, ftrace_regs can be NULL in user callback.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf <bpf@vger.kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/173518990044.391279.17406984900626078579.stgit@devnote2
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Get the ftrace recursion lock in the generic function_graph_enter()
instead of each architecture code.
This changes all function_graph tracer callbacks running in
non-preemptive state. On x86 and powerpc, this is by default, but
on the other architecutres, this will be new.
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf <bpf@vger.kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/173379653720.973433.18438622234884980494.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmdhxFsTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXmGuB/0ZsXKBh9eCZnmVSZwdVkcSeE/HEHHc
K5qeIxjqirAzajvLbAtgnTQpi5w0kLYYOT/X/8b6L1nnefcfk1cz5cbyQKxXx7J6
Zuob5KpjX7FtJbMc7QQtzrLyApw2k2OYe1QPCRxuWkYzQjaus/6kM27ivjjYFqDs
Qv6IKCcebomAbJTN8IwF38KsFQ2JS3moWr+kcVyna7+Kg1ymEto2QlnI01Z+M1n7
BUYvLgYUD/5HzbNjqC0HysUuX5PsVQ45OsmTgjkvs/XzLcAGGEHJGrJHJkEccx7H
AsiNzHoB6wUrFhCAX7IPCSE0g9vjtwI21ozqYllzOTa/Q5KnUE+q9fhJ
=Fw3U
-----END PGP SIGNATURE-----
Merge tag 'hyperv-fixes-signed-20241217' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Various fixes to Hyper-V tools in the kernel tree (Dexuan Cui, Olaf
Hering, Vitaly Kuznetsov)
- Fix a bug in the Hyper-V TSC page based sched_clock() (Naman Jain)
- Two bug fixes in the Hyper-V utility functions (Michael Kelley)
- Convert open-coded timeouts to secs_to_jiffies() in Hyper-V drivers
(Easwar Hariharan)
* tag 'hyperv-fixes-signed-20241217' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
tools/hv: reduce resource usage in hv_kvp_daemon
tools/hv: add a .gitignore file
tools/hv: reduce resouce usage in hv_get_dns_info helper
hv/hv_kvp_daemon: Pass NIC name to hv_get_dns_info as well
Drivers: hv: util: Avoid accessing a ringbuffer not initialized yet
Drivers: hv: util: Don't force error code to ENODEV in util_probe()
tools/hv: terminate fcopy daemon if read from uio fails
drivers: hv: Convert open-coded timeouts to secs_to_jiffies()
tools: hv: change permissions of NetworkManager configuration file
x86/hyperv: Fix hv tsc page based sched_clock for hibernation
tools: hv: Fix a complier warning in the fcopy uio daemon
The CPUID level dependency table will entirely zap X86_FEATURE_XSAVE
if the CPUID level is too low. This code is unreachable. Kill it.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>
Link: https://lore.kernel.org/all/20241213205038.6E71F9A4%40davehans-spike.ostc.intel.com
Move the XSAVE-related CPUID leaf definitions to common code. Then,
use the new definition to remove the last magic number from the CPUID
level dependency table.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205037.43C57CDE%40davehans-spike.ostc.intel.com
The DCA leaf number is also hard-coded in the CPUID level dependency
table. Move its definition to common code and use it.
While at it, fix up the naming and types in the probe code. All
CPUID data is provided in 32-bit registers, not 'unsigned long'.
Also stop referring to "level_9". Move away from test_bit()
because the type is no longer an 'unsigned long'.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205032.476A30FE%40davehans-spike.ostc.intel.com
The CPUID leaf dependency checker will remove X86_FEATURE_MWAIT if
the CPUID level is below the required level (CPUID_MWAIT_LEAF).
Thus, if you check X86_FEATURE_MWAIT you do not need to also
check the CPUID level.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213205030.9B42B458%40davehans-spike.ostc.intel.com
The AMD erratum 1386 detection code uses and old style 'x86_cpu_desc'
table. Replace it with 'x86_cpu_id' so the old style can be removed.
I did not create a new helper macro here. The new table is certainly
more noisy than the old and it can be improved on. But I was hesitant
to create a new macro just for a single site that is only two ugly
lines in the end.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185132.07555E1D%40davehans-spike.ostc.intel.com
The x86_match_cpu() infrastructure can match CPU steppings. Since
there are only 16 possible steppings, the matching infrastructure goes
all out and stores the stepping match as a bitmap. That means it can
match any possible steppings in a single list entry. Fun.
But it exposes this bitmap to each of the X86_MATCH_*() helpers when
none of them really need a bitmap. It makes up for this by exporting a
helper (X86_STEPPINGS()) which converts a contiguous stepping range
into the bitmap which every single user leverages.
Instead of a bitmap, have the main helper for this sort of thing
(X86_MATCH_VFM_STEPS()) just take a stepping range. This ends up
actually being even more compact than before.
Leave the helper in place (renamed to __X86_STEPPINGS()) to make it
more clear what is going on instead of just having a random GENMASK()
in the middle of an already complicated macro.
One oddity that I hit was this macro:
X86_MATCH_VFM_STEPS(vfm, X86_STEPPING_MIN, max_stepping, issues)
It *could* have been converted over to take a min/max stepping value
for each entry. But that would have been a bit too verbose and would
prevent the one oddball in the list (INTEL_COMETLAKE_L stepping 0)
from sticking out.
Instead, just have it take a *maximum* stepping and imply that the match
is from 0=>max_stepping. This is functional for all the cases now and
also retains the nice property of having INTEL_COMETLAKE_L stepping 0
stick out like a sore thumb.
skx_cpuids[] is goofy. It uses the stepping match but encodes all
possible steppings. Just use a normal, non-stepping match helper.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185129.65527B2A%40davehans-spike.ostc.intel.com
The 'x86_cpu_id' and 'x86_cpu_desc' structures are very similar and
need to be consolidated. There is a microcode version matching
function for 'x86_cpu_desc' but not 'x86_cpu_id'.
Create one for 'x86_cpu_id'.
This essentially just leverages the x86_cpu_id->driver_data field
to replace the less generic x86_cpu_desc->x86_microcode_rev field.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185128.8F24EEFC%40davehans-spike.ostc.intel.com
The hypercall page is no longer needed. It can be removed, as from the
Xen perspective it is optional.
But, from Linux's perspective, it removes naked RET instructions that
escape the speculative protections that Call Depth Tracking and/or
Untrain Ret are trying to achieve.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Limit usage of the non-architectural RMP format to Zen3/Zen4 processors.
The RMPREAD instruction, with architectural defined output, is available
and should be used for RMP access beyond Zen4.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Reviewed-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/5be0093e091778a151266ea853352f62f838eb99.1733172653.git.thomas.lendacky@amd.com
Add static_call_update_early() for updating static-call targets in
very early boot.
This will be needed for support of Xen guest type specific hypercall
functions.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
In order to be able to differentiate between AMD and Intel based
systems for very early hypercalls without having to rely on the Xen
hypercall page, make get_cpu_vendor() non-static.
Refactor early_cpu_init() for the same reason by splitting out the
loop initializing cpu_devs() into an externally callable function.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
The "mba_MBps" mount option provides an alternate method to control memory
bandwidth. Instead of specifying allowable bandwidth as a percentage of
maximum possible, the user provides a MiB/s limit value.
There is a file in each CTRL_MON group directory that shows the event
currently in use.
Allow writing that file to choose a different event.
A user can choose any of the memory bandwidth monitoring events listed in
/sys/fs/resctrl/info/L3_mon/mon_features independently for each CTRL_MON group
by writing to each of the "mba_MBps_event" files.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20241206163148.83828-8-tony.luck@intel.com
The "mba_MBps" mount option provides an alternate method to control memory
bandwidth. Instead of specifying allowable bandwidth as a percentage of
maximum possible, the user provides a MiB/s limit value.
In preparation to allow the user to pick the memory bandwidth monitoring event
used as input to the feedback loop, provide a file in each CTRL_MON group
directory that shows the event currently in use. Note that this file is only
visible when the "mba_MBps" mount option is in use.
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20241206163148.83828-7-tony.luck@intel.com
These two commits interact:
upstream: 73da582a47 ("x86/cpu/topology: Remove limit of CPUs due to disabled IO/APIC")
x86/cleanups: 13148e22c1 ("x86/apic: Remove "disablelapic" cmdline option")
Resolve it.
Conflicts:
arch/x86/kernel/cpu/topology.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The convention is "no<something>" and there already is "nolapic". Drop
the disable one.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20241202190011.11979-2-bp@kernel.org
Documentation/arch/x86/x86_64/boot-options.rst is causing unnecessary
confusion by being a second place where one can put x86 boot options.
Move them into the main one.
Drop removed ones like "acpi=ht", while at it.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20241202190011.11979-1-bp@kernel.org
The default input measurement to the mba_sc feedback loop for memory bandwidth
control when the user mounts with the "mba_MBps" option is the local bandwidth
event. But some systems may not support a local bandwidth event.
When local bandwidth event is not supported, check for support of total
bandwidth and use that instead.
Relax the mount option check to allow use of the "mba_MBps" option for systems
when only total bandwidth monitoring is supported. Also update the error
message.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20241206163148.83828-6-tony.luck@intel.com
Switching between local and total memory bandwidth events as the input
to the mba_sc feedback loop would be cumbersome and take effect slowly
in the current implementation as the bandwidth is only known after two
consecutive readings of the same event.
Compute the bandwidth for all supported events. This doesn't add
significant overhead and will make changing which event is used
simple.
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20241206163148.83828-5-tony.luck@intel.com
In __startup_64(), the bool 'la57' can only assume the 'true' value if
CONFIG_X86_5LEVEL is enabled in the build, and generally, the compiler
can make this inference at build time, and elide any references to the
symbol 'level4_kernel_pgt', which may be undefined if 'la57' is false.
As it turns out, GCC 12 gets this wrong sometimes, and gives up with a
build error:
ld: arch/x86/kernel/head64.o: in function `__startup_64':
head64.c:(.head.text+0xbd): undefined reference to `level4_kernel_pgt'
even though the reference is in unreachable code. Fix this by
duplicating the IS_ENABLED(CONFIG_X86_5LEVEL) in the conditional that
tests the value of 'la57'.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241209094105.762857-2-ardb+git@google.com
Closes: https://lore.kernel.org/oe-kbuild-all/202412060403.efD8Kgb7-lkp@intel.com/
update_mba_bw() hard codes use of the memory bandwidth local event which
prevents more flexible options from being deployed.
Change this function to use the event specified in the rdtgroup that is
being processed.
Mount time checks for the "mba_MBps" option ensure that local memory
bandwidth is enabled. So drop the redundant is_mbm_local_enabled() check.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20241206163148.83828-4-tony.luck@intel.com
Resctrl uses local memory bandwidth event as input to the feedback loop when
the mba_MBps mount option is used. This means that this mount option cannot be
used on systems that only support monitoring of total bandwidth.
Prepare to allow users to choose the input event independently for each
CTRL_MON group by adding a global variable "mba_mbps_default_event" used to
set the default event for each CTRL_MON group, and a new field
"mba_mbps_event" in struct rdtgroup to track which event is used for each
CTRL_MON group.
Notes:
1) Both of these are only used when the user mounts the filesystem with the
"mba_MBps" option.
2) Only check for support of local bandwidth event when initializing
mba_mbps_default_event. Support for total bandwidth event can be added
after other routines in resctrl have been updated to handle total bandwidth
event.
[ bp: Move mba_mbps_default_event extern into the arch header. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20241206163148.83828-3-tony.luck@intel.com
thread_throttle_mode_init() and mbm_config_rftype_init() both initialize
fflags for resctrl files.
Adding new files will involve adding another function to initialize
the fflags. This can be simplified by adding a new function
resctrl_file_fflags_init() and passing the file name and flags
to be initialized.
Consolidate fflags initialization into resctrl_file_fflags_init() and
remove thread_throttle_mode_init() and mbm_config_rftype_init().
[ Tony: Drop __init attribute so resctrl_file_fflags_init() can be used at
run time. ]
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20241206163148.83828-2-tony.luck@intel.com
Use the proper API instead of open coding it.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240807160228.26206-3-frederic@kernel.org
read_hv_sched_clock_tsc() assumes that the Hyper-V clock counter is
bigger than the variable hv_sched_clock_offset, which is cached during
early boot, but depending on the timing this assumption may be false
when a hibernated VM starts again (the clock counter starts from 0
again) and is resuming back (Note: hv_init_tsc_clocksource() is not
called during hibernation/resume); consequently,
read_hv_sched_clock_tsc() may return a negative integer (which is
interpreted as a huge positive integer since the return type is u64)
and new kernel messages are prefixed with huge timestamps before
read_hv_sched_clock_tsc() grows big enough (which typically takes
several seconds).
Fix the issue by saving the Hyper-V clock counter just before the
suspend, and using it to correct the hv_sched_clock_offset in
resume. This makes hv tsc page based sched_clock continuous and ensures
that post resume, it starts from where it left off during suspend.
Override x86_platform.save_sched_clock_state and
x86_platform.restore_sched_clock_state routines to correct this as soon
as possible.
Note: if Invariant TSC is available, the issue doesn't happen because
1) we don't register read_hv_sched_clock_tsc() for sched clock:
See commit e5313f1c54 ("clocksource/drivers/hyper-v: Rework
clocksource and sched clock setup");
2) the common x86 code adjusts TSC similarly: see
__restore_processor_state() -> tsc_verify_tsc_adjust(true) and
x86_platform.restore_sched_clock_state().
Cc: stable@vger.kernel.org
Fixes: 1349401ff1 ("clocksource/drivers/hyper-v: Suspend/resume Hyper-V clocksource for hibernation")
Co-developed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20240917053917.76787-1-namjain@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20240917053917.76787-1-namjain@linux.microsoft.com>
Build 6.13-rc12 for x86_64 with gcc 14.2.1 fails with the error:
ld: vmlinux.o: in function `virtual_mapped':
linux/arch/x86/kernel/relocate_kernel_64.S:249:(.text+0x5915b): undefined reference to `saved_context_gdt_desc'
when CONFIG_KEXEC_JUMP is enabled.
This was introduced by commit 07fa619f2a ("x86/kexec: Restore GDT on
return from ::preserve_context kexec") which introduced a use of
saved_context_gdt_desc without a declaration for it.
Fix that by including asm/asm-offsets.h where saved_context_gdt_desc
is defined (indirectly in include/generated/asm-offsets.h which
asm/asm-offsets.h includes).
Fixes: 07fa619f2a ("x86/kexec: Restore GDT on return from ::preserve_context kexec")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Closes: https://lore.kernel.org/oe-kbuild-all/202411270006.ZyyzpYf8-lkp@intel.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename the helper to better reflect its function.
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20241202073139.448208-1-kirill.shutemov@linux.intel.com
When ensuring EFER.AUTOIBRS is set, WARN only on a negative return code
from msr_set_bit(), as '1' is used to indicate the WRMSR was successful
('0' indicates the MSR bit was already set).
Fixes: 8cc68c9c9e ("x86/CPU/AMD: Make sure EFER[AIBRSE] is set")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/Z1MkNofJjt7Oq0G6@google.com
Closes: https://lore.kernel.org/all/20241205220604.GA2054199@thelio-3990X
Linux remembers cpu_cachinfo::num_leaves per CPU, but x86 initializes all
CPUs from the same global "num_cache_leaves".
This is erroneous on systems such as Meteor Lake, where each CPU has a
distinct num_leaves value. Delete the global "num_cache_leaves" and
initialize num_leaves on each CPU.
init_cache_level() no longer needs to set num_leaves. Also, it never had to
set num_levels as it is unnecessary in x86. Keep checking for zero cache
leaves. Such condition indicates a bug.
[ bp: Cleanup. ]
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org # 6.3+
Link: https://lore.kernel.org/r/20241128002247.26726-3-ricardo.neri-calderon@linux.intel.com
The sysfs core now allows instances of 'struct bin_attribute' to be
moved into read-only memory. Make use of that to protect them against
accidental or malicious modifications.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241202-sysfs-const-bin_attr-x86-v1-1-b767d5f0ac5c@weissschuh.net
The pv_ops::cpu.wbinvd paravirt callback is a leftover of lguest times.
Today it is no longer needed, as all users use the native WBINVD
implementation.
Remove the callback and rename native_wbinvd() to wbinvd().
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241203071550.26487-1-jgross@suse.com
Linux defined feature bits X86_FEATURE_P3 and X86_FEATURE_P4 are not
used anywhere. Commit f31d731e44 ("x86: use X86_FEATURE_NOPL in
alternatives") got rid of the last usage in 2008. Remove the related
mappings and code.
Just like all X86_FEATURE bits, the raw bit numbers can be exposed to
userspace via MODULE_DEVICE_TABLE(). There is a very small theoretical
chance of userspace getting confused if these bits got reassigned and
changed logical meaning. But these bits were never used for a device
table, so it's highly unlikely this will ever happen in practice.
[ dhansen: clarify userspace visibility of these bits ]
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/20241107233000.2742619-1-sohil.mehta%40intel.com
All writes to the page now happen before it gets marked as executable
(or after it's already switched to the identmap page tables where it's
OK to be RWX).
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20241205153343.3275139-14-dwmw2@infradead.org
The memory encryption flag is passed in %r8 because that's where the
calling convention puts it. Instead of moving it to %r12 and then using
%r8 for other things, just leave it in %r8 and use other registers
instead.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-13-dwmw2@infradead.org
All writes to the relocate_kernel control page are now done *after* the
%cr3 switch via simple %rip-relative addressing, which means the DATA()
macro with its pointer arithmetic can also now be removed.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-12-dwmw2@infradead.org
The kernel's virtual mapping of the relocate_kernel page currently needs
to be RWX because it is written to before the %cr3 switch.
Now that the relocate_kernel page has its own .data section and local
variables, it can also have *global* variables. So eliminate the separate
page_list argument, and write the same information directly to variables
in the relocate_kernel page instead. This way, the relocate_kernel code
itself doesn't need to copy it.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-11-dwmw2@infradead.org
Now that the relocate_kernel page is handled sanely by a linker script
we can have actual data, and just use %rip-relative addressing to access
it.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-10-dwmw2@infradead.org
Now that the copy is executed instead of the original, the relocate_kernel
page can live in the kernel's .text section. This will allow subsequent
commits to actually add real data to it and clean up the code somewhat as
well as making the control page ROX.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-9-dwmw2@infradead.org
This currently calls set_memory_x() from machine_kexec_prepare() just
like the 32-bit version does. That's actually a bit earlier than I'd
like, as it leaves the page RWX all the time the image is even *loaded*.
Subsequent commits will eliminate all the writes to the page between the
point it's marked executable in machine_kexec_prepare() the time that
relocate_kernel() is running and has switched to the identmap %cr3, so
that it can be ROX. But that can't happen until it's moved to the .data
section of the kernel, and *that* can't happen until we start executing
the copy instead of executing it in place in the kernel .text. So break
the circular dependency in those commits by letting it be RWX for now.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-8-dwmw2@infradead.org
There's no need for this to wait until the actual machine_kexec() invocation;
future changes will need to make the control page read-only and executable,
so all writes should be completed before machine_kexec_prepare() returns.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-7-dwmw2@infradead.org
Now that the following fix:
d0ceea662d ("x86/mm: Add _PAGE_NOPTISHADOW bit to avoid updating userspace page tables")
stops kernel_ident_mapping_init() from scribbling over the end of a
4KiB PGD by assuming the following 4KiB will be a userspace PGD,
there's no good reason for the kexec PGD to be part of a single
8KiB allocation with the control_code_page.
( It's not clear that that was the reason for x86_64 kexec doing it that
way in the first place either; there were no comments to that effect and
it seems to have been the case even before PTI came along. It looks like
it was just a happy accident which prevented memory corruption on kexec. )
Either way, it definitely isn't needed now. Just allocate the PGD
separately on x86_64, like i386 already does.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-6-dwmw2@infradead.org
There's no need to swap pages (which involves three memcopies for each
page) in the plain kexec case. Just do a single copy from source to
destination page.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-5-dwmw2@infradead.org