Add a new feature bit that indicates support for workload-based heuristic
feedback to the OS for scheduling decisions.
When the bit is set, threads are classified at runtime into enumerated
classes. The classes represent thread performance/power characteristics
that may benefit from special scheduling behaviors.
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Link: https://lore.kernel.org/r/20241028020251.8085-4-mario.limonciello@amd.com
Merge tag 'x86_urgent_for_v6.12_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Prevent a certain range of pages which get marked as hypervisor-only
from being allocated to a CoCo (SNP) guest, which cannot use them and
thus fails to boot
- Fix the microcode loader on AMD to pay attention to the stepping of a
patch and to handle the case where a BIOS config option splits the
machine into logical NUMA nodes per L3 cache slice
- Disable LAM from being built by default due to security concerns
* tag 'x86_urgent_for_v6.12_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Ensure that RMP table fixups are reserved
x86/microcode/AMD: Split load_microcode_amd()
x86/microcode/AMD: Pay attention to the stepping dynamically
x86/lam: Disable ADDRESS_MASKING in most cases
AMD heterogeneous designs include two types of cores:
* Performance
* Efficiency
Each core type has different highest performance values configured by the
platform. Drivers such as amd_pstate need to identify the type of core to
correctly set an appropriate boost numerator to calculate the maximum
frequency.
X86_FEATURE_AMD_HETEROGENEOUS_CORES is used to identify whether the SoC
supports heterogeneous core types by reading CPUID leaf Fn_0x80000026.
On performance cores the scaling factor of 196 is used. On efficiency cores
the scaling factor is the value reported as the highest perf. Efficiency
cores have the same preferred core rankings.
Suggested-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241025171459.1093-6-mario.limonciello@amd.com
Sometimes it is required to take actions based on whether a CPU is a
performance or efficiency core. As an example, the intel_pstate driver
uses the Intel core type to determine CPU scaling. Also, some CPU
vulnerabilities only affect a specific CPU type, like RFDS, which only
affects Intel Atom. Because hybrid parts ship in P+E, P-only (Core) and
E-only (Atom) variants, it is not straightforward to identify which
variant is affected by a type-specific vulnerability.
Such processors do have a CPUID field that can uniquely identify them:
P+E, P-only and E-only parts all enumerate the CPUID.1A.CORE_TYPE
identification, while P+E parts additionally enumerate CPUID.7.HYBRID.
Based on this information, it is possible for the boot CPU to identify
whether a system has mixed CPU types.
Add a new field hw_cpu_type to struct cpuinfo_topology that stores the
hardware specific CPU type. This saves the overhead of IPIs to get the CPU
type of a different CPU. CPU type is populated early in the boot process,
before vulnerabilities are enumerated.
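For illustration, reading the hardware CPU type on Intel boils down to
one CPUID access (a minimal sketch; the 0x20/0x40 encodings come from
the SDM and the helper name here is made up):

  /* CPUID.1A EAX[31:24]: 0x20 = Atom (E-core), 0x40 = Core (P-core) */
  static u8 intel_hw_core_type(void)
  {
          return cpuid_eax(0x1a) >> 24;
  }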
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Co-developed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20241025171459.1093-5-mario.limonciello@amd.com
Enable the SD_ASYM_PACKING domain flag for the PKG domain on AMD heterogeneous
processors. This flag is beneficial for processors with one or more CCDs and
relies on x86_sched_itmt_flags().
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Co-developed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Link: https://lore.kernel.org/r/20241025171459.1093-4-mario.limonciello@amd.com
CPUID leaf 0x80000026 advertises core types with different efficiency
rankings.
Bit 30 indicates the heterogeneous core topology feature. If the bit is
set, not all instances at the current hierarchical level have the same
core topology.
This is described in the AMD64 Architecture Programmer's Manual Volumes
2 and 3, doc ID #25493 and #25494.
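A minimal detection sketch, assuming the leaf is present (the helper
name is illustrative, and subleaf 0 is what gets queried implicitly):

  static bool amd_has_heterogeneous_cores(void)
  {
          if (boot_cpu_data.extended_cpuid_level < 0x80000026)
                  return false;

          /* EAX bit 30: heterogeneous core topology below this level */
          return cpuid_eax(0x80000026) & BIT(30);
  }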
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Co-developed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241025171459.1093-3-mario.limonciello@amd.com
It turns out that AMD has a "Meltdown Lite(tm)" issue with non-canonical
accesses in kernel space. And so using just the high bit to decide
whether an access is in user space or kernel space ends up with the good
old "leak speculative data" if you have the right gadget using the
result:
CVE-2020-12965 "Transient Execution of Non-Canonical Accesses"
Now, the kernel surrounds the access with a STAC/CLAC pair, and those
instructions end up serializing execution on older Zen architectures,
which closes the speculation window.
But that was true only up until Zen 5, which renames the AC bit [1].
That improves performance of STAC/CLAC a lot, but also means that the
speculation window is now open.
Note that this affects not just the new address masking, but also the
regular valid_user_address() check used by access_ok(), and the asm
version of the sign bit check in the get_user() helpers.
It does not affect put_user() or clear_user() variants, since there's no
speculative result to be used in a gadget for those operations.
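For reference, the sign-bit test in question is conceptually just this
(a simplified rendering of the pre-fix logic, not the verbatim kernel
source):

  /*
   * With the non-canonical hole in the middle of the address space,
   * a clear sign bit was taken to mean "user address". On Zen 5 the
   * now-cheap STAC/CLAC no longer closes the speculation window
   * around this check.
   */
  static inline bool valid_user_address(unsigned long addr)
  {
          return (long)addr >= 0;
  }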
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Link: https://lore.kernel.org/all/80d94591-1297-4afb-b510-c665efd37f10@citrix.com/
Link: https://lore.kernel.org/all/20241023094448.GAZxjFkEOOF_DM83TQ@fat_crate.local/ [1]
Link: https://www.amd.com/en/resources/product-security/bulletin/amd-sb-1010.html
Link: https://arxiv.org/pdf/2108.10771
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> # LAM case
Fixes: 2865baf540 ("x86: support user address masking instead of non-speculative conditional")
Fixes: 6014bc2756 ("x86-64: make access_ok() independent of LAM")
Fixes: b19b74bc99 ("x86/mm: Rework address range check in get_user() and put_user()")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, an unconditional cache flush is performed during every
microcode update. Although the original changelog did not mention
a specific erratum, this measure was primarily intended to address
a specific microcode bug, the load of which has already been blocked by
is_blacklisted(). Therefore, this cache flush is no longer necessary.
Additionally, the side effects of doing this have been overlooked. It
increases CPU rendezvous time during late loading, where the cache flush
takes between 1x and 3.5x longer than the actual microcode update.
Remove native_wbinvd() and update the erratum name to align with the
latest errata documentation, document ID 334163 Version 022US.
[ bp: Zap the flaky documentation URL. ]
Fixes: 91df9fdf51 ("x86/microcode/intel: Writeback and invalidate caches before updating microcode")
Reported-by: Yan Hua Wu <yanhua1.wu@intel.com>
Reported-by: William Xie <william.xie@intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Ashok Raj <ashok.raj@intel.com>
Tested-by: Yan Hua Wu <yanhua1.wu@intel.com>
Link: https://lore.kernel.org/r/20241001161042.465584-2-chang.seok.bae@intel.com
This function should've been split a long time ago because it is used in
two paths:
1) On the late loading path, when the microcode is loaded through the
request_firmware interface
2) In the save_microcode_in_initrd() path which collects all the
microcode patches which are relevant for the current system before
the initrd with the microcode container has been jettisoned.
In that path, it is not really necessary to iterate over the nodes on
a system and match a patch. However, it didn't cause any trouble, so it
was left for a later cleanup.
However, that later cleanup was expedited by the fact that Jens was
enabling "Use L3 as a NUMA node" in the BIOS setting of his machine,
which causes the NUMA CPU masks used in cpumask_of_node() to be
generated *after* 2) above happened on the first node. Which means, all
those masks were funky, wrong, uninitialized and whatnot, leading to
explosions when dereffing c->microcode in load_microcode_amd().
So split that function and do only the necessary work needed at each
stage.
Fixes: 94838d230a ("x86/microcode/AMD: Use the family,model,stepping encoded in the patch ID")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/91194406-3fdf-4e38-9838-d334af538f74@kernel.dk
Commit in Fixes changed how a microcode patch is loaded on Zen and newer but
the patch matching needs to happen with different rigidity, depending on what
is being done:
1) When the patch is added to the patches cache, the stepping must be ignored
because the driver still supports different steppings per system
2) When the patch is matched for loading, then the stepping must be taken into
account because each CPU needs the patch matching its exact stepping
Take care of that by making the matching smarter.
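The two matching modes can be sketched as follows (an illustrative
predicate, not the driver's actual helper):

  static bool patch_matches_cpu(u8 pfam, u8 pmodel, u8 pstep,
                                u8 cfam, u8 cmodel, u8 cstep,
                                bool match_stepping)
  {
          if (pfam != cfam || pmodel != cmodel)
                  return false;

          /* Cache path ignores the stepping; load path requires it. */
          return !match_stepping || pstep == cstep;
  }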
Fixes: 94838d230a ("x86/microcode/AMD: Use the family,model,stepping encoded in the patch ID")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/91194406-3fdf-4e38-9838-d334af538f74@kernel.dk
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM64:
- Fix the guest view of the ID registers, making the relevant fields
writable from userspace (affecting ID_AA64DFR0_EL1 and
ID_AA64PFR1_EL1)
- Correctly expose S1PIE to guests, fixing a regression introduced in
6.12-rc1 with the S1POE support
- Fix the recycling of stage-2 shadow MMUs by tracking the context
(are we allowed to block or not) as well as the recycling state
- Address a couple of issues with the vgic when userspace
misconfigures the emulation, resulting in various splats. Headaches
courtesy of our Syzkaller friends
- Stop wasting space in the HYP idmap, as we are dangerously close to
the 4kB limit, and this has already exploded in -next
- Fix another race in vgic_init()
- Fix a UBSAN error when faking the cache topology with MTE enabled
RISCV:
- RISCV: KVM: use raw_spinlock for critical section in imsic
x86:
- A bandaid for lack of XCR0 setup in selftests, which causes trouble
if the compiler is configured to have x86-64-v3 (with AVX) as the
default ISA. Proper XCR0 setup will come in the next merge window.
- Fix an issue where KVM would not ignore low bits of the nested CR3
and potentially leak up to 31 bytes out of the guest memory's
bounds
- Fix a case in which an out-of-date cached value for the segments
could be returned by KVM_GET_SREGS.
- More cleanups for KVM_X86_QUIRK_SLOT_ZAP_ALL
- Override MTRR state for KVM confidential guests, making it WB by
default as is already the case for Hyper-V guests.
Generic:
- Remove a couple of unused functions"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (27 commits)
RISCV: KVM: use raw_spinlock for critical section in imsic
KVM: selftests: Fix out-of-bounds reads in CPUID test's array lookups
KVM: selftests: x86: Avoid using SSE/AVX instructions
KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory
KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset()
KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL
KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()
KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot
x86/kvm: Override default caching mode for SEV-SNP and TDX
KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomic
KVM: Remove unused kvm_vcpu_gfn_to_pfn
KVM: arm64: Ensure vgic_ready() is ordered against MMIO registration
KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQS
KVM: arm64: Fix shift-out-of-bounds bug
KVM: arm64: Shave a few bytes from the EL2 idmap code
KVM: arm64: Don't eagerly teardown the vgic on init error
KVM: arm64: Expose S1PIE to guests
KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule
KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed
...
Merge tag 'x86_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Explicitly disable the TSC deadline timer when going idle to address
some CPU errata in that area
- Do not apply the Zenbleed fix on anything else except AMD Zen2 on the
late microcode loading path
- Clear CPU buffers later in the NMI exit path on 32-bit to avoid
register clearing while they still contain sensitive data, for the
RFDS mitigation
- Do not clobber EFLAGS.ZF with VERW on the opportunistic SYSRET exit
path on 32-bit
- Fix parsing issues of memory bandwidth specification in sysfs for
resctrl's memory bandwidth allocation feature
- Other small cleanups and improvements
* tag 'x86_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Always explicitly disarm TSC-deadline timer
x86/CPU/AMD: Only apply Zenbleed fix for Zen2 during late microcode load
x86/bugs: Use code segment selector for VERW operand
x86/entry_32: Clear CPU buffers after register restore in NMI return
x86/entry_32: Do not clobber user EFLAGS.ZF
x86/resctrl: Annotate get_mem_config() functions as __init
x86/resctrl: Avoid overflow in MB settings in bw_validate()
x86/amd_nb: Add new PCI ID for AMD family 1Ah model 20h
AMD SEV-SNP and Intel TDX have limited access to MTRR: either it is not
advertised in CPUID or it cannot be programmed (on TDX, due to #VE on
CR0.CD clear).
This results in guests using uncached mappings where they shouldn't and
pmd/pud_set_huge() failures due to non-uniform memory type reported by
mtrr_type_lookup().
Override MTRR state, making it WB by default as the kernel does for
Hyper-V guests.
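The override itself reuses the helper already in place for Hyper-V
guests; roughly (the exact call site in the KVM guest init code is
simplified here):

  if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
          /* Set WB as the default cache mode for SEV-SNP and TDX. */
          mtrr_overwrite_state(NULL, 0, MTRR_TYPE_WRBACK);
  }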
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Binbin Wu <binbin.wu@intel.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Message-ID: <20241015095818.357915-1-kirill.shutemov@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When arch_stack_walk_reliable() is called to unwind for newly forked
tasks, the return value is negative which means the call stack is
unreliable. This obviously does not meet expectations.
The root cause is that after commit 3aec4ecb3d ("x86: Rewrite
ret_from_fork() in C"), the 'ret_addr' of a newly forked task is changed
to 'ret_from_fork_asm' (see copy_thread()). Then, at the start of the
unwind, the frame is incorrectly interpreted as not being a "signal"
one, because 'ret_from_fork' is still used to determine the initial
"signal" state (see __unwind_start()). The address then gets incorrectly
decremented in the call to orc_find() (see unwind_next_frame()),
resulting in incorrect ORC data.
To fix it, check 'ret_from_fork_asm' rather than 'ret_from_fork' in
__unwind_start().
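Conceptually, the fix is a one-symbol change (a simplified before/after
sketch, not the verbatim hunk):

  /* Before: never matches for newly forked tasks */
  state->signal = ((void *)state->ip == ret_from_fork);

  /* After: compare against the address forked tasks actually start at */
  state->signal = ((void *)state->ip == ret_from_fork_asm);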
Fixes: 3aec4ecb3d ("x86: Rewrite ret_from_fork() in C")
Signed-off-by: Zheng Yejian <zhengyejian@huaweicloud.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
When kernel IBT is enabled, objtool detects all text references in order
to determine which functions can be indirectly branched to.
In text, such references look like one of the following:
mov $0x0,%rax R_X86_64_32S .init.text+0x7e0a0
lea 0x0(%rip),%rax R_X86_64_PC32 autoremove_wake_function-0x4
Either way the function pointer is denoted by a relocation, so objtool
just reads that.
However there are some "lea xxx(%rip)" cases which don't use relocations
because they're referencing code in the same translation unit. Objtool
doesn't have visibility to those.
The only currently known instances of that are a few hand-coded asm text
references which don't actually need ENDBR. So it's not actually a
problem at the moment.
However if we enable -fpie, the compiler would start generating them and
there would definitely be bugs in the IBT sealing.
Detect non-relocated text references and handle them appropriately.
[ Note: I removed the manual static_call_tramp check -- that should
already be handled by the noendbr check. ]
Reported-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Use the irq_get_nr_irqs() and irq_set_nr_irqs() functions instead of the
global variable 'nr_irqs'. Prepare for changing 'nr_irqs' from an
exported global variable into a variable with file scope.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241015190953.1266194-7-bvanassche@acm.org
New processors have become pickier about the local APIC timer state
before entering low power modes. These low power modes are used (for
example) when you close your laptop lid and suspend. If you put your
laptop in a bag and it is not in this low power mode, it is likely
to get quite toasty while it quickly sucks the battery dry.
The problem boils down to some CPUs' inability to power down until the
CPU recognizes that the local APIC timer is shut down. The current
kernel code works in one-shot and periodic modes but does not work for
deadline mode. Deadline mode has been the supported and preferred mode
on Intel CPUs for over a decade and uses an MSR to drive the timer
instead of an APIC register.
Disable the TSC Deadline timer in lapic_timer_shutdown() by writing to
MSR_IA32_TSC_DEADLINE when in TSC-deadline mode. Also avoid writing
to the initial-count register (APIC_TMICT) which is ignored in
TSC-deadline mode.
Note: The APIC_LVTT|=APIC_LVT_MASKED operation should theoretically be
enough to tell the hardware that the timer will not fire in any of the
timer modes. But mitigating AMD erratum 411[1] also requires clearing
out APIC_TMICT. Solely setting APIC_LVT_MASKED is also ineffective in
practice on Intel Lunar Lake systems, which is the motivation for this
change.
1. 411 Processor May Exit Message-Triggered C1E State Without an Interrupt if Local APIC Timer Reaches Zero - https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/revision-guides/41322_10h_Rev_Gd.pdf
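Put together, the shutdown path looks roughly like this (a sketch along
the lines of the description above; unrelated details of the real
lapic_timer_shutdown() are omitted):

  static int lapic_timer_shutdown(struct clock_event_device *evt)
  {
          unsigned int v = apic_read(APIC_LVTT);

          v |= APIC_LVT_MASKED;
          apic_write(APIC_LVTT, v);

          /* Belt and suspenders: also disarm the actual count source. */
          if (v & APIC_LVT_TIMER_TSCDEADLINE)
                  wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
          else
                  apic_write(APIC_TMICT, 0);

          return 0;
  }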
Fixes: 279f146143 ("x86: apic: Use tsc deadline for oneshot when available")
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Todd Brandt <todd.e.brandt@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241015061522.25288-1-rui.zhang%40intel.com
'mon_info' is already zeroed in the list_for_each_entry() loop below. There
is no need to explicitly initialize it here. It just wastes some space and
cycles.
Remove this unneeded code.
On x86_64, with allmodconfig:
Before:
======
text data bss dec hex filename
74967 5103 1880 81950 1401e arch/x86/kernel/cpu/resctrl/rdtgroup.o
After:
=====
text data bss dec hex filename
74903 5103 1880 81886 13fde arch/x86/kernel/cpu/resctrl/rdtgroup.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/b2ebc809c8b6c6440d17b12ccf7c2d29aaafd488.1720868538.git.christophe.jaillet@wanadoo.fr
Commit
f69759be25 ("x86/CPU/AMD: Move Zenbleed check to the Zen2 init function")
causes a bit in the DE_CFG MSR to get set erroneously after a microcode late
load.
The microcode late load path calls into amd_check_microcode() and subsequently
zen2_zenbleed_check(). Since the above commit removes the cpu_has_amd_erratum()
call from zen2_zenbleed_check(), this will cause all non-Zen2 CPUs to go
through the function and set the bit in the DE_CFG MSR.
Call into the Zenbleed fix path on Zen2 CPUs only.
[ bp: Massage commit message, use cpu_feature_enabled(). ]
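The gating amounts to an early bail-out on everything that is not Zen2;
roughly (a simplified sketch of amd_check_microcode()):

  void amd_check_microcode(void)
  {
          if (!cpu_feature_enabled(X86_FEATURE_ZEN2))
                  return;

          on_each_cpu(zenbleed_check_cpu, NULL, 1);
  }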
Fixes: f69759be25 ("x86/CPU/AMD: Move Zenbleed check to the Zen2 init function")
Signed-off-by: John Allen <john.allen@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20240923164404.27227-1-john.allen@amd.com
ftrace_regs was created to hold the registers that store the information
needed to save function parameters, the return value and the stack. Since
it is a subset of pt_regs, it should only be used via its accessor
functions. But because pt_regs can easily be taken from ftrace_regs (on
most archs), it is tempting to use it directly. But when running on other
architectures, that may fail to build or, worse, build but crash the
kernel!
Instead, make struct ftrace_regs an empty structure and have the
architectures define __arch_ftrace_regs and all the accessor functions
will typecast to it to get to the actual fields. This will help avoid
usage of ftrace_regs directly.
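On x86, for example, the described layout boils down to something like
this (a hedged sketch; the real definitions are spread across the
generic and arch ftrace headers, and the accessor is simplified):

  struct ftrace_regs;                     /* opaque to generic code */

  /* Arch-defined backing structure; x86 wraps pt_regs. */
  struct __arch_ftrace_regs {
          struct pt_regs regs;
  };

  #define arch_ftrace_regs(fregs) ((struct __arch_ftrace_regs *)(fregs))

  static __always_inline unsigned long
  ftrace_regs_get_instruction_pointer(struct ftrace_regs *fregs)
  {
          /* Accessors typecast to the arch struct to reach the fields. */
          return arch_ftrace_regs(fregs)->regs.ip;
  }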
Link: https://lore.kernel.org/all/20241007171027.629bdafd@gandalf.local.home/
Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
Cc: "x86@kernel.org" <x86@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/20241008230628.958778821@goodmis.org
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Since X86_FEATURE_ENTRY_IBPB will invalidate all harmful predictions
with IBPB, no software-based untraining of returns is needed anymore.
Currently, this change affects the retbleed and SRSO mitigations, so if
either of the mitigations is doing IBPB and the other one does the
software sequence, the latter is not needed anymore.
[ bp: Massage commit message. ]
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Johannes Wikner <kwikner@ethz.ch>
Cc: <stable@kernel.org>
entry_ibpb() is designed to follow Intel's IBPB specification regardless
of CPU. This includes invalidating RSB entries.
Hence, if IBPB on VMEXIT has been selected, entry_ibpb() as part of the
RET untraining in the VMEXIT path will take care of all BTB and RSB
clearing so there's no need to explicitly fill the RSB anymore.
[ bp: Massage commit message. ]
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Johannes Wikner <kwikner@ethz.ch>
Cc: <stable@kernel.org>
Set this flag if the CPU has an IBPB implementation that does not
invalidate return target predictions. Zen generations < 4 do not flush
the RSB when executing an IBPB and this bug flag denotes that.
[ bp: Massage. ]
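Applying the flag during CPU init then looks roughly like this (a
sketch; it assumes a companion CPUID feature bit, here called
X86_FEATURE_AMD_IBPB_RET, advertising the stronger IBPB behavior):

  /* IBPB that does not flush return predictions: force the bug bit. */
  if (cpu_has(c, X86_FEATURE_AMD_IBPB) &&
      !cpu_has(c, X86_FEATURE_AMD_IBPB_RET))
          setup_force_cpu_bug(X86_BUG_IBPB_NO_RET);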
Signed-off-by: Johannes Wikner <kwikner@ethz.ch>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
After a recent LLVM change [1] that deduces __cold on functions that only call
cold code (such as __init functions), there is a section mismatch warning from
__get_mem_config_intel(), which got moved to .text.unlikely. as a result of
that optimization:
WARNING: modpost: vmlinux: section mismatch in reference: \
__get_mem_config_intel+0x77 (section: .text.unlikely.) -> thread_throttle_mode_init (section: .init.text)
Mark __get_mem_config_intel() as __init as well since it is only called
from __init code, which clears up the warning.
While __rdt_get_mem_config_amd() does not exhibit a warning because it
does not call any __init code, it is a similar function that is only
called from __init code like __get_mem_config_intel(), so mark it __init
as well to keep the code symmetrical.
CONFIG_SECTION_MISMATCH_WARN_ONLY=n would turn this into a fatal error.
Fixes: 05b93417ce ("x86/intel_rdt/mba: Add primary support for Memory Bandwidth Allocation (MBA)")
Fixes: 4d05bf71f1 ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: <stable@kernel.org>
Link: 6b11573b8c [1]
Link: https://lore.kernel.org/r/20240917-x86-restctrl-get_mem_config_intel-init-v3-1-10d521256284@kernel.org
The resctrl schemata file supports specifying memory bandwidth associated with
the Memory Bandwidth Allocation (MBA) feature via a percentage (this is the
default) or bandwidth in MiBps (when resctrl is mounted with the "mba_MBps"
option).
The allowed range for the bandwidth percentage is from
/sys/fs/resctrl/info/MB/min_bandwidth to 100, using a granularity of
/sys/fs/resctrl/info/MB/bandwidth_gran. The supported range for the MiBps
bandwidth is 0 to U32_MAX.
There are two issues with parsing of MiBps memory bandwidth:
* The user provided MiBps is mistakenly rounded up to the granularity
that is unique to percentage input.
* The user provided MiBps is parsed using unsigned long (thus accepting
values up to ULONG_MAX), and then assigned to u32 that could result in
overflow.
Do not round up the MiBps value and parse user provided bandwidth as the u32
it is intended to be. Use the appropriate kstrtou32() that can detect out of
range values.
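The MiBps side of the parsing then reduces to something like this (a
sketch of the corrected logic, not the full bw_validate()):

  u32 bw;

  /* Parse directly as u32; kstrtou32() returns -ERANGE above U32_MAX. */
  if (kstrtou32(buf, 10, &bw))
          return -EINVAL;

  /* No granularity rounding here - that is a percentage-only concept. */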
Fixes: 8205a078ba ("x86/intel_rdt/mba_sc: Add schemata support")
Fixes: 6ce1560d35 ("x86/resctrl: Switch over to the resctrl mbps_val list")
Co-developed-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Martin Kletzander <nert.pinx@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Add new PCI ID for Device 18h and Function 4.
Signed-off-by: Richard Gong <richard.gong@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20240913162903.649519-1-richard.gong@amd.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Define the helper get_this_hybrid_cpu_native_id() to return the CPU core
native ID. Combined with the core type, this native ID can be used to
uniquely identify the uarch of a CPU core.
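Per the SDM layout of CPUID.1A EAX, the helper is essentially the
following (a sketch of its body; bits 31:24 carry the core type and
bits 23:0 the native model ID):

  u32 get_this_hybrid_cpu_native_id(void)
  {
          return cpuid_eax(0x1a) & GENMASK(23, 0);
  }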
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lkml.kernel.org/r/20240820073853.1974746-3-dapeng1.mi@linux.intel.com
Merge tag 'kvmarm-fixes-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.12, take #1
- Fix pKVM error path on init, making sure we do not change critical
system registers as we're about to fail
- Make sure that the host's vector length is capped at a value
common to all CPUs
- Fix kvm_has_feat*() handling of "negative" features, as the current
code is pretty broken
- Promote Joey to the status of official reviewer, while James steps
down -- hopefully only temporarily
Guard them with CONFIG_KVM_X86_COMMON rather than the two vendor modules.
In practice this has no functional change, because CONFIG_KVM_X86_COMMON
is set if and only if at least one vendor-specific module is being built.
However, it is cleaner to specify CONFIG_KVM_X86_COMMON for functions that
are used in kvm.ko.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 590b09b1d8 ("KVM: x86: Register "emergency disable" callbacks when virt is enabled")
Fixes: 6d55a94222 ("x86/reboot: Unconditionally define cpu_emergency_virt_cb typedef")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull x86 kvm updates from Paolo Bonzini:
"x86:
- KVM currently invalidates the entirety of the page tables, not just
those for the memslot being touched, when a memslot is moved or
deleted.
This does not traditionally have particularly noticeable overhead,
but Intel's TDX will require the guest to re-accept private pages
if they are dropped from the secure EPT, which is a non starter.
Actually, the only reason why this is not already being done is a
bug which was never fully investigated and caused VM instability
with assigned GeForce GPUs, so allow userspace to opt into the new
behavior.
- Advertise AVX10.1 to userspace (effectively prep work for the
"real" AVX10 functionality that is on the horizon)
- Rework common MSR handling code to suppress errors on userspace
accesses to unsupported-but-advertised MSRs
This will allow removing (almost?) all of KVM's exemptions for
userspace access to MSRs that shouldn't exist based on the vCPU
model (the actual cleanup is non-trivial future work)
- Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC)
splits the 64-bit value into the legacy ICR and ICR2 storage,
whereas Intel (APICv) stores the entire 64-bit value at the ICR
offset
- Fix a bug where KVM would fail to exit to userspace if one was
triggered by a fastpath exit handler
- Add fastpath handling of HLT VM-Exit to expedite re-entering the
guest when there's already a pending wake event at the time of the
exit
- Fix a WARN caused by RSM entering a nested guest from SMM with
invalid guest state, by forcing the vCPU out of guest mode prior to
signalling SHUTDOWN (the SHUTDOWN hits the VM altogether, not the
nested guest)
- Overhaul the "unprotect and retry" logic to more precisely identify
cases where retrying is actually helpful, and to harden all retry
paths against putting the guest into an infinite retry loop
- Add support for yielding, e.g. to honor NEED_RESCHED, when zapping
rmaps in the shadow MMU
- Refactor pieces of the shadow MMU related to aging SPTEs in
preparation for adding multi-generation LRU support in KVM
- Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is
enabled, i.e. when the CPU has already flushed the RSB
- Trace the per-CPU host save area as a VMCB pointer to improve
readability and cleanup the retrieval of the SEV-ES host save area
- Remove unnecessary accounting of temporary nested VMCB related
allocations
- Set FINAL/PAGE in the page fault error code for EPT violations if
and only if the GVA is valid. If the GVA is NOT valid, there is no
guest-side page table walk and so stuffing paging related metadata
is nonsensical
- Fix a bug where KVM would incorrectly synthesize a nested VM-Exit
instead of emulating posted interrupt delivery to L2
- Add a lockdep assertion to detect unsafe accesses of vmcs12
structures
- Harden eVMCS loading against an impossible NULL pointer deref
(really truly should be impossible)
- Minor SGX fix and a cleanup
- Misc cleanups
Generic:
- Register KVM's cpuhp and syscore callbacks when enabling
virtualization in hardware, as the sole purpose of said callbacks
is to disable and re-enable virtualization as needed
- Enable virtualization when KVM is loaded, not right before the
first VM is created
Together with the previous change, this greatly simplifies the
logic of the callbacks, because their very existence implies
virtualization is enabled
- Fix a bug that results in KVM prematurely exiting to userspace for
coalesced MMIO/PIO in many cases, clean up the related code, and
add a testcase
- Fix a bug in kvm_clear_guest() where it would trigger a buffer
overflow _if_ the gpa+len crosses a page boundary, which thankfully
is guaranteed to not happen in the current code base. Add WARNs in
more helpers that read/write guest memory to detect similar bugs
Selftests:
- Fix a goof that caused some Hyper-V tests to be skipped when run on
bare metal, i.e. NOT in a VM
- Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES
guest
- Explicitly include one-off assets in .gitignore. Past Sean was
completely wrong about not being able to detect missing .gitignore
entries
- Verify userspace single-stepping works when KVM happens to handle a
VM-Exit in its fastpath
- Misc cleanups"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
Documentation: KVM: fix warning in "make htmldocs"
s390: Enable KVM_S390_UCONTROL config in debug_defconfig
selftests: kvm: s390: Add VM run test case
KVM: SVM: let alternatives handle the cases when RSB filling is required
KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid
KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent
KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals
KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range()
KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper
KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed
KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot
KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps()
KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range()
KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn
KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list
KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version
KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure()
KVM: x86: Update retry protection fields when forcing retry on emulation failure
KVM: x86: Apply retry protection to "unprotect on failure" path
KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn
...
Merge tag 'for-linus-6.12-rc1a-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull more xen updates from Juergen Gross:
"A second round of Xen related changes and features:
- a small fix of the xen-pciback driver for a warning issued by
sparse
- support PCI passthrough when using a PVH dom0
- enable loading the kernel in PVH mode at arbitrary addresses,
avoiding conflicts with the memory map when running as a Xen dom0
using the host memory layout"
* tag 'for-linus-6.12-rc1a-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/pvh: Add 64bit relocation page tables
x86/kernel: Move page table macros to header
x86/pvh: Set phys_base when calling xen_prepare_pvh()
x86/pvh: Make PVH entrypoint PIC for x86-64
xen: sync elfnote.h from xen tree
xen/pciback: fix cast to restricted pci_ers_result_t and pci_power_t
xen/privcmd: Add new syscall to get gsi from dev
xen/pvh: Setup gsi for passthrough device
xen/pci: Add a function to reset device for xen
no_llseek had been defined to NULL two years ago, in commit 868941b144
("fs: remove no_llseek")
To quote that commit,
At -rc1 we'll need do a mechanical removal of no_llseek -
git grep -l -w no_llseek | grep -v porting.rst | while read i; do
sed -i '/\<no_llseek\>/d' $i
done
would do it.
Unfortunately, that hadn't been done. Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
.llseek = no_llseek,
so it's obviously safe.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'asm-generic-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"These are only two small patches, one cleanup for arch/alpha and a
preparation patch cleaning up the handling of runtime constants in the
linker scripts"
* tag 'asm-generic-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
runtime constants: move list of constants to vmlinux.lds.h
alpha: no need to include asm/xchg.h twice
The PVH entry point will need an additional set of prebuilt page tables.
Move the macros and defines to pgtable_64.h, so they can be re-used.
Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Message-ID: <20240823193630.2583107-5-jason.andryuk@amd.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 'struct fd' updates from Al Viro:
"Just the 'struct fd' layout change, with conversion to accessor
helpers"
* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
add struct fd constructors, get rid of __to_fd()
struct fd: representation change
introduce fd_file(), convert all accessors to it.
Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
Along with the usual shower of singleton patches, notable patch series
in this pull request are:
"Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
consistency to the APIs and behaviour of these two core allocation
functions. This also simplifies/enables Rustification.
"Some cleanups for shmem" from Baolin Wang. No functional changes - mode
code reuse, better function naming, logic simplifications.
"mm: some small page fault cleanups" from Josef Bacik. No functional
changes - code cleanups only.
"Various memory tiering fixes" from Zi Yan. A small fix and a little
cleanup.
"mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
simplifications and .text shrinkage.
"Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This
is a feature, it adds new feilds to /proc/vmstat such as
$ grep kstack /proc/vmstat
kstack_1k 3
kstack_2k 188
kstack_4k 11391
kstack_8k 243
kstack_16k 0
which tells us that 11391 processes used 4k of stack while none at all
used 16k. Useful for some system tuning things, but particularly useful
for "the dynamic kernel stack project".
"kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov.
Teaches kmemleak to detect leaksage of percpu memory.
"mm: memcg: page counters optimizations" from Roman Gushchin. "3
independent small optimizations of page counters".
"mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David
Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work
correctly by design rather than by accident.
"mm: remove arch_make_page_accessible()" from David Hildenbrand. Some
folio conversions which make arch_make_page_accessible() unneeded.
"mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel.
Cleans up and fixes our handling of the resetting of the cgroup/process
peak-memory-use detector.
"Make core VMA operations internal and testable" from Lorenzo Stoakes.
Rationalizaion and encapsulation of the VMA manipulation APIs. With a
view to better enable testing of the VMA functions, even from a
userspace-only harness.
"mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in
the zswap global shrinker, resulting in improved performance.
"mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in
some missing info in /proc/zoneinfo.
"mm: replace follow_page() by folio_walk" from David Hildenbrand. Code
cleanups and rationalizations (conversion to folio_walk()) resulting in
the removal of follow_page().
"improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some
tuning to improve zswap's dynamic shrinker. Significant reductions in
swapin and improvements in performance are shown.
"mm: Fix several issues with unaccepted memory" from Kirill Shutemov.
Improvements to the new unaccepted memory feature,
"mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX
PUDs. This was missing, although nobody seems to have notied yet.
"Introduce a store type enum for the Maple tree" from Sidhartha Kumar.
Cleanups and modest performance improvements for the maple tree library
code.
"memcg: further decouple v1 code from v2" from Shakeel Butt. Move more
cgroup v1 remnants away from the v2 memcg code.
"memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds
various warnings telling users that memcg v1 features are deprecated.
"mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li.
Greatly improves the success rate of the mTHP swap allocation.
"mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate
per-arch implementations of numa_memblk code into generic code.
"mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
improves the performance of munmap() of swap-filled ptes.
"support large folio swap-out and swap-in for shmem" from Baolin Wang.
With this series we no longer split shmem large folios into simgle-page
folios when swapping out shmem.
"mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance
improvements and code reductions for gigantic folios.
"support shmem mTHP collapse" from Baolin Wang. Adds support for
khugepaged's collapsing of shmem mTHP folios.
"mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
performance regression due to the addition of mseal().
"Increase the number of bits available in page_type" from Matthew Wilcox.
Increases the number of bits available in page_type!
"Simplify the page flags a little" from Matthew Wilcox. Many legacy page
flags are now folio flags, so the page-based flags and their
accessors/mutators can be removed.
"mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An
optimization which permits us to avoid writing/reading zero-filled zswap
pages to backing store.
"Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window
which occurs when a MAP_FIXED operqtion is occurring during an unrelated
vma tree walk.
"mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the
vma_merge() functionality, making ot cleaner, more testable and better
tested.
"misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor
fixups of DAMON selftests and kunit tests.
"mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code
cleanups and folio conversions.
"Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups
for shmem controls and stats.
"mm: count the number of anonymous THPs per size" from Barry Song. Expose
additional anon THP stats to userspace for improved tuning.
"mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio
conversions and removal of now-unused page-based APIs.
"replace per-quota region priorities histogram buffer with per-context
one" from SeongJae Park. DAMON histogram rationalization.
"Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae
Park. DAMON documentation updates.
"mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve
related doc and warn" from Jason Wang: fixes usage of page allocator
__GFP_NOFAIL and GFP_ATOMIC flags.
"mm: split underused THPs" from Yu Zhao. Improve THP=always policy - this
was overprovisioning THPs in sparsely accessed memory areas.
"zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add
support for zram run-time compression algorithm tuning.
"mm: Care about shadow stack guard gap when getting an unmapped area" from
Mark Brown. Fix up the various arch_get_unmapped_area() implementations
to better respect guard areas.
"Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of
mem_cgroup_iter() and various code cleanups.
"mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
pfnmap support.
"resource: Fix region_intersects() vs add_memory_driver_managed()" from
Huang Ying. Fix a bug in region_intersects() for systems with CXL memory.
"mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a
couple more code paths to correctly recover from the encountering of
poisoned memry.
"mm: enable large folios swap-in support" from Barry Song. Support the
swapin of mTHP memory into appropriately-sized folios, rather than into
single-page folios.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu1BBwAKCRDdBJ7gKXxA
jlWNAQDYlqQLun7bgsAN4sSvi27VUuWv1q70jlMXTfmjJAvQqwD/fBFVR6IOOiw7
AkDbKWP2k0hWPiNJBGwoqxdHHx09Xgo=
=s0T+
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Along with the usual shower of singleton patches, notable patch series
in this pull request are:
- "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
consistency to the APIs and behaviour of these two core allocation
functions. This also simplifies/enables Rustification.
- "Some cleanups for shmem" from Baolin Wang. No functional changes -
more code reuse, better function naming, logic simplifications.
- "mm: some small page fault cleanups" from Josef Bacik. No
functional changes - code cleanups only.
- "Various memory tiering fixes" from Zi Yan. A small fix and a
little cleanup.
- "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
simplifications and .text shrinkage.
- "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
Butt. This is a feature; it adds new fields to /proc/vmstat such as
$ grep kstack /proc/vmstat
kstack_1k 3
kstack_2k 188
kstack_4k 11391
kstack_8k 243
kstack_16k 0
which tells us that 11391 processes used 4k of stack while none at
all used 16k. Useful for some system tuning things, but
particularly useful for "the dynamic kernel stack project".
- "kmemleak: support for percpu memory leak detect" from Pavel
Tikhomirov. Teaches kmemleak to detect leakage of percpu memory.
- "mm: memcg: page counters optimizations" from Roman Gushchin. "3
independent small optimizations of page counters".
- "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
David Hildenbrand. Improves PTE/PMD splitlock detection, makes
powerpc/8xx work correctly by design rather than by accident.
- "mm: remove arch_make_page_accessible()" from David Hildenbrand.
Some folio conversions which make arch_make_page_accessible()
unneeded.
- "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
Finkel. Cleans up and fixes our handling of the resetting of the
cgroup/process peak-memory-use detector.
- "Make core VMA operations internal and testable" from Lorenzo
Stoakes. Rationalization and encapsulation of the VMA manipulation
APIs. With a view to better enable testing of the VMA functions,
even from a userspace-only harness.
- "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
issues in the zswap global shrinker, resulting in improved
performance.
- "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
in some missing info in /proc/zoneinfo.
- "mm: replace follow_page() by folio_walk" from David Hildenbrand.
Code cleanups and rationalizations (conversion to folio_walk())
resulting in the removal of follow_page().
- "improving dynamic zswap shrinker protection scheme" from Nhat
Pham. Some tuning to improve zswap's dynamic shrinker. Significant
reductions in swapin and improvements in performance are shown.
- "mm: Fix several issues with unaccepted memory" from Kirill
Shutemov. Improvements to the new unaccepted memory feature.
- "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
DAX PUDs. This was missing, although nobody seems to have noticed
yet.
- "Introduce a store type enum for the Maple tree" from Sidhartha
Kumar. Cleanups and modest performance improvements for the maple
tree library code.
- "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
more cgroup v1 remnants away from the v2 memcg code.
- "memcg: initiate deprecation of v1 features" from Shakeel Butt.
Adds various warnings telling users that memcg v1 features are
deprecated.
- "mm: swap: mTHP swap allocator base on swap cluster order" from
Chris Li. Greatly improves the success rate of the mTHP swap
allocation.
- "mm: introduce numa_memblks" from Mike Rapoport. Moves various
disparate per-arch implementations of numa_memblk code into generic
code.
- "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
improves the performance of munmap() of swap-filled ptes.
- "support large folio swap-out and swap-in for shmem" from Baolin
Wang. With this series we no longer split shmem large folios into
single-page folios when swapping out shmem.
- "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
performance improvements and code reductions for gigantic folios.
- "support shmem mTHP collapse" from Baolin Wang. Adds support for
khugepaged's collapsing of shmem mTHP folios.
- "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
performance regression due to the addition of mseal().
- "Increase the number of bits available in page_type" from Matthew
Wilcox. Increases the number of bits available in page_type!
- "Simplify the page flags a little" from Matthew Wilcox. Many legacy
page flags are now folio flags, so the page-based flags and their
accessors/mutators can be removed.
- "mm: store zero pages to be swapped out in a bitmap" from Usama
Arif. An optimization which permits us to avoid writing/reading
zero-filled zswap pages to backing store.
- "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
window which occurs when a MAP_FIXED operation is occurring during
an unrelated vma tree walk.
- "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
the vma_merge() functionality, making it cleaner, more testable and
better tested.
- "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
Minor fixups of DAMON selftests and kunit tests.
- "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
Code cleanups and folio conversions.
- "Shmem mTHP controls and stats improvements" from Ryan Roberts.
Cleanups for shmem controls and stats.
- "mm: count the number of anonymous THPs per size" from Barry Song.
Expose additional anon THP stats to userspace for improved tuning.
- "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
folio conversions and removal of now-unused page-based APIs.
- "replace per-quota region priorities histogram buffer with
per-context one" from SeongJae Park. DAMON histogram
rationalization.
- "Docs/damon: update GitHub repo URLs and maintainer-profile" from
SeongJae Park. DAMON documentation updates.
- "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
improve related doc and warn" from Jason Wang: fixes usage of page
allocator __GFP_NOFAIL and GFP_ATOMIC flags.
- "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
This was overprovisioning THPs in sparsely accessed memory areas.
- "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
Add support for zram run-time compression algorithm tuning.
- "mm: Care about shadow stack guard gap when getting an unmapped
area" from Mark Brown. Fix up the various arch_get_unmapped_area()
implementations to better respect guard areas.
- "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
of mem_cgroup_iter() and various code cleanups.
- "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
pfnmap support.
- "resource: Fix region_intersects() vs add_memory_driver_managed()"
from Huang Ying. Fix a bug in region_intersects() for systems with
CXL memory.
- "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
a couple more code paths to correctly recover from the encountering
of poisoned memory.
- "mm: enable large folios swap-in support" from Barry Song. Support
the swapin of mTHP memory into appropriately-sized folios, rather
than into single-page folios"
* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
zram: free secondary algorithms names
uprobes: turn xol_area->pages[2] into xol_area->page
uprobes: introduce the global struct vm_special_mapping xol_mapping
Revert "uprobes: use vm_special_mapping close() functionality"
mm: support large folios swap-in for sync io devices
mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
set_memory: add __must_check to generic stubs
mm/vma: return the exact errno in vms_gather_munmap_vmas()
memcg: cleanup with !CONFIG_MEMCG_V1
mm/show_mem.c: report alloc tags in human readable units
mm: support poison recovery from copy_present_page()
mm: support poison recovery from do_cow_fault()
resource, kunit: add test case for region_intersects()
resource: make alloc_free_mem_region() works for iomem_resource
mm: z3fold: deprecate CONFIG_Z3FOLD
vfio/pci: implement huge_fault support
mm/arm64: support large pfn mappings
mm/x86: support large pfn mappings
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZuu+BAAKCRCAXGG7T9hj
vs3bAP4mp0NnxnDbvPObWoPKmLk5OvHdfY9cV+/M+r/UObfyswD+OYaZH0hVCHP6
L96RzSHE+Q1pKPNpQfMOPcCDFmO3wwI=
=cN0H
-----END PGP SIGNATURE-----
Merge tag 'for-linus-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
- fix a boot problem as a Xen dom0 on some AMD systems
- fix Xen PVH boot problems with KASAN enabled
- fix for a build warning
- fixes to swiotlb-xen
* tag 'for-linus-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/swiotlb: fix allocated size
xen/swiotlb: add alignment check for dma buffers
xen/pci: Avoid -Wflex-array-member-not-at-end warning
xen/xenbus: Convert to use ERR_CAST()
xen, pvh: fix unbootable VMs by inlining memset() in xen_prepare_pvh()
x86/cpu: fix unbootable VMs by inlining memcmp() in hypervisor_cpuid_base()
xen, pvh: fix unbootable VMs (PVH + KASAN - AMD_MEM_ENCRYPT)
xen: tolerate ACPI NVS memory overlapping with Xen allocated memory
xen: allow mapping ACPI data using a different physical address
xen: add capability to remap non-RAM pages to different PFNs
xen: move max_pfn in xen_memory_setup() out of function scope
xen: move checks for e820 conflicts further up
xen: introduce generic helper checking for memory map conflicts
xen: use correct end address of kernel for conflict checking
KVM VMX and x86 PAT MSR macro cleanup for 6.12:
- Add common defines for the x86 architectural memory types, i.e. the types
that are shared across PAT, MTRRs, VMCSes, and EPTPs.
- Clean up the various VMX MSR macros to make the code self-documenting
(inasmuch as possible), and to make it less painful to add new macros.
- Use the topology information of number of packages for making the
decision about TSC trust instead of using the number of online nodes
which is not reflecting the real topology.
- Stop the PIT timer 0 when it's not in use, to stop pointless emulation
in the VMM.
- Fix the PIT timer stop sequence for timer 0 so it truly stops both real
hardware and buggy VMM emulations.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbpN3MTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVAKEADAr379sye4HNn9STpFGKsLWGzsZlch
u5QaR0Nq0WvjO9Rd7+CfeA4AnvXCVwhG70Ut5hEfQEqlpJ62CZrjnAp4YSyaTdyA
16X22z0Pcy7iq0FeaB5C1HK11AMNfpJyQsj3zLWqIrHcwPmPppCRhHpL6RC/pOrL
QEPsG12+kAzfqQVTb6jkNaCezlLHZauJxdQMYqm74uQByfn/jFi4DdNLXgUrY8mJ
gCBBubbF80aBxA6/ZY8aV19zXfklHyxp/u0Y+pVUMgCdyVmh1+Yb5vF4f9J/wbQk
h5k3Z04I4n7/uH9USA6A5MG/6Wsj2fV5JAa2QH+9jM7dLMDAviPyMhsmaCSdOXlQ
fjZczvXTCx5JwIFyGU5sL/ma3mrPkUugiq8LA17rfrclS8KxsxHVOh8TLueF8cIe
5URYIlGg3uDn567rLgUDqieA7HxDxx2Ykqq3aiagNTSaHETFC41oef7Ju01ueriy
KiWb7Q6kPifZ1Z5L+UJGKK/HPp2+ilCQqQmhwToEWmRKCuZgeje2wq37bjk6Z7sV
XAXuxW16qn+2y6aHay/OAK6XAfxk3ZX7YGd1yXYuOfC8phJygCkXWq9rsjufLokz
KTwH2Zj8MlMjfiqvG87aoJkEPy3hIUgIIem+MID4Ff4ERFo0pIL1PAOROnIa/0KN
KDsLPVW4e/S0jA==
=1vKt
-----END PGP SIGNATURE-----
Merge tag 'x86-timers-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 timer updates from Thomas Gleixner:
- Use the topology information of number of packages for making the
decision about TSC trust instead of using the number of online nodes
which is not reflecting the real topology.
- Stop the PIT timer 0 when it's not in use, to stop pointless
emulation in the VMM.
- Fix the PIT timer stop sequence for timer 0 so it truly stops both
real hardware and buggy VMM emulations.
* tag 'x86-timers-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsc: Check for sockets instead of CPUs to make code match comment
clockevents/drivers/i8253: Fix stop sequence for timer 0
x86/i8253: Disable PIT timer 0 when not in use
x86/tsc: Use topology_max_packages() to get package number
- Rework kcpuid to handle the autogenerated CSV file correctly and
update the CSV file to cover the whole zoo of CPUID.
- Avoid memcpy() for ia32 syscall_get_arguments() and use direct
assignments as fortified memcpy() is unhappy about writing/reading
beyond the end of the addressed destination/source struct member
- A few new PCI IDs for AMD
- Update MAINTAINERS to cover x86 specific selftests
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbpOZ8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYofVUEACt8JjMxanswpMy1O6HbJcdVf2wwZ3q
n30BKIFXucvqE6Opc7tWy5THh1+YjHuNXZMkfuuEe2Qjc69z2m3YwUmF0oAB9/AI
6HU4yoePHTbEiPbTjNZMaKL+9CaYJbWkgoEjQpdQGWmo6gJqJxoRF5fY2assLfdJ
zik2faebMNj3l1C1R1w646Zu3CScfZUE8512zwBfOxTqkpVBO4uDrspTzLYljlQN
+gPZ41XDvQKu6SVoVC/TH/oRdshtLBg74fUDoL14yMkWqx3N5IKulFIMCeD2dEHv
pJcbYb8x0pJ1iLx8q/k+spzbvTewY3sAAzbo5JLvcHy1PhW8jc+uCWorMpqLEhH0
LzH1XZwC+kYvJytzZ9EEyYJAAMbh3KRBaphEXmRVec19tujwRy2NGjhRyVmLyqYr
aShIGEVqigCGY8dF0mJgyVu5kd7X4vDZw4xH92c5/G41Ui19cXp1nXh61KMs1WMR
sQm9FDvtRgcX9Pc89RyRRgYz2U75p3gcNyXKio4Oa2VfIlGRYUB5kg5/qDx3RjJx
kZZ44TqPA/oJjpJyNjVrYqD6Gd3WUsjuH2gn6IAohKiSEKDdGTtHu7LEnKEcdkQk
TomxWk1fTR8513GNXgEy2YhXdRN8iTlhgRI9G2BA5c4B6MCGHzPRFzWrosogB3+g
tAOsEN8Sp3ea+g==
=XVR5
-----END PGP SIGNATURE-----
Merge tag 'x86-misc-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Thomas Gleixner:
- Rework kcpuid to handle the autogenerated CSV file correctly and
update the CSV file to cover the whole zoo of CPUID.
- Avoid memcpy() for ia32 syscall_get_arguments() and use direct
assignments as fortified memcpy() is unhappy about writing/reading
beyond the end of the addressed destination/source struct member
(a sketch of the assignment approach follows this list)
- A few new PCI IDs for AMD
- Update MAINTAINERS to cover x86 specific selftests
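A minimal sketch of that direct-assignment approach (illustrative only,
not the exact patch; the register-to-argument mapping is the standard
ia32 syscall ABI):

    /* ia32 syscall ABI: args arrive in ebx, ecx, edx, esi, edi, ebp */
    static void ia32_get_args_sketch(struct pt_regs *regs,
                                     unsigned long *args)
    {
            args[0] = regs->bx;
            args[1] = regs->cx;
            args[2] = regs->dx;
            args[3] = regs->si;
            args[4] = regs->di;
            args[5] = regs->bp;
    }

Each store has a fixed, compiler-visible bound, which is what fortified
memcpy() could not see across adjacent struct members.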
* tag 'x86-misc-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Add selftests/x86 entry
x86/amd_nb: Add new PCI IDs for AMD family 1Ah model 60h-70h
x86/syscall: Avoid memcpy() for ia32 syscall_get_arguments()
MAINTAINERS: Add x86 cpuid database entry
tools/x86/kcpuid: Introduce a complete cpuid bitfields CSV file
tools/x86/kcpuid: Parse subleaf ranges if provided
tools/x86/kcpuid: Recognize all leaves with subleaves
tools/x86/kcpuid: Strip bitfield names leading/trailing whitespace
tools/x86/kcpuid: Protect against faulty "max subleaf" values
tools/x86/kcpuid: Set max possible subleaves count to 64
tools/x86/kcpuid: Properly align long-description columns
tools/x86/kcpuid: Remove unused variable
x86/amd_nb: Add new PCI IDs for AMD family 1Ah model 60h
- Make LAM enablement safe vs. kernel threads using a process mm
temporarily, as switching back to the process would not update CR3 and
therefore not enable LAM, causing faults in user space when using
tagged pointers. Cure it by synchronizing LAM enablement via IPIs to
all CPUs which use the related mm.
- Cure a harmless LAM inconsistency between CR3 and the state during
context switch. It's both confusing and prone to lead to real bugs.
- Fix alternate stack handling for threads which run with a non-zero
protection key. The non-zero key prevents the kernel from accessing
the alternate stack. Cure it by temporarily enabling all protection
keys for the alternate stack setup/restore operations (sketched
after this list).
- Provide an EFI config table identity mapping for the kexec kernel to
prevent kexec failures because the new kernel cannot access the config
table array
- Use GB pages only when a full GB is mapped in the identity map, as
otherwise the CPU can speculate into reserved areas after the end of
memory, which causes malfunctions on UV systems.
- Remove the noisy and pointless SRAT table dump during boot
- Use is_ioremap_addr() for iounmap() address range checks instead of
high_memory. is_ioremap_addr() is more precise.
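A rough sketch of the protection-key item above (illustrative, not the
actual patch; copy_to_altstack() is a hypothetical helper): make PKRU
fully permissive around the altstack access, then restore it:

    static int setup_altstack_sketch(void __user *frame)
    {
            u32 pkru = read_pkru();
            int err;

            write_pkru(0);                 /* 0: every pkey permits access */
            err = copy_to_altstack(frame); /* hypothetical setup helper */
            write_pkru(pkru);              /* restore the task's own keys */

            return err;
    }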
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbpPpYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYddD/9HeH5/rpWS3JU4ZVC+huY28uJuwAFW
ER48zniRbmuz8y+dZZ6K8uvqoWB+ro+yNjA9Jhm9nHUzhs7kE5O8+bmkUi6HXViW
6zS6PW95+u80dmSGy1Gna0SU3158OyBf2X61SySJABLLek7WwrR7jakkgrDBVtL5
ILKS/dUwIrUPoVlszCh9uE0Kj6gdFquooE06sif5EIibnhSgSXfr2EbGj0Qq/YYf
FYfpggSSVpTXFSkZSB2VCEqK66jaGUfKzZ6v1DkSioChUCsky2OO6zD9pk0dMixO
a/0XvRUo3OhiXZbj1tPUtxaEBgJdigpsxke7xQSVxSl+DNNuapiybpgAzFM5Xh+m
yFcP66nIpJcHE10vjVR3jSUlTSb2zk+v9d1Ujj10G1h8RHLTfsTCRHgzs7P0/nkE
NJleWstYVRV5rFpPLoY0ryQmjW/PzYokkaqWKI12Lhxg4ojijZso3pS8WfOsk1/B
081tOZERWeGnJEOOJwwYE1wt0Qq8th4S9b2/fz3vk2fsEHIf42s4fKQwy1CxKopb
PyIrgnZyWx6ueX9QaIGIzGV1GsY4FKMgFJVOyVb0D0stMnr1ty2m3993eNs/nCXy
+rHPMwFteLcwiWp/C3hq5IQd7uEvmRt/mYJ5hdvCj5wCIkXI3JtgsXfLSVs3Ln4f
R6HvZehYmbJoNQ==
=VZcR
-----END PGP SIGNATURE-----
Merge tag 'x86-mm-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 memory management updates from Thomas Gleixner:
- Make LAM enablement safe vs. kernel threads using a process mm
temporarily, as switching back to the process would not update CR3
and therefore not enable LAM, causing faults in user space when
using tagged pointers. Cure it by synchronizing LAM enablement via
IPIs to all CPUs which use the related mm (sketched after this
list).
- Cure a harmless LAM inconsistency between CR3 and the state during
context switch. It's both confusing and prone to lead to real bugs.
- Fix alternate stack handling for threads which run with a non-zero
protection key. The non-zero key prevents the kernel from accessing
the alternate stack. Cure it by temporarily enabling all protection
keys for the alternate stack setup/restore operations.
- Provide an EFI config table identity mapping for the kexec kernel
to prevent kexec failures because the new kernel cannot access the
config table array
- Use GB pages only when a full GB is mapped in the identity map, as
otherwise the CPU can speculate into reserved areas after the end of
memory, which causes malfunctions on UV systems.
- Remove the noisy and pointless SRAT table dump during boot
- Use is_ioremap_addr() for iounmap() address range checks instead of
high_memory. is_ioremap_addr() is more precise.
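A condensed sketch of the IPI synchronization from the first item above
(illustrative; the CR3 rebuild helper is a stand-in name):

    static void lam_update_cr3(void *info)
    {
            struct mm_struct *mm = info;

            /* Runs on every CPU that has @mm loaded: rewrite CR3 so
             * the newly enabled LAM bits take effect immediately.
             */
            if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm)
                    write_cr3(build_cr3_sketch(mm));   /* stand-in */
    }

    static void lam_enable_sync(struct mm_struct *mm)
    {
            /* Set the LAM mode in @mm first, then broadcast to users. */
            on_each_cpu_mask(mm_cpumask(mm), lam_update_cr3, mm, true);
    }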
* tag 'x86-mm-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioremap: Improve iounmap() address range checks
x86/mm: Remove duplicate check from build_cr3()
x86/mm: Remove unused NX related declarations
x86/mm: Remove unused CR3_HW_ASID_BITS
x86/mm: Don't print out SRAT table information
x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
x86/kexec: Add EFI config table identity mapping for kexec kernel
selftests/mm: Add new testcases for pkeys
x86/pkeys: Restore altstack access in sigreturn()
x86/pkeys: Update PKRU to enable all pkeys before XSAVE
x86/pkeys: Add helper functions to update PKRU on the sigframe
x86/pkeys: Add PKRU as a parameter in signal handling functions
x86/mm: Cleanup prctl_enable_tagged_addr() nr_bits error checking
x86/mm: Fix LAM inconsistency during context switch
x86/mm: Use IPIs to synchronize LAM enablement
- Enable FRED right after init_mem_mapping() because at that point the
early IDT fault handler is replaced by the real fault handler. The real
fault handler retrieves the faulting address from the stack frame and
not from CR2 when the FRED feature is set. But that obviously only
works when FRED is enabled in the CPU as well.
- Set SS to __KERNEL_DS when enabling FRED to prevent a corner case where
ERETS can observe an SS mismatch and raise a #GP.
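Conceptually the SS fix is a one-liner (sketch):

    /* Make SS a valid kernel selector before FRED is switched on, so
     * a later ERETS never observes a stale SS and raises #GP.
     */
    loadsegment(ss, __KERNEL_DS);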
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbpNZITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYobh3EACsU/WhmWG0pjqNs+92i/Hjd5QHRxX8
WkyB+j0FQ3ZtQ0aqn73G/VxITxCMAE1fwC2iERlN/9eXjGXcwxeaM9upsMs9gq7v
HmiOPSixn6hH7ulQ6WzDnM478pSnN4lmaZVY2ll1O3z8r79dW2Kz34zSqQCxDGcQ
3sCJkHr7F0YClUaYxH/dok68F69aZXhU4V9URE30Ec74hnomYd4VuFkHwuA77rHG
k81lHxSY9/Ttha91CPiK3/lU+lbehYNNZQ+PzUxkNmm9dlzXI8Vl5JRPJGIlYpWQ
A9L1ZjV4kZcB+tcXPV1bOW+lVSefGVquAia5RgCyUylIFCOtsR/wCoezS3f17Zhf
Ry+kfkYwuDgD0IYNVp6L3+Fx0LtBJT3BorhnS7YhhiqvLW0EpGe/bBzzRFntp4oR
TmRAA3nNn3DBCky3rfGg0TWwqfvy/7c6SPY1Zw1SEmqtDdHB/DyKGt+BVQQ2kqWO
tCtGAMjcE7Cfgca7mI7wILjY7MFirTQW0js6UL5mw22rhZxKV5S9m7N8KkUnFh3S
acjQ1nL5ZBQ9cKdEGrLNHQjfSSc9ju7aXsGXm5c+vrqKbMG8+Nj+1cvzxaLL5xVY
LLKACw5rl0LVXHU5H3IwvS+GMipklrmouikdoI4P8vHMd9GBquR4znO3MzqaLtg2
F1IBXL07s2SYrw==
=cKRu
-----END PGP SIGNATURE-----
Merge tag 'x86-fred-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FRED updates from Thomas Gleixner:
- Enable FRED right after init_mem_mapping() because at that point the
early IDT fault handler is replaced by the real fault handler. The
real fault handler retrieves the faulting address from the stack
frame and not from CR2 when the FRED feature is set. But that
obviously only works when FRED is enabled in the CPU as well.
- Set SS to __KERNEL_DS when enabling FRED to prevent a corner case
where ERETS can observe an SS mismatch and raise a #GP.
* tag 'x86-fred-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Set FRED RSP0 on return to userspace instead of context switch
x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
x86/entry: Test ti_work for zero before processing individual bits
x86/fred: Set SS to __KERNEL_DS when enabling FRED
x86/fred: Enable FRED right after init_mem_mapping()
x86/fred: Move FRED RSP initialization into a separate function
x86/fred: Parse cmdline param "fred=" in cpu_parse_early_param()
Debuggers have to guess the FPU buffer layout in core dumps, which is error
prone. This is because AMD and Intel layouts differ.
To avoid buggy heuristics, add an ELF section which describes the buffer
layout and can be retrieved by tools.
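For tool authors, a hedged sketch of scanning a core file's PT_NOTE bytes
for such a layout note (the 0x205 / NT_X86_XSAVE_LAYOUT type value is an
assumption here; verify it against your elf.h):

    #include <elf.h>
    #include <stddef.h>

    static const void *find_fpu_layout(const char *p, size_t len)
    {
            while (len >= sizeof(Elf64_Nhdr)) {
                    const Elf64_Nhdr *n = (const Elf64_Nhdr *)p;
                    size_t namesz = (n->n_namesz + 3) & ~3UL;
                    size_t descsz = (n->n_descsz + 3) & ~3UL;
                    size_t sz = sizeof(*n) + namesz + descsz;

                    if (sz > len)
                            break;
                    if (n->n_type == 0x205) /* NT_X86_XSAVE_LAYOUT, assumed */
                            return p + sizeof(*n) + namesz; /* payload */
                    p += sz;
                    len -= sz;
            }
            return NULL;    /* no layout note: fall back to heuristics */
    }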
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbpOuwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoTRAEACGHPdAYFp5A396c9qUbHUE2gEKIad2
iuq15TZKLPY/LFqfTwnkp9/nqKtZ0gj4D6XCIucWZjwWJuPgvgGf/tC9Fk+H+C6X
9+rycP3GdqxU28qLxA428SN2Pg3lvqG4rryVWeHUXQ4x8A0DSMV+3pkNY5YgJ+2+
fTzNzVi2tkPRAXhKmj3EdcFcgDPiFQBMm1QNBpc+FqrXk4rjJb9Axln0oT8xemDv
TtJ5BMhFpR73naaiS4IrK8Tk3oFCa8CmafCQfl1zAOor/+EemPQKwMuGeiXE7dLG
eE+OTw5zuxYwlc9WoaPmM/ZiEc5JptpHQUtyHDBN7BaK87VKjsupAXXVOh6XMRCt
R2coqq7fqDqMANwWpUKddky3vSwbst1GZpXGAENOy64yU4VoFutr616WSj3sJfUi
knBauPqLAFeZLhMn/kKr5a0rBgm7VuQSlGPYEhqVdaM3Eb/zJEupFL/bTpqQbbz/
8lo2hYcfDslhShcEZYBwm4eUg+ytZ96K3ciZ5YgNih9LFBxEOo0SY1CqbQJiRtpB
3DmgldYtzRdQq5/JtFGNv717uMESn5khG3qHUpXtrDhWfD8spMWiY1yO/cwWvLFJ
ZS5ATp1dAt1Pbv2MC6r9jQBbW3V7xNNAOJdzUvIZPP04PKeV0ObFOplxhabOzUDj
OLquyIrjpxeisg==
=Vqqo
-----END PGP SIGNATURE-----
Merge tag 'x86-fpu-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Thomas Gleixner:
"Provide FPU buffer layout in core dumps:
Debuggers have to guess the FPU buffer layout in core dumps, which is
error prone. This is because AMD and Intel layouts differ.
To avoid buggy heuristics, add an ELF section which describes the
buffer layout and can be retrieved by tools"
* tag 'x86-fpu-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/elf: Add a new FPU buffer layout info to x86 core files
metadata encoded into UD1.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbpM6ITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoU/kEACWS7Z9mQrWB3r22ufTTPoN+hNudth+
CP8wluXZGvLPh1Pq9dpB9ZniBUN8levYoGyj3NTdr6VtoMJ6NYcZVuH98lCCEMXO
1UmDpydSGZ3BqVgmf4h0eYAJgEiA5qTflXMsh6SfsaPQR7jniJTE451hgJdRIogG
DvgWeVTYn5vt0+oRHJp6ogRLR9oOUgdp94fIwaW34OpesbVJeWUW9zAvBcqdNrDT
KJIM7ta6eivEakFRxriQZTKRc+3ElvZ2fdWNdo9qrRd64MTIOTXAj3G0lXt3YtpZ
06pfJ1CfQ+nwHKfxmmy4gz4eJG7KcpMM+KFZTR3NoSAz4oMTzAvVTxAuEt+pahx6
bmLzaY/I/gRB/Rt+e5oEZSEIq+Sh/Lm3IZoQUhK0+HeJBjwPghBZw3BjkFJvEsMw
S0arvklH2x37gP9rnzOODf2QG7aIAqLTrvRJS610fctwadR4k+2UIE8ZGHOTt55J
UdiK/QhU4gMVaRTebTcPquu3IMmnJjla/bEWdIrBtOSiGtVd1BnAp/kvmkdQH3eI
ZUqJbnfofN4rzSufFqSVY88ORVIcQMnNDLM0qyJofIC79u7OiU40icoDxWS6mDHQ
wQSEszInhwNzyAxoHnNkXDunjDVKhATQPOde0F4TxLcrYD9KRpvJag/1j5fCQi+0
ftODZflfGS2UjQ==
=Z5Hg
-----END PGP SIGNATURE-----
Merge tag 'x86-core-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core update from Thomas Gleixner:
"Enable UBSAN traps for x86, which provides better reporting through
metadata encodeded into UD1"
* tag 'x86-core-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/traps: Enable UBSAN traps on x86
- Handle an allocation failure in the IO/APIC code gracefully instead of
crashing the machine.
- Remove support for APIC local destination mode on 64bit
Logical destination mode of the local APIC is used for systems with up
to 8 CPUs. It has an advantage over physical destination mode as it
allows targeting multiple CPUs at once with IPIs. That advantage was
definitely worth it when systems with up to 8 CPUs were state of the
art for servers and workstations, but that's history.
In the recent past there were quite a few reports of new laptops failing
to boot with logical destination mode, but they work fine with physical
destination mode. That's not a surprise because physical destination
mode is guaranteed to work as it's the only way to get a CPU up and
running via the INIT/INIT/STARTUP sequence. Some of the affected
systems were cured by BIOS updates, but not all OEMs provide them.
As the number of CPUs keeps increasing, logical destination mode becomes
less used and the benefit for small systems, like laptops, is not
really worth the trouble. So just remove logical destination mode
support for 64bit and be done with it.
- Code and comment cleanups in the APIC area.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbpL0gTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYob/VD/984H4Ku5/Djq9HkhBO11hfRTIVz/uf
1/b5ogd3eN0dK5nAv79/Gj7E/zntVsvCjuCYckXz51xPxkQH2LxUhDKqeUwg5lmz
xQV0mKK4fIS/g5yymQGplKc7FfjRAnVL9ZZRRvMkvtqbr1+dA665XrfjFAPkp929
zLaBUbNC6YxYfSddsV+fE8711QP6NzCYdeEBIdZ3NuBrlGfiLy1g1OWCk8za7zjM
cLJfGnU63MNXI4smrZWrQwJDBOiQl1wPbJYWL216OPHofLzLNGNZFXm4y8OJcyN0
WPWn1TliAwpRYx18Z/cEPgkoES8mXqqpPcoo0yBjOmPLl31J6QYU7QQhDb3HOnM/
ALgnnuhoWll5YjNBPJkONAa7lpnmfTbEg82WxaipEscz9CyEBoeOLvYBGPl/YqV+
B8wMOZHDH+BchJ6rYXDA1AmkD+9q86F+ddbiVOKj09dVm/QeLrGjwox1O7yGALGZ
hZPQx9MsTOJqQIh40PsqFko6OiMKuMBIebacFb4NqmVA2/WbRbcmkzRyxk+kkBFv
UMZX5O6sQhat615WZkxTnjmdnXETTIlv4nRQURBd/LF6ECRkXXG11dWaZfTXZ9iW
8NNlHw8mIbGmzn7wWXHlhk7N7vuhWCikAf7V2y+eZUVtE56qGM2volJNCmTZacP2
rrjmltwEGR+5gg==
=Y3a/
-----END PGP SIGNATURE-----
Merge tag 'x86-apic-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Thomas Gleixner:
- Handle an allocation failure in the IO/APIC code gracefully instead
of crashing the machine.
- Remove support for APIC local destination mode on 64bit
Logical destination mode of the local APIC is used for systems with
up to 8 CPUs. It has an advantage over physical destination mode as
it allows targeting multiple CPUs at once with IPIs. That advantage
was definitely worth it when systems with up to 8 CPUs were state of
the art for servers and workstations, but that's history.
In the recent past there were quite a few reports of new laptops
failing to boot with logical destination mode, but they work fine
with physical destination mode. That's not a surprise because physical
destination mode is guaranteed to work as it's the only way to get a
CPU up and running via the INIT/INIT/STARTUP sequence. Some of the
affected systems were cured by BIOS updates, but not all OEMs provide
them.
As the number of CPUs keeps increasing, logical destination mode
becomes less used and the benefit for small systems, like laptops, is
not really worth the trouble. So just remove logical destination mode
support for 64bit and be done with it (a brief ICR sketch follows
this list).
- Code and comment cleanups in the APIC area.
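For reference, the multi-target property of logical mode comes from the
flat-mode ICR destination being a bitmask; a simplified sketch:

    static void ipi_two_cpus_sketch(unsigned int vector)
    {
            /* Flat logical mode: up to 8 CPUs each own one bit of the
             * 8-bit logical destination, so OR-ing bits reaches several
             * CPUs with a single ICR write.
             */
            apic_icr_write(APIC_DEST_LOGICAL | APIC_DM_FIXED | vector,
                           BIT(2) | BIT(5));   /* CPUs 2 and 5 at once */
    }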
* tag 'x86-apic-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/irq: Fix comment on IRQ vector layout
x86/apic: Remove unused extern declarations
x86/apic: Remove logical destination mode for 64-bit
x86/apic: Remove unused inline function apic_set_eoi_cb()
x86/ioapic: Cleanup remaining coding style issues
x86/ioapic: Cleanup line breaks
x86/ioapic: Cleanup bracket usage
x86/ioapic: Cleanup comments
x86/ioapic: Move replace_pin_at_irq_node() to the call site
iommu/vt-d: Cleanup apic_printk()
x86/mpparse: Cleanup apic_printk()s
x86/ioapic: Cleanup guarded debug printk()s
x86/ioapic: Cleanup apic_printk()s
x86/apic: Cleanup apic_printk()s
x86/apic: Provide apic_printk() helpers
x86/ioapic: Use guard() for locking where applicable
x86/ioapic: Cleanup structs
x86/ioapic: Mark mp_alloc_timer_irq() __init
x86/ioapic: Handle allocation failures gracefully
- Use memremap() for the EISA probe instead of ioremap(). EISA is
strictly memory and not MMIO
- Cleanups and enhancements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbpMzcTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoa1WD/4txviyFr+1IY/P/JLxE8cBCW3R3aDY
7+15lGBHiWyJ+uamzlAv8OQab/brgh5ofnRQjkrvK7pLVb7XgBacncFT8tF/j83w
Yw+36NMAkeVAt2rJbWz1ZdgpK+StFMFmXcclv+BL5m0aTuGP1IsJX3KbbpMAYlyY
ju++UAm0c/CSjRyuks1HgqADZ2Q8pjQv3dN723BRBxgRv0b3IcFAl7bBdZGf/w5w
PBC7mFg7x0dAVW3Dpb73VeeNuAJ1LolTasS+OZglo/fhNx1hVHTYInewZ24t37px
xDSDoYSJq0qQsG6T660gEduVqay80A8Jwu9Mwu+0G7krbuSafqDOqcPlFWPMUbiy
VP6EPUh1FaJsH+IxloU5nyfmU6DaukYh1cPkGJBfUyCLG4KDyodIxL5c1c3cG90Y
umK+Ggy3vNbgcLBGJWUgqS9ET55qcxMc+X3DMlnQl+pGhFdkC9cHCTUqSJRwLeuj
4Dvk76zX1VNGmPmr77kP+rIZl9hqmfw4I2hekUaETSuWOAsf/xHzH/TlcOnPVSr0
jidxNvHQ0kuRziCeBH7RUU8jpZyepCY4SIvJt+C2f6pZv/82lOao/ZIqVhyNR5Jh
+zLr+UU6PtxNYyYjg1zcL0FCa6jz40Z2el0cPChoK0xqwOVAPGu/HiqCQW0AmXJR
+Dl/gGrb68vFsg==
=aN01
-----END PGP SIGNATURE-----
Merge tag 'x86-cleanups-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Thomas Gleixner:
"A set of cleanups across x86:
- Use memremap() for the EISA probe instead of ioremap(). EISA is
strictly memory and not MMIO (a short probe sketch follows below)
- Cleanups and enhancements all over the place"
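A minimal sketch of the memremap() style probe (the 0x0FFFD9 signature
address matches the traditional EISA probe; treat the details as
illustrative rather than as the exact patch):

    void *p = memremap(0x0FFFD9, 4, MEMREMAP_WB);

    /* memremap() returns a plain pointer for ordinary memory, so the
     * signature bytes can be compared directly - no readl() required.
     */
    if (p) {
            if (!memcmp(p, "EISA", 4))
                    EISA_bus = 1;
            memunmap(p);
    }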
* tag 'x86-cleanups-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/EISA: Dereference memory directly instead of using readl()
x86/extable: Remove unused declaration fixup_bug()
x86/boot/64: Strip percpu address space when setting up GDT descriptors
x86/cpu: Clarify the error message when BIOS does not support SGX
x86/kexec: Add comments around swap_pages() assembly to improve readability
x86/kexec: Fix a comment of swap_pages() assembly
x86/sgx: Fix a W=1 build warning in function comment
x86/EISA: Use memremap() to probe for the EISA BIOS signature
x86/mtrr: Remove obsolete declaration for mtrr_bp_restore()
x86/cpu_entry_area: Annotate percpu_setup_exception_stacks() as __init
- Prevent spurious KCOV coverage in common_interrupt()
- Fixup the KCOV Makefile directive which got stale due to a source file
rename
- Exclude stack unwinding from KCOV as it creates large amounts of
uninteresting coverage
- Provide a self test to validate that KCOV coverage of the interrupt
handling code does not start before the preempt count has been updated.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbpMeITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaOeD/4oO3g0soK0LIcDIwzaG0ap0hx0nucw
aVSAESuY+ZaSbRbV0fNoYdHORvLdErs67SeyeJRSxTzSNqGH2dGoFrfbkRSXq951
RdCSPP60T7xgqAme1YLDiChfXt/gkbWk/8V5Q7sG3oq3GaVcPUyZgPo4M4HQMdfg
Mla3VPikW5Np3fvs0IZYWQ5VdY0fFOHY5JGMhKJznJxf+Ud+VAtxsbJUcO4MEYWW
A9CVJNHGEXssGA6vm5kgtLu6n2QFuoSj6En/WqLEaJb8f/V332e04Xj2ZHUaOOjV
2abVeDovv+dwUYb4SgrGVg9gfEwwcLPDnmOuuQJmQBB5kU4mJsCqI5TTS6c1fgU4
x8tQsGSOKHFQAI14ZWtitrL4rS2uFcBkAFXo0dF8J5o4989RA8cpfeWVSVUb/UXd
u38BWpc9iHiihHKMmMQgsa1bUMwdSUTvN5XFHkeP4oqUdMiEiWn8iM5+zXd/lfTs
9mrTv+kcLA7mjFOmn4JyE2b+NuiPdgS2FCBGLycHvGwvJoJlO2UmSpF89AJ5vdKs
F8vWLkV+gno/HtwS5o949cAwjYiCodfc7u1W0xj2VDAbx0RbaBw1SDhXMQcLxLgn
BTt4yHKKIeLX++WH3fpeyL91+UJWubUzNzY4rAmLkz5DedWAkpES+45fatp1buIz
Lp/hGiIsG9p5xw==
=tiXT
-----END PGP SIGNATURE-----
Merge tag 'x86-build-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build updates from Thomas Gleixner:
"Updates for KCOV instrumentation on x86:
- Prevent spurious KCOV coverage in common_interrupt()
- Fixup the KCOV Makefile directive which got stale due to a source
file rename
- Exclude stack unwinding from KCOV as it creates large amounts of
uninteresting coverage
- Provide a self test to validate that KCOV coverage of the interrupt
handling code does not start before the preempt count has been updated"
* tag 'x86-build-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Ignore stack unwinding in KCOV
module: Fix KCOV-ignored file name
kcov: Add interrupt handling self test
x86/entry: Remove unwanted instrumentation in common_interrupt()
- Core:
- Overhaul of posix-timers in preparation for removing the
workaround for periodic timers which have signal delivery
ignored.
- Remove the historical extra jiffie in msleep()
msleep() adds an extra jiffie to the timeout value to ensure
minimal sleep time. The timer wheel ensures minimal sleep
time since the large rewrite to a non-cascading wheel, but the
extra jiffie in msleep() remained unnoticed. Remove it.
- Make the timer slack handling correct for realtime tasks.
The procfs interface is inconsistent and neither reflects
reality nor conforms to the man page. Show the correct 0 slack
for real time tasks and enforce it at the core level instead of
having inconsistent individual checks in various timer setup
functions.
- The usual set of updates and enhancements all over the place.
- Drivers:
- Allow the ACPI PM timer to be turned off during suspend
- No new drivers
- The usual updates and enhancements in various drivers
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn7jQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYobqnD/9COlU0nwsulABI/aNIrsh6iYvnCC9v
14CcNta7Qn+157Wfw9BWOyHdNhR1/fPCXE8jJ71zTyIOeW27HV2JyTtxTwe9ZcdK
ViHAaj7YcIjcVUEC3StCoRCPnvLslEw4qJA5AOQuDyMivdQn+YVa2c0baJxKaXZt
xk4HZdMj4NAS0jRKnoZSwtKW/+Oz6rR4GAWrZo+Zs1/8ur3HfqnQfi8lJ1hJtLLW
V7XDCVRvamVi6Ah3ocYPPp/1P6yeQDA1ge9aMddqaza5STWISXRtSnFMUmYP3rbS
FaL8TyL+ilfny8pkGB2WlG6nLuSbtvogtdEh1gG1k1RmZt44kAtk8ba/KiWFPBSb
zK9cjojRMBS71f9G4kmb5F4rnXoLsg1YbD1Nzhz3wq2Cs1Z90dc2QwMren0zoQ1x
Fn56ueRyAiagBlnrSaKyso/2RvqJTNoSdi3RkpjYeAph0UoDCqvTvKjGAf1mWiw1
T/1lUWSVqWHnzZbM7XXzzajIN9bl6A7bbqlcAJ2O9vZIDt7273DG+bQym9Vh6Why
0LTGGERHxzKBsG7WRg+2Gmvv6S18UPKRo8tLtlA758rHlFuPTZCShWrIriwSNl1K
Hxon+d4BparSnm1h9W/NHPKJA574UbWRCBjdk58IkAj8DxZZY4ORD9SMP+ggkV7G
F6p9cgoDNP9KFg==
=jE0N
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Core:
- Overhaul of posix-timers in preparation for removing the workaround
for periodic timers which have signal delivery ignored.
- Remove the historical extra jiffie in msleep()
msleep() adds an extra jiffie to the timeout value to ensure
minimal sleep time. The timer wheel ensures minimal sleep time
since the large rewrite to a non-cascading wheel, but the extra
jiffie in msleep() remained unnoticed. Remove it (a sketch of the
old code follows below).
- Make the timer slack handling correct for realtime tasks.
The procfs interface is inconsistent and neither reflects
reality nor conforms to the man page. Show the correct 0 slack for
real time tasks and enforce it at the core level instead of having
inconsistent individual checks in various timer setup functions.
- The usual set of updates and enhancements all over the place.
Drivers:
- Allow the ACPI PM timer to be turned off during suspend
- No new drivers
- The usual updates and enhancements in various drivers"
* tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
ntp: Make sure RTC is synchronized when time goes backwards
treewide: Fix wrong singular form of jiffies in comments
cpu: Use already existing usleep_range()
timers: Rename next_expiry_recalc() to be unique
platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function
clocksource/drivers/jcore: Use request_percpu_irq()
clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent
clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init
clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init()
clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers
platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended
clocksource: acpi_pm: Add external callback for suspend/resume
clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped()
dt-bindings: timer: rockchip: Add rk3576 compatible
timers: Annotate possible non critical data race of next_expiry
timers: Remove historical extra jiffie for timeout in msleep()
hrtimer: Use and report correct timerslack values for realtime tasks
hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.
timers: Add sparse annotation for timer_sync_wait_running().
signal: Replace BUG_ON()s
...
- Remove LATENCY_MULTIPLIER from cpufreq (Qais Yousef).
- Add support for Granite Rapids and Sierra Forest in OOB mode to the
intel_pstate cpufreq driver (Srinivas Pandruvada).
- Add basic support for CPU capacity scaling on x86 and make the
intel_pstate driver set asymmetric CPU capacity on hybrid systems
without SMT (Rafael Wysocki).
- Add missing MODULE_DESCRIPTION() macros to the powerpc cpufreq
driver (Jeff Johnson).
- Several OF related cleanups in cpufreq drivers (Rob Herring).
- Enable COMPILE_TEST for ARM drivers (Rob Herring).
- Introduce quirks for syscon failures and use socinfo to get revision
for TI cpufreq driver (Dhruva Gole, Nishanth Menon).
- Minor cleanups in amd-pstate driver (Anastasia Belova, Dhananjay
Ugwekar).
- Minor cleanups for loongson, cpufreq-dt and powernv cpufreq drivers
(Danila Tikhonov, Huacai Chen, and Liu Jing).
- Make amd-pstate validate return of any attempt to update EPP limits,
which fixes the masking hardware problems (Mario Limonciello).
- Move the calculation of the AMD boost numerator outside of amd-pstate,
correcting acpi-cpufreq on systems with preferred cores (Mario
Limonciello).
- Harden preferred core detection in amd-pstate to avoid potential
false positives (Mario Limonciello).
- Add extra unit test coverage for mode state machine (Mario
Limonciello).
- Fix an "Uninitialized variables" issue in amd-pstste (Qianqiang Liu).
- Add Granite Rapids Xeon support to intel_idle (Artem Bityutskiy).
- Disable promotion to C1E on Jasper Lake and Elkhart Lake in
intel_idle (Kai-Heng Feng).
- Use scoped device node handling to fix missing of_node_put() and
simplify walking OF children in the riscv-sbi cpuidle driver (Krzysztof
Kozlowski).
- Remove dead code from cpuidle_enter_state() (Dhruva Gole).
- Change an error pointer to NULL to fix error handling in the
intel_rapl power capping driver (Dan Carpenter).
- Fix off by one in get_rpi() in the intel_rapl power capping
driver (Dan Carpenter).
- Add support for ArrowLake-U to the intel_rapl power capping
driver (Sumeet Pawnikar).
- Fix the energy-pkg event for AMD CPUs in the intel_rapl power capping
driver (Dhananjay Ugwekar).
- Add support for AMD family 1Ah processors to the intel_rapl power
capping driver (Dhananjay Ugwekar).
- Remove unused stub for saveable_highmem_page() and remove deprecated
macros from power management documentation (Andy Shevchenko).
- Use sysfs_emit() and sysfs_emit_at() in "show" functions in the PM
sysfs interface (Xueqin Luo).
- Update the maintainers information for the operating-points-v2-ti-cpu DT
binding (Dhruva Gole).
- Drop unnecessary of_match_ptr() from ti-opp-supply (Rob Herring).
- Add missing MODULE_DESCRIPTION() macros to devfreq governors (Jeff
Johnson).
- Use devm_clk_get_enabled() in the exynos-bus devfreq driver (Anand
Moon).
- Use of_property_present() instead of of_get_property() in the imx-bus
devfreq driver (Rob Herring).
- Update directory handling and installation process in the pm-graph
Makefile and add .gitignore to ignore sleepgraph.py artifacts to
pm-graph (Amit Vadhavana, Yo-Jung Lin).
- Make cpupower display residency value in idle-info (Aboorva
Devarajan).
- Add missing powercap_set_enabled() stub function to cpupower (John
B. Wyatt IV).
- Add SWIG support to cpupower (John B. Wyatt IV).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmbjKEQSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRx8g8P/1RqL6NuCxH4eobwZigeyBS6/sLHPmKo
wqHcerZsU7EH8DOlmBU0SH1Br2WBQAbaP8d1ukT5qkGBrZ+IM/A2ipZct0yAHH2D
aBKwg7V3LvXo2mPuLve0knpM6W7zibPHJJlcjh8DmGQJabhWO7jr+p/0eS4JE2ek
iE5FCXTxhvbcNJ9yWSt7+3HHmvj74P81As7txysLSzhWSZDcqXb0XJRgVJnWDt+x
OyTAMEEAY2BuqmijHzqxxHcA1fxOBK/pa9yfPdKP7ePynLnpP7xd9A5oLbXQ4BL9
PHqpD06ZBdSMQzKkyCODypZt8PL+FcEALE4u9chV/nzVwp7TrtDneXWA7RA0GXgq
mp9hm51GmdptRayePR3s4TCA6a2BUw3Ue4fgs6XF/bexNpc3nx0wtP8HEevcuy8q
Z7XQkpqW942vOohfoN42JwTjfDJhYTwSH3dcIY8UghHtzwZ5YKV1M4f97kNR7V2i
QLJvaGJ5yTTcaHndkpc4EKknPyLRaWPh8h/yVmMRBcAaGBWaImul3a5NI07f0wLM
LTenlpEcls7WSu9n3uvFXvT7nSS2CBV0huTbg449X4T2J0T6EooYsVuHNsFMNFLy
Xm3lUtdm5QjAXFf+azOCO+26XQt8wObC0ttZtCC2j1b8D+9Riuwh5QHLr99rRTzn
7Ic4U5Lkimzx
=JM+K
-----END PGP SIGNATURE-----
Merge tag 'pm-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"By the number of new lines of code, the most visible change here is
the addition of hybrid CPU capacity scaling support to the
intel_pstate driver. Next are the amd-pstate driver changes related to
the calculation of the AMD boost numerator and preferred core
detection.
As far as new hardware support is concerned, the intel_idle driver
will now handle Granite Rapids Xeon processors natively, the
intel_rapl power capping driver will recognize family 1Ah of AMD
processors and Intel ArrowLake-U chips, and intel_pstate will handle
Granite Rapids and Sierra Forest chips in the out-of-band (OOB) mode.
Apart from the above, there is a usual collection of assorted fixes
and code cleanups in many places and there are tooling updates.
Specifics:
- Remove LATENCY_MULTIPLIER from cpufreq (Qais Yousef)
- Add support for Granite Rapids and Sierra Forest in OOB mode to the
intel_pstate cpufreq driver (Srinivas Pandruvada)
- Add basic support for CPU capacity scaling on x86 and make the
intel_pstate driver set asymmetric CPU capacity on hybrid systems
without SMT (Rafael Wysocki)
- Add missing MODULE_DESCRIPTION() macros to the powerpc cpufreq
driver (Jeff Johnson)
- Several OF related cleanups in cpufreq drivers (Rob Herring)
- Enable COMPILE_TEST for ARM drivers (Rob Herring)
- Introduce quirks for syscon failures and use socinfo to get
revision for TI cpufreq driver (Dhruva Gole, Nishanth Menon)
- Minor cleanups in amd-pstate driver (Anastasia Belova, Dhananjay
Ugwekar)
- Minor cleanups for loongson, cpufreq-dt and powernv cpufreq drivers
(Danila Tikhonov, Huacai Chen, and Liu Jing)
- Make amd-pstate validate return of any attempt to update EPP
limits, which fixes the masking hardware problems (Mario
Limonciello)
- Move the calculation of the AMD boost numerator outside of
amd-pstate, correcting acpi-cpufreq on systems with preferred cores
(Mario Limonciello)
- Harden preferred core detection in amd-pstate to avoid potential
false positives (Mario Limonciello)
- Add extra unit test coverage for mode state machine (Mario
Limonciello)
- Fix an "Uninitialized variables" issue in amd-pstste (Qianqiang
Liu)
- Add Granite Rapids Xeon support to intel_idle (Artem Bityutskiy)
- Disable promotion to C1E on Jasper Lake and Elkhart Lake in
intel_idle (Kai-Heng Feng)
- Use scoped device node handling to fix missing of_node_put() and
simplify walking OF children in the riscv-sbi cpuidle driver
(Krzysztof Kozlowski)
- Remove dead code from cpuidle_enter_state() (Dhruva Gole)
- Change an error pointer to NULL to fix error handling in the
intel_rapl power capping driver (Dan Carpenter)
- Fix off by one in get_rpi() in the intel_rapl power capping driver
(Dan Carpenter)
- Add support for ArrowLake-U to the intel_rapl power capping driver
(Sumeet Pawnikar)
- Fix the energy-pkg event for AMD CPUs in the intel_rapl power
capping driver (Dhananjay Ugwekar)
- Add support for AMD family 1Ah processors to the intel_rapl power
capping driver (Dhananjay Ugwekar)
- Remove unused stub for saveable_highmem_page() and remove
deprecated macros from power management documentation (Andy
Shevchenko)
- Use sysfs_emit() and sysfs_emit_at() in "show" functions in the PM
sysfs interface (Xueqin Luo)
- Update the maintainers information for the
operating-points-v2-ti-cpu DT binding (Dhruva Gole)
- Drop unnecessary of_match_ptr() from ti-opp-supply (Rob Herring)
- Add missing MODULE_DESCRIPTION() macros to devfreq governors (Jeff
Johnson)
- Use devm_clk_get_enabled() in the exynos-bus devfreq driver (Anand
Moon)
- Use of_property_present() instead of of_get_property() in the
imx-bus devfreq driver (Rob Herring)
- Update directory handling and installation process in the pm-graph
Makefile and add .gitignore to ignore sleepgraph.py artifacts to
pm-graph (Amit Vadhavana, Yo-Jung Lin)
- Make cpupower display residency value in idle-info (Aboorva
Devarajan)
- Add missing powercap_set_enabled() stub function to cpupower (John
B. Wyatt IV)
- Add SWIG support to cpupower (John B. Wyatt IV)"
* tag 'pm-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (62 commits)
cpufreq/amd-pstate-ut: Fix an "Uninitialized variables" issue
cpufreq/amd-pstate-ut: Add test case for mode switches
cpufreq/amd-pstate: Export symbols for changing modes
amd-pstate: Add missing documentation for `amd_pstate_prefcore_ranking`
cpufreq: amd-pstate: Add documentation for `amd_pstate_hw_prefcore`
cpufreq: amd-pstate: Optimize amd_pstate_update_limits()
cpufreq: amd-pstate: Merge amd_pstate_highest_perf_set() into amd_get_boost_ratio_numerator()
x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()
x86/amd: Move amd_get_highest_perf() out of amd-pstate
ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn
ACPI: CPPC: Drop check for non zero perf ratio
x86/amd: Rename amd_get_highest_perf() to amd_get_boost_ratio_numerator()
ACPI: CPPC: Adjust return code for inline functions in !CONFIG_ACPI_CPPC_LIB
x86/amd: Move amd_get_highest_perf() from amd.c to cppc.c
PM: hibernate: Remove unused stub for saveable_highmem_page()
pm:cpupower: Add error warning when SWIG is not installed
MAINTAINERS: Add Maintainers for SWIG Python bindings
pm:cpupower: Include test_raw_pylibcpupower.py
pm:cpupower: Add SWIG bindings files for libcpupower
pm:cpupower: Add missing powercap_set_enabled() stub function
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmbiE0YACgkQaDWVMHDJ
krCmUw//T2NZu0k3H7z2AyBvLlxpdN61tZVZ9UArw71u6PNmDPhhU4Idt/vyidoM
x0+tGswjpIBgxpt/qU2oN0rYMqKO0Dnwdnbw7u1Wfr+ldHYD3jupgzdQtNvCs70P
U8qQZN4ltgppYXIEFnfCXoypaiIafyPiRJhR0YZQoVJ75uwbRB2Vu2ax5n1dak4u
Wkwb55X0ucu2Q93z51tISdtUQQ8+yEytbXP5blu77GCtDf6ZPOFSF/VsBjKU6lER
XQv7H2ReMUaYrPxvn7z60AApsYVDcbOwC0BDe1FmlNllmLlxxoThpfUMX+9+0pAs
szHzta5ZZ83VXoFpVzbLIaEvKJZSrksi4EEsfr1qxEzo1QgTrONWt79OFH3GBi/i
mMug+3vqlVKdx+YoHhZ+e4UcDftz4gqWEwvrlxh0CLomaprZU5ENDF8K53AYVa3g
whnWzCG3fEAdIfFJ3Jfxw6U0mk8l7AnOM98vJK4Wa7faErJGi1nwNkWScmpYEMMP
mJf0TOJZ3fXire51Ivq/xA+xsdb/P2h2nzbUZlaZ3vrGN8jBuglsHZtm9c/Rk+dC
y7/peyPgFGL/1ngOKzzmz6mEQc7POJBKYYuiOe0MEwO3O2YtvK2hAeiL30GPJ31+
lkXC/F8BwNdxaxcE8KGsEUqFpV3ynvS61Oqvl8CQhYmE8JaAAII=
=c73j
-----END PGP SIGNATURE-----
Merge tag 'x86_sgx_for_6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGX updates from Dave Hansen:
"These fix a deadlock in the SGX NUMA allocator.
It's probably only triggerable today on servers with buggy BIOSes, but
it's theoretically possible it can happen on less goofy systems"
* tag 'x86_sgx_for_6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Log information when a node lacks an EPC section
x86/sgx: Fix deadlock in SGX NUMA node search
configurations and scenarios where the mitigations code is irrelevant
- Other small fixlets and improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmbfDhUACgkQEsHwGGHe
VUrF9A//UkVKmIihXXak0GPqFhu8XrWeYlmwLxWe/uIy2hZCLp9L7n4pg0Ikxqz3
9D9hYk+Jykfu/jsv0sR6LH6OAUTlJi+P0w3x3VeL1sgFPUkwFtOaN2v/t5H3SW5r
l+VQpdUXPmLH6QbhvT84U6L/OQYr2cjhiYro47uwM9vO/SNao4HcbC/pdBr2dwxM
KzzA9sEDg3Le391phIhEOIogA1lPNV7KMScg2VjPTqQzEJ3NQVzyYmqjPO70sN9F
sAuksdF+rnPjc9K/W+qUcvlp8e9lDB8g0oPlyoOeubjXsnZU5YchriPdBbyAl0dJ
bjpftXIrBj8Vtmh7Tc0Jx2tlMFXNT5FrzcqdD4sviLnhrKEJSkwAoFgIMp5A+tN8
Kl8MrlABO8I8+zGRQB7TzhwaCC4AxCqUS3UEcYd4CBf5AWqT5i12ijbtIxPtdpG4
5itngIV4HT8casudpC8i8OTjOTggorMa7Pu/bQULhnZwagH8chlBdoOlKKQVkeVG
FUi+L/BljL9mASic7NRZI11tk44m9xWWkbbJOPlZaGJw9YzGrxD0YOfhbgcc9iaX
SOUMVJEhJVJMBISGiBUQDB6r51ee6B8RKJ3ByxzpAbwsUR9cXyfSYfUyE5reQJy9
3luj/iorL3guYU6EGEAtvbuTLGbKqybrV6zOB/QRXHWyhtUgrUA=
=GFld
-----END PGP SIGNATURE-----
Merge tag 'x86_bugs_for_v6.12_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 hw mitigation updates from Borislav Petkov:
- Add CONFIG_ option for every hw CPU mitigation. The intent is to
support configurations and scenarios where the mitigations code is
irrelevant
- Other small fixlets and improvements
* tag 'x86_bugs_for_v6.12_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Fix handling when SRSO mitigation is disabled
x86/bugs: Add missing NO_SSB flag
Documentation/srso: Document a method for checking safe RET operates properly
x86/bugs: Add a separate config for GDS
x86/bugs: Remove GDS Force Kconfig option
x86/bugs: Add a separate config for SSB
x86/bugs: Add a separate config for Spectre V2
x86/bugs: Add a separate config for SRBDS
x86/bugs: Add a separate config for Spectre v1
x86/bugs: Add a separate config for RETBLEED
x86/bugs: Add a separate config for L1TF
x86/bugs: Add a separate config for MMIO Stale Data
x86/bugs: Add a separate config for TAA
x86/bugs: Add a separate config for MDS
which include the vendor and finally drop the old ones which hardcode family 6
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmbe/GYACgkQEsHwGGHe
VUqCcw//Y0HgpGpCzi7/WraEI0kzqduV4SCm2louK8MnkgSIVVrccSg6rTWapvs9
Fxqyg6ZfTNxMuSEexSX9NMc7Nq7nm3m1JPztsZKcwur3fnfwPoxWjfR89dmnXbo6
iUYbolgiPMUo8S18NSXyEaopMwPJYSV/lvMRPclrUsAFhy40+kQcXYVMvP1BAw2z
Pi1TuRqMViVf5lzC71Xy/VntoAgWOtIEPnCLXeOLGIPkRZW+T/jUyTe6xFBOqjrg
BAWfVH8U2Smf2eNzPqO0RDQttSYl6GWcz9bJIPihmlMpFuACSH9j0UadjAMPCVKp
Th0uLxIaEWL7QV7qfSmWm0W79FZAhfJbA+EEKDQrUr+jgTEDE2r2hL7JVo2y8bHV
3nXdaUTnyC0oFr0FPl8yRVk4RN23Uj+fB1m6CCkFnZZQ5xIGT5FERGqut6vEwJlV
fAR8LioKMfRD7q/iQqw/iqMAi8SI0/YQ7R3HGYf6gnjkO86j4snWEdnpWHTraAlR
y24CSUrJ1hh8FRl/ISj56fB6efPm4Ef/znd9CRhWoIaLMgEV8ICDDVkH8RBePaGK
8D83mA/l1WJTAyyAUs6bu96x1TVWK+0xsazQmNJjPeh/mG54mmmrl9wK8YUaK0r4
NasmpovQ7M0QQx5IkFgx4oR84r179pHF246phSV1nrLpX8/EAzQ=
=0KAq
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v6.12_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid updates from Borislav Petkov:
- Add the final conversions to the new Intel VFM CPU model matching
macros which include the vendor and finally drop the old ones which
hardcode family 6
* tag 'x86_cpu_for_v6.12_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/vfm: Delete all the *_FAM6_ CPU #defines
x86/cpu/vfm: Delete X86_MATCH_INTEL_FAM6_MODEL[_STEPPING]() macros
extcon: axp288: Switch to new Intel CPU model defines
x86/cpu/intel: Replace PAT erratum model/family magic numbers with symbolic IFM references
reported through BIOS' BERT method can report the correct CPU number
the error has been detected on
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmbe79UACgkQEsHwGGHe
VUoC/xAAo0ODks0tfq2NR65i9LOpkzsi5hxpGeh71sBehM3/MY+PiIkoHhq2qKTu
Iwe85apPPl2mNAVspLZmIHmpdLNvcNtRThMrPuG5hwyt4vnX02JuSQa/Io8qYwMC
0JXeuJBx8rcHrynqCEU665WAwdgBRtOTNkVQ+EklHkS4Djahmu2p00+pvUu3+B4R
HDMcxfGhMTU/0LHvFaNPSqiWoaRJ1MmZMuiqnDwQTUGVkwwxeDQ8q5rnG3Tc7MVP
p12kKE98UaHikKK3p4YiVu1UshfQEzUsRHdROp6iVphxOrrDURSKybXjf6G2AvDC
/sRE94++jihi/3ULoboUCqSy5a1wiVrLG+JoQka6x66W4CUynGUCuYpa9WCfAsi/
4mvt5TH2C2Lz/9XbljYSs+64S6Yra40aM5zH0IRLMMSHBEL/mkQiXyLmtOajJRXR
vFmqlMO9lfWmADjsz5HzsxORpk/1EtZTbMbSXj56sv7ciE+eqnFLI0xaBMD8Z/Dm
ldiTuInCw9mfIreE+1h1vK44pFY+/d5veFe9Kfv39yFUgObnVZsm0uMyqmZaE565
T3ZVaQ3N/ghV6blQM+10wZNjs9EsVtv/iaoJSDbKJDcaK9B1BSUXOJ7j1VFeNFhe
Dmtn5uu0k5DoSPHjvDVHVltYR2YjEClX2bXhrnW+Cf6276BV4kk=
=TV+0
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v6.12_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- Reorganize the struct mce populating functions so that MCA errors
reported through BIOS' BERT method can report the correct CPU number
the error has been detected on (the resulting helpers are sketched
below)
* tag 'ras_core_for_v6.12_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Use mce_prep_record() helpers for apei_smca_report_x86_error()
x86/mce: Define mce_prep_record() helpers for common and per-CPU fields
x86/mce: Rename mce_setup() to mce_prep_record()
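Per the shortlog above, the populating logic splits into two helpers
along these lines (signatures sketched from the commit titles, not
copied from the source):

    /* Fill the fields that do not depend on which CPU saw the error. */
    void mce_prep_record_common(struct mce *m);

    /* Fill the CPU-local fields for @cpu; for BERT-reported errors this
     * can now be the CPU the firmware says detected the error.
     */
    void mce_prep_record_per_cpu(unsigned int cpu, struct mce *m);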
model and stepping encoded in the patch revision number
- Fix a silly clang warning
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmbe7qkACgkQEsHwGGHe
VUrUFw/+Npl6FuQY9B18TUgTg64ln41fYtMIFpV3Gn64Ny/LhAJ7kyMbDrxH9nCV
rwlCrqsyek0tIFWSslTuvTbrjK+omOLmhlJRlrYQ4V+lEWtliTOenQ35vkBwf3wS
AVnKEhsbe2SWD2eV5kJPpGdNwuZiGhg8t8ZD959OPbMZkyEZ+Rz3KCGYKO5L+5a4
CHGDnM+HOGhCQ4mek8Rya8aFWNWb7eh6CmGjDTYfAGE5AIoNeNRejruRrFXHZIff
N7LlNMfqlTDWLx1Q26OXL9wes3PryNrUiAyTuDQnrS74E5OyvjzsyTW0rirmdFEa
UfcPxedStj8Cse6nJfR0yaprAoTH6eCHkzj2sPcY8dcl8jhq9ChE2T2yjGSX642f
4zneXA2kFYRpw6E+Y5qqB9kViEZiyUaSZ5LasucqE5TrZwaBPaXMBo3WqhvKRMuc
mjH//Mo8CPNN4RpFk+1Ii8KnTyOE41WbMEJuzqdfQnzKJ2X5xxa6HZB7oHzne/HI
tHEWJCInoRz8losvXPICJb20AKu/8vIS2F5ROXNCDPIAw/Fl+UT1prH4+Wo2nZB+
8wElMzqTaWVcaQ2nAaUDSYompimbYCtgB3KWt9WLnBuHsXVbOQkdNyL7+bcQjV39
KXVxo5QZlqc1Oqea+BURJ7BBq6VOssFiUeg8dW0FE4xzT3CS4N8=
=kPyL
-----END PGP SIGNATURE-----
Merge tag 'x86_microcode_for_v6.12_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loading updates from Borislav Petkov:
- Simplify microcode patches loading on AMD Zen and newer by using the
family, model and stepping encoded in the patch revision number
- Fix a silly clang warning
* tag 'x86_microcode_for_v6.12_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Fix a -Wsometimes-uninitialized clang false positive
x86/microcode/AMD: Use the family,model,stepping encoded in the patch ID
When running as a Xen PV dom0 the system needs to map ACPI data of the
host using host physical addresses, while those addresses can conflict
with the guest physical addresses of the loaded Linux kernel. The same
problem might apply in case a PV guest is configured to use the host
memory map.
This conflict can be solved by mapping the ACPI data to a different
guest physical address, but mapping the data via acpi_os_ioremap()
must still be possible using the host physical address, as this
address might be generated by AML when referencing some of the ACPI
data.
When configured to support running as a Xen PV domain, provide an
implementation of acpi_os_ioremap() which is aware of the possible
need for the above mentioned translation of a host physical address
to the guest physical address (sketched below).
This modification requires including linux/acpi.h in some sources
which need to include asm/acpi.h directly.
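Conceptually the override looks like this (sketch; the translation
helper name is a stand-in, not a real function):

    void __iomem *acpi_os_ioremap(acpi_physical_address phys, acpi_size size)
    {
            /* On a Xen PV dom0 the ACPI data may have been relocated:
             * translate the host physical address AML hands us into
             * the guest physical address it now lives at.
             */
            if (xen_pv_domain())
                    phys = xen_acpi_translate_sketch(phys); /* stand-in */

            return ioremap_cache(phys, size);
    }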
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Merge cpufreq updates for 6.12-rc1:
- Remove LATENCY_MULTIPLIER from cpufreq (Qais Yousef).
- Add support for Granite Rapids and Sierra Forest in OOB mode to the
intel_pstate cpufreq driver (Srinivas Pandruvada).
- Add basic support for CPU capacity scaling on x86 and make the
intel_pstate driver set asymmetric CPU capacity on hybrid systems
without SMT (Rafael Wysocki).
- Add missing MODULE_DESCRIPTION() macros to the powerpc cpufreq
driver (Jeff Johnson).
- Several OF related cleanups in cpufreq drivers (Rob Herring).
- Enable COMPILE_TEST for ARM drivers (Rob Herring).
- Introduce quirks for syscon failures and use socinfo to get revision
for TI cpufreq driver (Dhruva Gole, Nishanth Menon).
- Minor cleanups in amd-pstate driver (Anastasia Belova, Dhananjay
Ugwekar).
- Minor cleanups for loongson, cpufreq-dt and powernv cpufreq drivers
(Danila Tikhonov, Huacai Chen, and Liu Jing).
- Make amd-pstate validate the return value of any attempt to update EPP
limits, which fixes the masking of hardware problems (Mario Limonciello).
- Move the calculation of the AMD boost numerator outside of amd-pstate,
correcting acpi-cpufreq on systems with preferred cores (Mario
Limonciello).
- Harden preferred core detection in amd-pstate to avoid potential
false positives (Mario Limonciello).
- Add extra unit test coverage for mode state machine (Mario
Limonciello).
- Fix an "Uninitialized variables" issue in amd-pstste (Qianqiang Liu).
* pm-cpufreq: (35 commits)
cpufreq/amd-pstate-ut: Fix an "Uninitialized variables" issue
cpufreq/amd-pstate-ut: Add test case for mode switches
cpufreq/amd-pstate: Export symbols for changing modes
amd-pstate: Add missing documentation for `amd_pstate_prefcore_ranking`
cpufreq: amd-pstate: Add documentation for `amd_pstate_hw_prefcore`
cpufreq: amd-pstate: Optimize amd_pstate_update_limits()
cpufreq: amd-pstate: Merge amd_pstate_highest_perf_set() into amd_get_boost_ratio_numerator()
x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()
x86/amd: Move amd_get_highest_perf() out of amd-pstate
ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn
ACPI: CPPC: Drop check for non zero perf ratio
x86/amd: Rename amd_get_highest_perf() to amd_get_boost_ratio_numerator()
ACPI: CPPC: Adjust return code for inline functions in !CONFIG_ACPI_CPPC_LIB
x86/amd: Move amd_get_highest_perf() from amd.c to cppc.c
cpufreq/amd-pstate: Catch failures for amd_pstate_epp_update_limit()
cpufreq: ti-cpufreq: Use socinfo to get revision in AM62 family
cpufreq: Fix the cacography in powernv-cpufreq.c
cpufreq: ti-cpufreq: Introduce quirks to handle syscon fails appropriately
cpufreq: loongson3: Use raw_smp_processor_id() in do_service_request()
cpufreq: amd-pstate: add check for cpufreq_cpu_get's return value
...
Merge tag 'amd-pstate-v6.12-2024-09-11' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux
Merge the second round of amd-pstate changes for 6.12 from Mario
Limonciello:
"* Move the calculation of the AMD boost numerator outside of
amd-pstate, correcting acpi-cpufreq on systems with preferred cores
* Harden preferred core detection to avoid potential false positives
* Add extra unit test coverage for mode state machine"
* tag 'amd-pstate-v6.12-2024-09-11' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux:
cpufreq/amd-pstate-ut: Fix an "Uninitialized variables" issue
cpufreq/amd-pstate-ut: Add test case for mode switches
cpufreq/amd-pstate: Export symbols for changing modes
amd-pstate: Add missing documentation for `amd_pstate_prefcore_ranking`
cpufreq: amd-pstate: Add documentation for `amd_pstate_hw_prefcore`
cpufreq: amd-pstate: Optimize amd_pstate_update_limits()
cpufreq: amd-pstate: Merge amd_pstate_highest_perf_set() into amd_get_boost_ratio_numerator()
x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()
x86/amd: Move amd_get_highest_perf() out of amd-pstate
ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn
ACPI: CPPC: Drop check for non zero perf ratio
x86/amd: Rename amd_get_highest_perf() to amd_get_boost_ratio_numerator()
ACPI: CPPC: Adjust return code for inline functions in !CONFIG_ACPI_CPPC_LIB
x86/amd: Move amd_get_highest_perf() from amd.c to cppc.c
The special case in amd_pstate_highest_perf_set() is the value used
for calculating the boost numerator. Merge this into
amd_get_boost_ratio_numerator() and then use that to calculate boost
ratio.
This allows dropping more special casing of the highest perf value.
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
AMD systems that support preferred cores will use "166" as their
numerator for max frequency calculations instead of "255".
Add a function for detecting preferred cores by looking at the
highest perf value on all cores.
If preferred cores are enabled, return 166; if disabled, return the
value in the highest perf register. As the function will be called
multiple times, cache the boost numerator and whether preferred cores
are enabled in global variables.
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
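A rough sketch of that caching scheme, following the commit description;
the constant 166 mirrors the text above, the CPPC highest-perf fallback
and error handling are condensed and illustrative:
=====
static bool amd_prefcore;	/* cached: are preferred cores enabled? */
static u64 boost_numerator;	/* cached boost numerator */

int amd_get_boost_ratio_numerator(unsigned int cpu, u64 *numerator)
{
	if (!boost_numerator) {
		amd_detect_prefcore(&amd_prefcore);
		if (amd_prefcore)
			boost_numerator = 166;
		else
			cppc_get_highest_perf(cpu, &boost_numerator);
	}

	*numerator = boost_numerator;
	return 0;
}
=====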
amd_pstate_get_highest_perf() is a helper used to get the highest perf
value on AMD systems. It's used in amd-pstate as part of preferred
core handling, but applicable for acpi-cpufreq as well.
Move it out to cppc handling code as amd_get_highest_perf().
Reviewed-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
If the boost ratio isn't calculated properly for the system for any
reason this can cause other problems that are non-obvious.
Raise all messages to warn instead.
Suggested-by: Perry Yuan <Perry.Yuan@amd.com>
Reviewed-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
perf_ratio is a u64 and SCHED_CAPACITY_SCALE is a large number, so
shifting it right by one can never produce zero.
Drop the check.
Suggested-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
The function name is ambiguous because it returns an intermediate value
for calculating maximum frequency rather than the CPPC 'Highest Perf'
register.
Rename the function to clarify its use and allow the function to return
errors. Adjust the consumer in acpi-cpufreq to catch errors.
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
To prepare to let amd_get_highest_perf() detect preferred cores
it will require CPPC functions. Move amd_get_highest_perf() to
cppc.c to prepare for 'preferred core detection' rework.
No functional changes intended.
Reviewed-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Patch series "mm: Care about shadow stack guard gap when getting an
unmapped area", v2.
As covered in the commit log for c44357c2e7 ("x86/mm: care about shadow
stack guard gap during placement") our current mmap() implementation does
not take care to ensure that a new mapping isn't placed with existing
mappings inside its own guard gaps. This is particularly important for
shadow stacks since if two shadow stacks end up getting placed adjacent to
each other then they can overflow into each other which weakens the
protection offered by the feature.
On x86 there is a custom arch_get_unmapped_area() which was updated by the
above commit to cover this case by specifying a start_gap for allocations
with VM_SHADOW_STACK. Both arm64 and RISC-V have equivalent features and
use the generic implementation of arch_get_unmapped_area() so let's make
the equivalent change there so they also don't get shadow stack pages
placed without guard pages. The arm64 and RISC-V shadow stack
implementations are currently on the list:
https://lore.kernel.org/r/20240829-arm64-gcs-v12-0-42fec94743
https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
Given the addition of the use of vm_flags in the generic implementation we
also simplify the set of possibilities that have to be dealt with in the
core code by making arch_get_unmapped_area() take vm_flags as standard.
This is a bit invasive since the prototype change touches quite a few
architectures, but since the parameter is ignored the change is
straightforward, and the simplification of the generic code seems worth it.
This patch (of 3):
When we introduced arch_get_unmapped_area_vmflags() in 961148704a ("mm:
introduce arch_get_unmapped_area_vmflags()") we did so as part of properly
supporting guard pages for shadow stacks on x86_64, which uses a custom
arch_get_unmapped_area(). Equivalent features are also present on both
arm64 and RISC-V, both of which use the generic implementation of
arch_get_unmapped_area() and will require equivalent modification there.
Rather than continue to deal with having two versions of the functions
let's bite the bullet and have all implementations of
arch_get_unmapped_area() take vm_flags as a parameter.
The new parameter is currently ignored by all implementations other than
x86. The only caller that doesn't have a vm_flags available is
mm_get_unmapped_area(), as for the x86 implementation and the wrapper used
on other architectures this is modified to supply no flags.
No functional changes.
Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-0-a46b8b6dc0ed@kernel.org
Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-1-a46b8b6dc0ed@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Helge Deller <deller@gmx.de> [parisc]
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Merge tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Add a documentation overview of Confidential Computing VM support
(Michael Kelley)
- Use lapic timer in a TDX VM without paravisor (Dexuan Cui)
- Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
(Michael Kelley)
- Fix a kexec crash due to VP assist page corruption (Anirudh
Rayabharam)
- Python3 compatibility fix for lsvmbus (Anthony Nandaa)
- Misc fixes (Rachel Menge, Roman Kisel, zhang jiao, Hongbo Li)
* tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
hv: vmbus: Constify struct kobj_type and struct attribute_group
tools: hv: rm .*.cmd when make clean
x86/hyperv: fix kexec crash due to VP assist page corruption
Drivers: hv: vmbus: Fix the misplaced function description
tools: hv: lsvmbus: change shebang to use python3
x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
Documentation: hyperv: Add overview of Confidential Computing VM support
clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor
Drivers: hv: Remove deprecated hv_fcopy declarations
There are several comments all over the place, which uses a wrong singular
form of jiffies.
Replace 'jiffie' by 'jiffy'. No functional change.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-3-e98760256370@linutronix.de
For optimized performance, firmware typically distributes EPC sections
evenly across different NUMA nodes. However, there are scenarios where
a node may have both CPUs and memory but no EPC section configured. For
example, in an 8-socket system with a Sub-Numa-Cluster=2 setup, there
are a total of 16 nodes. Given that the maximum number of supported EPC
sections is 8, it is simply not feasible to assign one EPC section to
each node. This configuration is not incorrect - SGX will still operate
correctly; it is just not optimized from a NUMA standpoint.
For this reason, log a message when a node with both CPUs and memory
lacks an EPC section. This will provide users with a hint as to why they
might be experiencing less-than-ideal performance when running SGX
enclaves.
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/all/20240905080855.1699814-3-aaron.lu%40intel.com
When the current node doesn't have an EPC section configured by firmware
and all other EPC sections are used up, CPU can get stuck inside the
while loop that looks for an available EPC page from remote nodes
indefinitely, leading to a soft lockup. Note how nid_of_current will
never be equal to nid in that while loop because nid_of_current is not
set in sgx_numa_mask.
Also worth mentioning is that it's perfectly fine for the firmware not
to setup an EPC section on a node. While setting up an EPC section on
each node can enhance performance, it is not a requirement for
functionality.
Rework the loop to start and end on *a* node that has SGX memory. This
avoids the deadlock looking for the current SGX-lacking node to show up
in the loop when it never will.
Fixes: 901ddbb9ec ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
Reported-by: "Molina Sabido, Gerardo" <gerardo.molina.sabido@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Zhimin Luo <zhimin.luo@intel.com>
Link: https://lore.kernel.org/all/20240905080855.1699814-2-aaron.lu%40intel.com
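A condensed sketch of the reworked loop: it starts and terminates on a
node known to be in sgx_numa_mask, so the exit condition is reachable
even when the current node lacks EPC (simplified from the description
above, not the verbatim patch):
=====
struct sgx_epc_page *__sgx_alloc_epc_page(void)
{
	int nid_of_current = numa_node_id();
	struct sgx_epc_page *page;
	int nid_start, nid;

	/* Start on a node that actually has SGX memory. */
	nid_start = node_isset(nid_of_current, sgx_numa_mask) ?
		    nid_of_current : next_node_in(nid_of_current, sgx_numa_mask);
	nid = nid_start;

	do {
		page = __sgx_alloc_epc_page_from_node(nid);
		if (page)
			return page;

		nid = next_node_in(nid, sgx_numa_mask);
	} while (nid != nid_start);

	return ERR_PTR(-ENOMEM);
}
=====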
When the SRSO mitigation is disabled, either via mitigations=off or
spec_rstack_overflow=off, the warning about the lack of IBPB-enhancing
microcode is printed anyway.
This is unnecessary since the user has turned off the mitigation.
[ bp: Massage, drop SBPB rationale as it doesn't matter because when
mitigations are disabled x86_pred_cmd is not being used anyway. ]
Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240904150711.193022-1-david.kaplan@amd.com
commit 9636be85cc ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when
CPUs go online/offline") introduces a new cpuhp state for hyperv
initialization.
cpuhp_setup_state() returns the state number if state is
CPUHP_AP_ONLINE_DYN or CPUHP_BP_PREPARE_DYN and 0 for all other states.
For the hyperv case, since a new cpuhp state was introduced it would
return 0. However, in hv_machine_shutdown(), the cpuhp_remove_state() call
is conditioned upon "hyperv_init_cpuhp > 0". This will never be true and
so hv_cpu_die() won't be called on all CPUs. This means the VP assist page
won't be reset. When the kexec kernel tries to setup the VP assist page
again, the hypervisor corrupts the memory region of the old VP assist page
causing a panic in case the kexec kernel is using that memory elsewhere.
This was originally fixed in commit dfe94d4086 ("x86/hyperv: Fix kexec
panic/hang issues").
Get rid of hyperv_init_cpuhp entirely since we are no longer using a
dynamic cpuhp state and use CPUHP_AP_HYPERV_ONLINE directly with
cpuhp_remove_state().
Cc: stable@vger.kernel.org
Fixes: 9636be85cc ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline")
Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20240828112158.3538342-1-anirudh@anirudhrb.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20240828112158.3538342-1-anirudh@anirudhrb.com>
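A minimal sketch of the resulting shutdown path (condensed; the crash
handler details are elided):
=====
static void hv_machine_shutdown(void)
{
	/*
	 * Remove the fixed cpuhp state directly so hv_cpu_die() resets
	 * the VP assist page on every CPU before a kexec kernel boots.
	 */
	if (kexec_in_progress)
		cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);

	native_machine_shutdown();
}
=====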
In order to be able to compute the sizes of tasks consistently across all
CPUs in a hybrid system, it is necessary to provide CPU capacity scaling
information to the scheduler via arch_scale_cpu_capacity(). Moreover,
the value returned by arch_scale_freq_capacity() for the given CPU must
correspond to the arch_scale_cpu_capacity() return value for it, or
utilization computations will be inaccurate.
Add support for it through per-CPU variables holding the capacity and
maximum-to-base frequency ratio (times SCHED_CAPACITY_SCALE) that will
be returned by arch_scale_cpu_capacity() and used by scale_freq_tick()
to compute arch_freq_scale for the current CPU, respectively.
In order to avoid adding measurable overhead for non-hybrid x86 systems,
which are the vast majority in the field, whether or not the new hybrid
CPU capacity scaling will be in effect is controlled by a static key.
This static key is set by calling arch_enable_hybrid_capacity_scale()
which also allocates memory for the per-CPU data and initializes it.
Next, arch_set_cpu_capacity() is used to set the per-CPU variables
mentioned above for each CPU and arch_rebuild_sched_domains() needs
to be called for the scheduler to realize that capacity-aware
scheduling can be used going forward.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> # scale invariance
Link: https://patch.msgid.link/10523497.nUPlyArG6x@rjwysocki.net
[ rjw: Added parens to function kerneldoc comments ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
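Condensed sketch of the mechanism, mirroring the description above rather
than the exact kernel code:
=====
static DEFINE_STATIC_KEY_FALSE(arch_hybrid_cap_scale_key);

struct arch_hybrid_cpu_scale {
	unsigned long capacity;		/* arch_scale_cpu_capacity() value */
	unsigned long freq_ratio;	/* max-to-base ratio * SCHED_CAPACITY_SCALE */
};

static struct arch_hybrid_cpu_scale __percpu *arch_cpu_scale;

unsigned long arch_scale_cpu_capacity(int cpu)
{
	/* Non-hybrid systems never pay more than this static branch. */
	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
		return READ_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity);

	return SCHED_CAPACITY_SCALE;
}
=====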
There's an erratum that prevents the PAT from working correctly:
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-specification-update.pdf
# Document 316515 Version 010
The kernel currently disables PAT support on those CPUs, but it
does it with some magic numbers.
Replace the magic numbers with the new "IFM" macros.
Make the check refer to the last affected CPU (INTEL_CORE_YONAH)
rather than the first fixed one. This makes it easier to find the
documentation of the erratum since Intel documents where it is
broken and not where it is fixed.
I don't think the Pentium Pro (or Pentium II) is actually affected.
But the old check included them, so it can't hurt to keep doing the
same. I'm also not completely sure about the "Pentium M" CPUs
(models 0x9 and 0xd). But, again, they were included in the
old checks and were close Pentium III derivatives, so are likely
affected.
While we're at it, revise the comment to refer to the erratum by name
and to quote the language from the actual errata
doc. That should make it easier to find in the future when the URL
inevitably changes.
Why bother with this in the first place? It actually gets rid of one
of the very few remaining direct references to c->x86{,_model}.
No change in functionality intended.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Link: https://lore.kernel.org/r/20240829220042.1007820-1-dave.hansen@linux.intel.com
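The resulting check looks roughly like this, as a sketch against
struct cpuinfo_x86 *c in the Intel init path; the macro names illustrate
the IFM scheme and are assumptions here:
=====
/*
 * PAT is broken up to and including Core Yonah per the erratum;
 * include Pentium Pro onwards to match the historic check.
 */
if (c->x86_vfm >= INTEL_PENTIUM_PRO && c->x86_vfm <= INTEL_CORE_YONAH)
	clear_cpu_cap(c, X86_FEATURE_PAT);
=====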
Sparse expects an __iomem pointer, but after converting the EISA probe to
memremap() the pointer is a regular memory pointer. Access it directly
instead.
[ tglx: Converted it to fix the already applied version ]
Fixes: 80a4da0564 ("x86/EISA: Use memremap() to probe for the EISA BIOS signature")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/alpine.DEB.2.21.2408261015270.30766@angie.orcam.me.uk
When using resctrl on systems with Sub-NUMA Clustering enabled, monitoring
groups may be allocated RMID values which would overrun the
arch_mbm_{local,total} arrays.
This is due to inconsistencies in whether the SNC-adjusted num_rmid value or
the unadjusted value in resctrl_arch_system_num_rmid_idx() is used. The
num_rmid value for the L3 resource is currently:
resctrl_arch_system_num_rmid_idx() / snc_nodes_per_l3_cache
As a simple fix, make resctrl_arch_system_num_rmid_idx() return the
SNC-adjusted, L3 num_rmid value on x86.
Fixes: e13db55b5a ("x86/resctrl: Introduce snc_nodes_per_l3_cache")
Signed-off-by: Peter Newman <peternewman@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240822190212.1848788-1-peternewman@google.com
The FRED RSP0 MSR points to the top of the kernel stack for user level
event delivery. As this is the task stack it needs to be updated when a
task is scheduled in.
The update is done at context switch. That means it's also done when
switching to kernel threads, which is pointless as those never go out to
user space. For KVM threads this means there are two writes to FRED_RSP0 as
KVM has to switch to the guest value before VMENTER.
Defer the update to the exit to user space path and cache the per CPU
FRED_RSP0 value, so redundant writes can be avoided.
Provide fred_sync_rsp0() for KVM to keep the cache in sync with the actual
MSR value after returning from guest to host mode.
[ tglx: Massage change log ]
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240822073906.2176342-4-xin@zytor.com
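A sketch of the deferral described above; the per-CPU cache avoids
redundant MSR writes on the exit-to-user path (names and details are
illustrative):
=====
static DEFINE_PER_CPU(unsigned long, fred_rsp0_cache);

static __always_inline void fred_update_rsp0(void)
{
	unsigned long rsp0 = (unsigned long)task_stack_page(current) + THREAD_SIZE;

	/* Only write MSR_IA32_FRED_RSP0 when the cached value is stale. */
	if (__this_cpu_read(fred_rsp0_cache) != rsp0) {
		wrmsrns(MSR_IA32_FRED_RSP0, rsp0);
		__this_cpu_write(fred_rsp0_cache, rsp0);
	}
}
=====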
Per the discussion about FRED MSR writes with WRMSRNS instruction [1],
use the alternatives mechanism to choose WRMSRNS when it's available,
otherwise fallback to WRMSR.
Remove the dependency on X86_FEATURE_WRMSRNS as WRMSRNS is no longer
dependent on FRED.
[1] https://lore.kernel.org/lkml/15f56e6a-6edd-43d0-8e83-bb6430096514@citrix.com/
Use DS prefix to pad WRMSR instead of a NOP. The prefix is ignored. At
least that's the current information from the hardware folks.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240822073906.2176342-3-xin@zytor.com
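The patched write then looks roughly like this; the DS-prefixed WRMSR
pads the instruction to the length of WRMSRNS so the two can be swapped
in place by the alternatives machinery (a sketch, not the verbatim
kernel code):
=====
static __always_inline void wrmsrns(u32 msr, u64 val)
{
	/* WRMSRNS when supported, DS-prefixed WRMSR otherwise. */
	asm volatile(ALTERNATIVE("ds wrmsr", ASM_WRMSRNS, X86_FEATURE_WRMSRNS)
		     : : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32)));
}
=====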
SS is initialized to NULL during boot time and not explicitly set to
__KERNEL_DS.
With FRED enabled, if a kernel event is delivered before a CPU goes to
user level for the first time, its SS is NULL thus NULL is pushed into
the SS field of the FRED stack frame. But before ERETS is executed,
the CPU may context switch to another task and go to user level. Then
when the CPU comes back to kernel mode, SS is changed to __KERNEL_DS.
Later when ERETS is executed to return from the kernel event handler,
a #GP fault is generated because SS doesn't match the SS saved in the
FRED stack frame.
Initialize SS to __KERNEL_DS when enabling FRED to prevent that.
Note, IRET doesn't check if SS matches the SS saved in its stack frame,
thus IDT doesn't have this problem. For IDT it doesn't matter whether
SS is set to __KERNEL_DS or not, because it's set to NULL upon interrupt
or exception delivery and __KERNEL_DS upon SYSCALL. Thus it's pointless
to initialize SS for IDT.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240816104316.2276968-1-xin@zytor.com
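The fix itself boils down to a one-liner in the FRED setup path, roughly
(a sketch; placement in cpu_init_fred_exceptions() is per the text above):
=====
/* Keep SS consistent with what ERETS expects in the FRED stack frame. */
loadsegment(ss, __KERNEL_DS);
=====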
Add new PCI IDs for Device 18h and Function 4 to enable the amd_atl driver
on those systems.
Signed-off-by: Richard Gong <richard.gong@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/all/20240819123041.915734-1-richard.gong@amd.com
init_per_cpu_var() returns a pointer in the percpu address space while
rip_rel_ptr() expects a pointer in the generic address space.
When strict address space checks are enabled, GCC's named address space
checks fail:
asm.h:124:63: error: passing argument 1 of 'rip_rel_ptr' from
pointer to non-enclosed address space
Add an explicit cast to remove the address space of the returned pointer.
Fixes: 11e36b0f7c ("x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[]")
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240819083334.148536-1-ubizjak@gmail.com
When SGX is not supported by the BIOS, the kernel log contains the error
'SGX disabled by BIOS', which can be confusing since there might not be an
SGX-related option in the BIOS settings.
For the kernel it's difficult to distinguish between the BIOS not
supporting SGX and the BIOS supporting SGX but having it disabled.
Therefore, update the error message to 'SGX disabled or unsupported by
BIOS' to make it easier for those reading kernel logs to understand what's
happening.
Reported-by: Bo Wu <wubo@uniontech.com>
Co-developed-by: Zelong Xiang <xiangzelong@uniontech.com>
Signed-off-by: Zelong Xiang <xiangzelong@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/all/F8D977CB368423F3+20240825104653.1294624-1-wangyuli@uniontech.com
Closes: https://github.com/linuxdeepin/developer-center/issues/10032
The current assembly around swap_pages() in relocate_kernel() takes some
time to follow because it is easy to lose track of which register holds
what across long stretches of assembly. Add a couple of comments to
clarify the code around swap_pages() and improve readability.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/8b52b0b8513a34b2a02fb4abb05c6700c2821475.1724573384.git.kai.huang@intel.com
When relocate_kernel() gets called, %rdi holds 'indirection_page' and
%rsi holds 'page_list'. And %rdi always holds 'indirection_page' when
swap_pages() is called.
Therefore the comment of the first line code of swap_pages()
movq %rdi, %rcx /* Put the page_list in %rcx */
.. isn't correct because it actually moves the 'indirection_page' to
the %rcx. Fix it.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/adafdfb1421c88efce04420fc9a996c0e2ca1b34.1724573384.git.kai.huang@intel.com
Building the SGX code with W=1 generates below warning:
arch/x86/kernel/cpu/sgx/main.c:741: warning: Function parameter or
struct member 'low' not described in 'sgx_calc_section_metric'
arch/x86/kernel/cpu/sgx/main.c:741: warning: Function parameter or
struct member 'high' not described in 'sgx_calc_section_metric'
...
The function sgx_calc_section_metric() is a simple helper which is only
used in sgx/main.c. There's no need to use kernel-doc style comment for
it.
Downgrade to a normal comment.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240825080649.145250-1-kai.huang@intel.com
The area at the 0x0FFFD9 physical location in the PC memory space is
regular memory, traditionally ROM BIOS and more recently a copy of BIOS
code and data in RAM, write-protected.
Therefore use memremap() to get access to it rather than ioremap(),
avoiding issues in virtualization scenarios and complementing changes such
as commit f7750a7956 ("x86, mpparse, x86/acpi, x86/PCI, x86/dmi, SFI: Use
memremap() for RAM mappings") or commit 5997efb967 ("x86/boot: Use
memremap() to map the MPF and MPC data").
Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/alpine.DEB.2.21.2408242025210.30766@angie.orcam.me.uk
Closes: https://lore.kernel.org/r/20240822095122.736522-1-kirill.shutemov@linux.intel.com
Add defines for the architectural memory types that can be shoved into
various MSRs and registers, e.g. MTRRs, PAT, VMX capabilities MSRs, EPTPs,
etc. While most MSRs/registers support only a subset of all memory types,
the values themselves are architectural and identical across all users.
Leave the goofy MTRR_TYPE_* definitions as-is since they are in a uapi
header, but add compile-time assertions to connect the dots (and sanity
check that the msr-index.h values didn't get fat-fingered).
Keep the VMX_EPTP_MT_* defines so that it's slightly more obvious that the
EPTP holds a single memory type in 3 of its 64 bits; those bits just
happen to be 2:0, i.e. don't need to be shifted.
Opportunistically use X86_MEMTYPE_WB instead of an open coded '6' in
setup_vmcs_config().
No functional change intended.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
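For reference, the architectural encodings being consolidated; these
values are fixed by the SDM and shared by MTRRs, PAT, EPTPs, etc. (a
sketch of the new defines):
=====
#define X86_MEMTYPE_UC		0	/* Uncacheable */
#define X86_MEMTYPE_WC		1	/* Write Combining */
#define X86_MEMTYPE_WT		4	/* Write Through */
#define X86_MEMTYPE_WP		5	/* Write Protected */
#define X86_MEMTYPE_WB		6	/* Write Back */
#define X86_MEMTYPE_UC_MINUS	7	/* UC, overridable by MTRRs */
=====
With these in place, the open-coded '6' in setup_vmcs_config() becomes
X86_MEMTYPE_WB, as noted above.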
Refactor the list of constant variables into a macro.
This should make it easier to add more constants in the future.
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
There are two distinct CPU features related to the use of XSAVES and LBR:
whether LBR is itself supported and whether XSAVES supports LBR. The LBR
subsystem correctly checks both in intel_pmu_arch_lbr_init(), but the
XSTATE subsystem does not.
The LBR bit is only removed from xfeatures_mask_independent when LBR is not
supported by the CPU, but there is no validation of XSTATE support.
If XSAVES does not support LBR the write to IA32_XSS causes a #GP fault,
leaving the state of IA32_XSS unchanged, i.e. zero. The fault is handled
with a warning and the boot continues.
Consequently the next XRSTORS which tries to restore supervisor state fails
with #GP because the RFBM has zero for all supervisor features, which does
not match the XCOMP_BV field.
As XFEATURE_MASK_FPSTATE includes supervisor features, setting up the FPU
causes a #GP, which ends up in fpu_reset_from_exception_fixup(). That fails
due to the same problem resulting in recursive #GPs until the kernel runs
out of stack space and double faults.
Prevent this by storing the supported independent features in
fpu_kernel_cfg during XSTATE initialization and use that cached value for
retrieving the independent feature bits to be written into IA32_XSS.
[ tglx: Massaged change log ]
Fixes: f0dccc9da4 ("x86/fpu/xstate: Support dynamic supervisor feature for LBR")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mitchell Levy <levymitchell0@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240812-xsave-lbr-fix-v3-1-95bac1bf62f4@gmail.com
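A sketch of the resulting helper: mask against the cached,
XSAVES-validated feature set instead of the raw constant when composing
the IA32_XSS value (fpu_kernel_cfg.independent_features is the cache
filled during XSTATE init, per the description above):
=====
static inline u64 xfeatures_mask_independent(void)
{
	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR))
		return fpu_kernel_cfg.independent_features & ~XFEATURE_MASK_LBR;

	return fpu_kernel_cfg.independent_features;
}
=====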
On 64-bit init_mem_mapping() relies on the minimal page fault handler
provided by the early IDT mechanism. The real page fault handler is
installed right afterwards into the IDT.
This is problematic on CPUs which have X86_FEATURE_FRED set because the
real page fault handler retrieves the faulting address from the FRED
exception stack frame and not from CR2, but that does obviously not work
when FRED is not yet enabled in the CPU.
To prevent this enable FRED right after init_mem_mapping() without
interrupt stacks. Those are enabled later in trap_init() after the CPU
entry area is set up.
[ tglx: Encapsulate the FRED details ]
Fixes: 14619d912b ("x86/fred: FRED entry/exit and dispatch code")
Reported-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240709154048.3543361-4-xin@zytor.com
To enable FRED earlier, move the RSP initialization out of
cpu_init_fred_exceptions() into cpu_init_fred_rsps().
This is required as the FRED RSP initialization depends on the availability
of the CPU entry areas which are set up late in trap_init().
No functional change intended. Marked with Fixes as it's a dependency for
the real fix.
Fixes: 14619d912b ("x86/fred: FRED entry/exit and dispatch code")
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240709154048.3543361-3-xin@zytor.com
Depending on whether FRED is enabled, sysvec_install() installs a system
interrupt handler either into FRED's system vector dispatch table or into
the IDT.
However FRED can be disabled later in trap_init(), after sysvec_install()
has been invoked already; e.g., the HYPERVISOR_CALLBACK_VECTOR handler is
registered with sysvec_install() in kvm_guest_init(), which is called in
setup_arch() but way before trap_init().
IOW, there is a window between FRED becoming available and FRED being disabled.
As a result, when FRED is available but disabled, early sysvec_install()
invocations fail to install the IDT handler resulting in spurious
interrupts.
Fix it by parsing cmdline param "fred=" in cpu_parse_early_param() to
ensure that FRED is disabled before the first sysvec_install() invocation.
Fixes: 3810da1271 ("x86/fred: Add a fred= cmdline param")
Reported-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240709154048.3543361-2-xin@zytor.com
x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even
when the state is X2APIC_ON_LOCKED, which prevents the kernel from disabling
it, thereby creating inconsistent state.
Due to the early state check for X2APIC_ON, the code path which warns about
a locked X2APIC cannot be reached.
Test for state < X2APIC_ON instead and move the clearing of the state and
mode variables to the place which actually disables X2APIC.
[ tglx: Massaged change log. Added Fixes tag. Moved clearing so it's at the
right place for back ports ]
Fixes: a57e456a7b ("x86/apic: Fix fallout from x2apic cleanup")
Signed-off-by: Yuntao Wang <yuntao.wang@linux.dev>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240813014827.895381-1-yuntao.wang@linux.dev
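Condensed sketch of the corrected flow; the MSR write-back details are
elided and the messages are illustrative:
=====
void x2apic_disable(void)
{
	u32 x2apic_id;

	if (x2apic_state < X2APIC_ON)	/* was: != X2APIC_ON */
		return;

	x2apic_id = read_apic_id();
	if (x2apic_id >= 255)
		panic("Cannot disable x2APIC, id: %08x\n", x2apic_id);

	if (x2apic_hw_locked()) {
		pr_warn("Cannot disable locked x2APIC\n");
		return;
	}

	__x2apic_disable();

	/* Clear state and mode only where x2APIC is really disabled: */
	x2apic_state = X2APIC_DISABLED;
	x2apic_mode  = 0;
}
=====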
For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
This commit converts (almost) all of f.file to
fd_file(f). It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
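The mechanical shape of the conversion, on a made-up example (the
syscall and do_something() are hypothetical):
=====
SYSCALL_DEFINE1(example, int, fd)
{
	struct fd f = fdget(fd);

	if (!fd_file(f))		/* was: if (!f.file) */
		return -EBADF;

	do_something(fd_file(f));	/* hypothetical consumer; was f.file */
	fdput(f);
	return 0;
}
=====
The boolean-context uses like the one above are what the later fd_empty()
helper is meant to take over.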
Logical destination mode of the local APIC is used for systems with up to
8 CPUs. It has an advantage over physical destination mode as it allows
targeting multiple CPUs at once with IPIs.
That advantage was definitely worth it when systems with up to 8 CPUs
were state of the art for servers and workstations, but that's history.
Aside from that, there are systems which fail to work with logical destination
mode as the ACPI/DMI quirks show and there are AMD Zen1 systems out there
which fail when interrupt remapping is enabled as reported by Rob and
Christian. The latter problem can be cured by firmware updates, but not all
OEMs distribute the required changes.
Physical destination mode is guaranteed to work because it is the only way
to get a CPU up and running via the INIT/INIT/STARTUP sequence.
As the number of CPUs keeps increasing, logical destination mode becomes a
less used code path so there is no real good reason to keep it around.
Therefore remove logical destination mode support for 64-bit and default to
physical destination mode.
Reported-by: Rob Newcater <rob@durendal.co.uk>
Reported-by: Christian Heusel <christian@heusel.eu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Rob Newcater <rob@durendal.co.uk>
Link: https://lore.kernel.org/all/877cd5u671.ffs@tglx
Add Bus Lock Detect (called Bus Lock Trap in AMD docs) support for AMD
platforms. Bus Lock Detect is enumerated with CPUID Fn0000_0007_ECX_x0
bit [24 / BUSLOCKTRAP]. It can be enabled through MSR_IA32_DEBUGCTLMSR.
When enabled, hardware clears DR6[11] and raises a #DB exception on
occurrence of Bus Lock if CPL > 0. More detail about the feature can be
found in AMD APM[1].
[1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June
2023, Vol 2, 13.1.3.6 Bus Lock Trap
https://bugzilla.kernel.org/attachment.cgi?id=304653
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/all/20240808062937.1149-3-ravi.bangoria@amd.com
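Enabling it boils down to setting bit 2 (BUS_LOCK_DETECT) in
MSR_IA32_DEBUGCTLMSR once the CPUID bit is present; a sketch:
=====
static void bus_lock_init(void)
{
	u64 val;

	if (!boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT))
		return;

	rdmsrl(MSR_IA32_DEBUGCTLMSR, val);
	val |= DEBUGCTLMSR_BUS_LOCK_DETECT;	/* BIT(2) */
	wrmsrl(MSR_IA32_DEBUGCTLMSR, val);
}
=====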
Bus Lock Detect functionality on AMD platforms works identical to Intel.
Move split_lock and bus_lock specific code from intel.c to a dedicated
file so that it can be compiled and supported on non-Intel platforms.
Also, introduce CONFIG_X86_BUS_LOCK_DETECT, make it dependent on
CONFIG_CPU_SUP_INTEL and add compilation dependency of the new bus_lock.c
file on CONFIG_X86_BUS_LOCK_DETECT.
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/all/20240808062937.1149-2-ravi.bangoria@amd.com
Stack unwinding produces large amounts of uninteresting coverage.
It's called from KASAN kmalloc/kfree hooks, fault injection, etc.
It's not particularly useful and is not a function of system call args.
Ignore that code.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/eaf54b8634970b73552dcd38bf9be6ef55238c10.1718092070.git.dvyukov@google.com
MTRRs have an obsolete fixed variant for fine grained caching control
of the 640K-1MB region that uses separate MSRs. This fixed variant has
a separate capability bit in the MTRR capability MSR.
So far all x86 CPUs which support MTRR have this separate bit set, so it
went unnoticed that mtrr_save_state() does not check the capability bit
before accessing the fixed MTRR MSRs.
Though on a CPU that does not support the fixed MTRR capability this
results in a #GP. The #GP itself is harmless because the RDMSR fault is
handled gracefully, but results in a WARN_ON().
Add the missing capability check to prevent this.
Fixes: 2b1f6278d7 ("[PATCH] x86: Save the MTRRs of the BSP before booting an AP")
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240808000244.946864-1-ak@linux.intel.com
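The shape of the fix, as a sketch: bail out of mtrr_save_state() when the
fixed-range capability (MTRRcap bit 8) was not detected:
=====
void mtrr_save_state(void)
{
	int first_cpu;

	/* Fixed-range MTRR MSRs only exist when MTRRcap.FIX is set. */
	if (!mtrr_enabled() || !mtrr_state.have_fixed)
		return;

	first_cpu = cpumask_first(cpu_online_mask);
	smp_call_function_single(first_cpu, mtrr_save_fixed_ranges, NULL, 1);
}
=====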
The kernel can change spinlock behavior when running as a guest. But this
guest-friendly behavior causes performance problems on bare metal.
The kernel uses a static key to switch between the two modes.
In theory, the static key is enabled by default (run in guest mode) and
should be disabled for bare metal (and in some guests that want native
behavior or paravirt spinlock).
A performance drop is reported when running encode/decode workload and
BenchSEE cache sub-workload.
Bisect points to commit ce0a1b608b ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS is
disabled the virt_spin_lock_key is incorrectly set to true on bare
metal. The qspinlock degenerates to test-and-set spinlock, which decreases
the performance on bare metal.
Set the default value of virt_spin_lock_key to false. If booting in a VM,
enable this key. Later during the VM initialization, if other
high-efficient spinlock is preferred (e.g. paravirt-spinlock), or the user
wants the native qspinlock (via nopvspin boot commandline), the
virt_spin_lock_key is disabled accordingly.
This results in the following decision matrix:
X86_FEATURE_HYPERVISOR Y Y Y N
CONFIG_PARAVIRT_SPINLOCKS Y Y N Y/N
PV spinlock Y N N Y/N
virt_spin_lock_key N Y/N Y N
Fixes: ce0a1b608b ("x86/paravirt: Silence unused native_pv_lock_init() function warning")
Reported-by: Prem Nath Dey <prem.nath.dey@intel.com>
Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Suggested-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240806112207.29792-1-yu.c.chen@intel.com
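The resulting default, sketched per the description above:
=====
DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);

void __init native_pv_lock_init(void)
{
	/* Only guests get the test-and-set fallback by default. */
	if (cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
		static_branch_enable(&virt_spin_lock_key);
}
=====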
On a platform using the "Multiprocessor Wakeup Structure"[1] to startup
secondary CPUs the control processor needs to memremap() the physical
address of the MP Wakeup Structure mailbox to the variable
acpi_mp_wake_mailbox, which holds the virtual address of mailbox.
To wake up the AP the control processor writes the APIC ID of AP, the
wakeup vector and the ACPI_MP_WAKE_COMMAND_WAKEUP command into the mailbox.
The current implementation doesn't consider the case where boot time CPU
bringup is restricted to 1 with the kernel parameter "maxcpus=1" and the
other CPUs are brought online later from user space, as it marks
acpi_mp_wake_mailbox read-only after init. So when the first AP is brought
online after init, the attempt to update the variable results in a kernel panic.
The memremap() call that initializes the variable cannot be moved into
acpi_parse_mp_wake() because memremap() is not functional at that point in
the boot process. Also as the APs might never be brought up, keep the
memremap() call in acpi_wakeup_cpu() so that the operation only takes place
when needed.
Fixes: 24dd05da8c ("x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init")
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20240805103531.1230635-1-zhiquan1.li@intel.com
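A sketch of the lazy mapping in the wakeup path: the mailbox is mapped on
first use, when memremap() is functional, instead of from
acpi_parse_mp_wake() (condensed; the actual mailbox write is elided):
=====
static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
{
	if (!acpi_mp_wake_mailbox) {
		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
						sizeof(*acpi_mp_wake_mailbox),
						MEMREMAP_WB);
		if (!acpi_mp_wake_mailbox)
			return -EINVAL;
	}

	/* The mailbox write of apicid/start_ip and the WAKEUP command follow. */
	return 0;
}
=====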
Add brackets around if/for constructs as required by coding style or remove
pointless line breaks to make it true single line statements which do not
require brackets.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155441.032045616@linutronix.de
Use proper comment styles and shrink comments to their scope where
applicable.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.969619978@linutronix.de
Cleanup the APIC printk()s which are inside of a apic verbosity guarded
region by using apic_dbg() for the KERN_DEBUG level prints.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.714763708@linutronix.de
Replace apic_printk($LEVEL) with the corresponding apic_pr_*() helpers and
use pr_info() for APIC_QUIET as that is always printed so the indirection
is pointless noise.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.652239904@linutronix.de
Use the new apic_pr_*() helpers and cleanup the apic_printk() maze.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.589821068@linutronix.de
Make them conform to the TIP coding style guide.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.402005874@linutronix.de
Only invoked from check_timer() which is __init too. Cleanup the variable
declaration while at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.339321108@linutronix.de
Breno observed panics when using failslab under certain conditions during
runtime:
can not alloc irq_pin_list (-1,0,20)
Kernel panic - not syncing: IO-APIC: failed to add irq-pin. Can not proceed
panic+0x4e9/0x590
mp_irqdomain_alloc+0x9ab/0xa80
irq_domain_alloc_irqs_locked+0x25d/0x8d0
__irq_domain_alloc_irqs+0x80/0x110
mp_map_pin_to_irq+0x645/0x890
acpi_register_gsi_ioapic+0xe6/0x150
hpet_open+0x313/0x480
That's a pointless panic which is a leftover of the historic IO/APIC code
which panic'ed during early boot when the interrupt allocation failed.
The only place which might justify panic is the PIT/HPET timer_check() code
which tries to figure out whether the timer interrupt is delivered through
the IO/APIC. But that code is not required to handle interrupt allocation
failures. If the interrupt cannot be allocated, timer delivery fails and
the code either panics due to that or falls back to legacy mode.
Cure this by removing the panic wrapper around __add_pin_to_irq_node() and
making mp_irqdomain_alloc() aware of the failure condition and handle it as
any other failure in this function gracefully.
Reported-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Breno Leitao <leitao@debian.org>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/all/ZqfJmUF8sXIyuSHN@gmail.com
Link: https://lore.kernel.org/all/20240802155440.275200843@linutronix.de
Currently ARM64 extracts which specific sanitizer has caused a trap via
encoded data in the trap instruction. Clang on x86 currently encodes the
same data in the UD1 instruction but x86 handle_bug() and
is_valid_bugaddr() currently only look at UD2.
Bring x86 to parity with ARM64, similar to commit 25b84002af ("arm64:
Support Clang UBSAN trap codes for better reporting"). See the llvm
links for information about the code generation.
Enable the reporting of UBSAN sanitizer details on x86 compiled with clang
when CONFIG_UBSAN_TRAP=y by analysing UD1 and retrieving the type immediate
which is encoded by the compiler after the UD1.
[ tglx: Simplified it by moving the printk() into handle_bug() ]
Signed-off-by: Gatlin Newhouse <gatlin.newhouse@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20240724000206.451425-1-gatlin.newhouse@gmail.com
Link: https://github.com/llvm/llvm-project/commit/c5978f42ec8e9#diff-bb68d7cd885f41cfc35843998b0f9f534adb60b415f647109e597ce448e92d9f
Link: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86InstrSystem.td#L27
A kexec kernel boot failure is sometimes observed on AMD CPUs due to an
unmapped EFI config table array. This can be seen when "nogbpages" is on
the kernel command line, and has been observed as a full BIOS reboot rather
than a successful kexec.
This was also the cause of reported regressions attributed to Commit
7143c5f4cf20 ("x86/mm/ident_map: Use gbpages only where full GB page should
be mapped.") which was subsequently reverted.
To avoid this page fault, explicitly include the EFI config table array in
the kexec identity map.
Further explanation:
The following 2 commits caused the EFI config table array to be
accessed when enabling sev at kernel startup.
commit ec1c66af3a ("x86/compressed/64: Detect/setup SEV/SME features
earlier during boot")
commit c01fce9cef ("x86/compressed: Add SEV-SNP feature
detection/setup")
This is in the code that examines whether SEV should be enabled or not, so
it can even affect systems that are not SEV capable.
This may result in a page fault if the EFI config table array's address is
unmapped. Since the page fault occurs before the new kernel establishes its
own identity map and page fault routines, it is unrecoverable and kexec
fails.
Most often, this problem is not seen because the EFI config table array
gets included in the map by the luck of being placed at a memory address
close enough to other memory areas that *are* included in the map created
by kexec.
Both the "nogbpages" command line option and the "use gpbages only where
full GB page should be mapped" change greatly reduce the chance of being
included in the map by luck, which is why the problem appears.
Signed-off-by: Tao Liu <ltao@redhat.com>
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavin Joseph <me@pavinjoseph.com>
Tested-by: Sarah Brofeldt <srhb@dbc.dk>
Tested-by: Eric Hagberg <ehagberg@gmail.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/all/20240717213121.3064030-2-steve.wahl@hpe.com
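Conceptually, the fix adds a mapping of the config table array to the
kexec identity map, along these lines; the helper name, parameters and
bounds handling are illustrative, not the actual patch:
=====
static int map_efi_config_table(struct x86_mapping_info *info, pgd_t *pgd,
				unsigned long tables, unsigned int nr_tables)
{
	unsigned long mstart = tables;
	unsigned long mend = tables + nr_tables * sizeof(efi_config_table_64_t);

	if (!mstart)
		return 0;

	/* Make the array reachable through the kexec identity map. */
	return kernel_ident_mapping_init(info, pgd, mstart, mend);
}
=====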
A Linux guest on Hyper-V gets the TSC frequency from a synthetic MSR, if
available. In this case, set X86_FEATURE_TSC_KNOWN_FREQ so that Linux
doesn't unnecessarily do refined TSC calibration when setting up the TSC
clocksource.
With this change, a message such as this is no longer output during boot
when the TSC is used as the clocksource:
[ 1.115141] tsc: Refined TSC clocksource calibration: 2918.408 MHz
Furthermore, the guest and host will have exactly the same view of the
TSC frequency, which is important for features such as the TSC deadline
timer that are emulated by the Hyper-V host.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Link: https://lore.kernel.org/r/20240606025559.1631-1-mhklinux@outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20240606025559.1631-1-mhklinux@outlook.com>
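A sketch of the corresponding check in ms_hyperv_init_platform(); the
feature-flag names follow the Hyper-V TLFS definitions and should be
treated as assumptions here:
=====
if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
	/* The host reports the TSC frequency; skip refined calibration. */
	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
}
=====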
The unsynchronized_tsc() function eventually checks num_possible_cpus(), and if the
system is non-Intel and the number of possible CPUs is greater than one,
assumes that TSCs are unsynchronized. This despite the comment saying
"assume multi socket systems are not synchronized", that is, socket rather
than CPU. This behavior was preserved by commit 8fbbc4b45c ("x86: merge
tsc_init and clocksource code") and by the previous relevant commit
7e69f2b1ea ("clocksource: Remove the update callback").
The clocksource drivers were added by commit 5d0cf410e9 ("Time: i386
Clocksource Drivers") back in 2006, and the comment still said "socket"
rather than "CPU".
Therefore, bravely (and perhaps foolishly) make the code match the
comment.
Note that it is possible to bypass both code and comment by booting
with tsc=reliable, but this also disables the clocksource watchdog,
which is undesirable when trust in the TSC is strictly limited.
Reported-by: Zhengxu Chen <zhxchen17@meta.com>
Reported-by: Danielle Costantino <dcostantino@meta.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802154618.4149953-5-paulmck@kernel.org
According to the data sheet, writing the MODE register should stop the
counter (and thus the interrupts). This appears to work on real hardware,
at least modern Intel and AMD systems. It should also work on Hyper-V.
However, on some buggy virtual machines the mode change doesn't have any
effect until the counter is subsequently loaded (or perhaps when the IRQ
next fires).
So, set MODE 0 and then load the counter, to ensure that those buggy VMs
do the right thing and the interrupts stop. And then write MODE 0 *again*
to stop the counter on compliant implementations too.
Apparently, Hyper-V keeps firing the IRQ *repeatedly* even in mode zero
when it should only happen once, but the second MODE write stops that too.
Userspace test program (mostly written by tglx):
=====
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/io.h>
#define BUILDIO(bwl, bw, type) \
static __always_inline void __out##bwl(type value, uint16_t port) \
{ \
asm volatile("out" #bwl " %" #bw "0, %w1" \
: : "a"(value), "Nd"(port)); \
} \
\
static __always_inline type __in##bwl(uint16_t port) \
{ \
type value; \
asm volatile("in" #bwl " %w1, %" #bw "0" \
: "=a"(value) : "Nd"(port)); \
return value; \
}
BUILDIO(b, b, uint8_t)
#define inb __inb
#define outb __outb
#define PIT_MODE 0x43
#define PIT_CH0 0x40
#define PIT_CH2 0x42
static int is8254;
static void dump_pit(void)
{
if (is8254) {
// Latch and output counter and status
outb(0xC2, PIT_MODE);
printf("%02x %02x %02x\n", inb(PIT_CH0), inb(PIT_CH0), inb(PIT_CH0));
} else {
// Latch and output counter
outb(0x0, PIT_MODE);
printf("%02x %02x\n", inb(PIT_CH0), inb(PIT_CH0));
}
}
int main(int argc, char* argv[])
{
int nr_counts = 2;
if (argc > 1)
nr_counts = atoi(argv[1]);
if (argc > 2)
is8254 = 1;
if (ioperm(0x40, 4, 1) != 0)
return 1;
dump_pit();
printf("Set oneshot\n");
outb(0x38, PIT_MODE);
outb(0x00, PIT_CH0);
outb(0x0F, PIT_CH0);
dump_pit();
usleep(1000);
dump_pit();
printf("Set periodic\n");
outb(0x34, PIT_MODE);
outb(0x00, PIT_CH0);
outb(0x0F, PIT_CH0);
dump_pit();
usleep(1000);
dump_pit();
dump_pit();
usleep(100000);
dump_pit();
usleep(100000);
dump_pit();
printf("Set stop (%d counter writes)\n", nr_counts);
outb(0x30, PIT_MODE);
while (nr_counts--)
outb(0xFF, PIT_CH0);
dump_pit();
usleep(100000);
dump_pit();
usleep(100000);
dump_pit();
printf("Set MODE 0\n");
outb(0x30, PIT_MODE);
dump_pit();
usleep(100000);
dump_pit();
usleep(100000);
dump_pit();
return 0;
}
=====
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhkelley@outlook.com>
Link: https://lore.kernel.org/all/20240802135555.564941-2-dwmw2@infradead.org
Leaving the PIT interrupt running can cause noticeable steal time for
virtual guests. The VMM generally has a timer which toggles the IRQ input
to the PIC and I/O APIC, which takes CPU time away from the guest. Even
on real hardware, running the counter may use power needlessly (albeit
not much).
Make sure it's turned off if it isn't going to be used.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhkelley@outlook.com>
Link: https://lore.kernel.org/all/20240802135555.564941-1-dwmw2@infradead.org
A process can disable access to the alternate signal stack by not
enabling the altstack's PKEY in the PKRU register.
Nevertheless, the kernel updates the PKRU temporarily for signal
handling. However, in sigreturn(), restore_sigcontext() will restore the
PKRU to the user-defined PKRU value.
This will cause restore_altstack() to fail with a SIGSEGV as it needs read
access to the altstack which is prohibited by the user-defined PKRU value.
Fix this by restoring altstack before restoring PKRU.
Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802061318.2140081-5-aruna.ramakrishna@oracle.com
If the alternate signal stack is protected by a different PKEY than the
current execution stack, copying XSAVE data to the sigaltstack will fail
if its PKEY is not enabled in the PKRU register.
It's unknown which pkey was used by the application for the altstack, so
enable all PKEYS before XSAVE.
But this updated PKRU value is also pushed onto the sigframe, which
means the register value restored from sigcontext will be different from
the user-defined one, which is incorrect.
Fix that by overwriting the PKRU value on the sigframe with the original,
user-defined PKRU.
Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802061318.2140081-4-aruna.ramakrishna@oracle.com
In the case where a user thread sets up an alternate signal stack protected
by the default PKEY (i.e. PKEY 0), while the thread's stack is protected by
a non-zero PKEY, both these PKEYS have to be enabled in the PKRU register
for the signal to be delivered to the application correctly. However, the
PKRU value restored after handling the signal must not enable this extra
PKEY (i.e. PKEY 0) - i.e., the PKRU value in the sigframe has to be
overwritten with the user-defined value.
Add helper functions that will update PKRU value in the sigframe after
XSAVE.
Note that sig_prepare_pkru() makes no assumption about which PKEY could
be used to protect the altstack (i.e. it may not be part of init_pkru),
and so enables all PKEYS.
No functional change.
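A minimal sketch of what such a helper could look like, assuming the
rdpkru()/wrpkru() accessors; the body is illustrative, not the exact
kernel code:
```
/*
 * Enable access for all pkeys (PKRU == 0 means no access/write
 * restrictions) so that XSAVE can write to an altstack protected by
 * any pkey, and hand back the old value so it can be written into
 * the sigframe afterwards.
 */
static inline u32 sig_prepare_pkru(void)
{
	u32 orig_pkru = rdpkru();

	wrpkru(0);
	return orig_pkru;
}
```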
Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802061318.2140081-3-aruna.ramakrishna@oracle.com
Assume there's a multithreaded application that runs untrusted user
code. Each thread has its stack/code protected by a non-zero PKEY, and the
PKRU register is set up such that only that particular non-zero PKEY is
enabled. Each thread also sets up an alternate signal stack to handle
signals, which is protected by PKEY zero. The PKEYs man page documents that
the PKRU will be reset to init_pkru when the signal handler is invoked,
which means that PKEY zero access will be enabled. But this reset happens
after the kernel attempts to push fpu state to the alternate stack, which
is not (yet) accessible by the kernel, which leads to a new SIGSEGV being
sent to the application, terminating it.
Enabling both the non-zero PKEY (for the thread) and PKEY zero in
userspace will not work for this use case. It cannot have the alt stack
writeable by all - the rationale here is that the code running in that
thread (using a non-zero PKEY) is untrusted and should not have access
to the alternate signal stack (that uses PKEY zero), to prevent the
return address of a function from being changed. The expectation is that
kernel should be able to set up the alternate signal stack and deliver
the signal to the application even if PKEY zero is explicitly disabled
by the application. The signal handler accessibility should not be
dictated by whatever PKRU value the thread sets up.
The PKRU register is managed by XSAVE, which means the sigframe contents
must match the register contents - which is not the case here. It's
required that the signal frame contains the user-defined PKRU value (so
that it is restored correctly from sigcontext) but the actual register must
be reset to init_pkru so that the alt stack is accessible and the signal
can be delivered to the application. It seems that the proper fix here
would be to remove PKRU from the XSAVE framework and manage it separately,
which is quite complicated. As a workaround, do this:
  orig_pkru = rdpkru();
  wrpkru(orig_pkru & init_pkru_value);
  xsave_to_user_sigframe();
  put_user(orig_pkru, pkru_sigframe_addr);
In preparation for writing PKRU to sigframe, pass PKRU as an additional
parameter down the call chain from get_sigframe().
No functional change.
Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802061318.2140081-2-aruna.ramakrishna@oracle.com
Current AMD systems can report MCA errors using the ACPI Boot Error
Record Table (BERT). The BERT entries for MCA errors will be an x86
Common Platform Error Record (CPER) with an MSR register context that
matches the MCAX/SMCA register space.
However, the BERT will not necessarily be processed on the CPU that
reported the MCA errors. Therefore, the correct CPU number needs to be
determined and the information saved in struct mce.
Use the newly defined mce_prep_record_*() helpers to get the correct
data.
Also, add an explicit check to verify that a valid CPU number was found
from the APIC ID search.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20240730182958.4117158-4-yazen.ghannam@amd.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Generally, MCA information for an error is gathered on the CPU that
reported the error. In this case, CPU-specific information from the
running CPU will be correct.
However, this will be incorrect if the MCA information is gathered while
running on a CPU that didn't report the error. One example is creating
an MCA record using mce_prep_record() for errors reported from ACPI.
Split mce_prep_record() so that there is a helper function to gather
common, i.e. not CPU-specific, information and another helper for
CPU-specific information.
Leave mce_prep_record() defined as-is for the common case when running
on the reporting CPU.
Get MCG_CAP in the global helper even though the register is per-CPU.
This value is not already cached per-CPU like other values. And it does
not assist with any per-CPU decoding or handling.
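A rough sketch of the resulting split; the helper names follow the
commit text, but the field assignments are illustrative and incomplete:
```
/* Gather common, non-CPU-specific information (incl. MCG_CAP, see above). */
static void mce_prep_record_common(struct mce *m)
{
	memset(m, 0, sizeof(struct mce));

	m->cpuid	= cpuid_eax(1);
	m->cpuvendor	= boot_cpu_data.x86_vendor;
	m->mcgcap	= __rdmsr(MSR_IA32_MCG_CAP);
	m->time		= __ktime_get_real_seconds();
}

/* Gather information specific to the CPU which reported the error. */
static void mce_prep_record_per_cpu(unsigned int cpu, struct mce *m)
{
	m->extcpu	= cpu;
	m->apicid	= cpu_data(cpu).topo.initial_apicid;
	m->microcode	= cpu_data(cpu).microcode;
	m->socketid	= topology_physical_package_id(cpu);
}
```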
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20240730182958.4117158-3-yazen.ghannam@amd.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
There is no MCE "setup" done in mce_setup(). Rather, this function initializes
and prepares an MCE record.
Rename the function to highlight what it does.
No functional change is intended.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20240730182958.4117158-2-yazen.ghannam@amd.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Commit in Fixes was added as a catch-all for cases where the cmdline is
parsed before being merged with the builtin one.
And promptly one issue appeared, see Link below. The microcode loader
really needs to parse it that early, but the merging happens later.
Reshuffling the early boot nightmare^W code to handle that properly would
be a painful exercise for another day so do the chicken thing and parse the
builtin cmdline too before it has been merged.
Fixes: 0c40b1c7a8 ("x86/setup: Warn when option parsing is done too early")
Reported-by: Mike Lothian <mike@fireburn.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240730152108.GAZqkE5Dfi9AuKllRw@fat_crate.local
Link: https://lore.kernel.org/r/20240722152330.GCZp55ck8E_FT4kPnC@fat_crate.local
Commit b50db7095f ("x86/tsc: Disable clocksource watchdog for TSC on
qualified platorms") was introduced to solve a problem where the TSC
clocksource is sometimes wrongly judged as unstable by watchdogs like
'jiffies', HPET, etc.
In it, the hardware package number is a key factor in deciding whether to
disable the watchdog for TSC, and 'nr_online_nodes' was chosen because, at
that time (kernel v5.1x), it was available in the early boot phase before
registering the 'tsc-early' clocksource, when the non-boot CPUs had not
been brought up yet.
Dave and Rui pointed out there are many cases in which 'nr_online_nodes'
is misleading and inaccurate, like:
* SNC (sub-numa cluster) mode enabled
* numa emulation (numa=fake=8 etc.)
* numa=off
* platforms with CPU-less HBM nodes, CPU-less Optane memory nodes.
* 'maxcpus=' cmdline setup, where chopped CPUs could be onlined later
* 'nr_cpus=', 'possible_cpus=' cmdline setup, where chopped CPUs can
not be onlined after boot
The SNC case is the most user-visible one, as many CSPs (Cloud Service
Providers) enable this feature in their server fleets. With SNC3 enabled,
a 2-socket machine appears to have 6 NUMA nodes and is impacted by the
issue in practice.
Thomas' recent patchset refactoring the x86 topology code greatly improves
topology_max_packages() by making it more accurate and available in the
early boot phase, which works well in most of the above cases.
The only exceptions are the 'nr_cpus=' and 'possible_cpus=' setups, which
may under-estimate the package number: during topology setup, the boot CPU
iterates through all enumerated APIC IDs and either accepts or rejects each
one. For accepted IDs, it figures out which bits of the ID map to the
package number and tracks which package numbers have been seen in a
bitmap. topology_max_packages() just returns the number of bits set in
that bitmap.
'nr_cpus=' and 'possible_cpus=' can cause more APIC IDs to be rejected and
can artificially lower the number of bits in the package bitmap and thus
topology_max_packages(). This means that, for example, a system with 8
physical packages might reject all the CPUs on 6 of those packages and be
left with only 2 packages and 2 bits set in the package bitmap. It needs
the TSC watchdog, but would disable it anyway. This isn't ideal, but it
only happens for debug-oriented options. This is fixable by tracking the
package numbers for rejected CPUs. But it's not worth the trouble for
debugging.
So use topology_max_packages() to replace nr_online_nodes().
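Schematically, the qualification check then becomes something like the
following; the surrounding feature checks are illustrative:
```
	/* Only drop the watchdog for TSC on small, qualified systems. */
	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
	    topology_max_packages() <= 2)
		tsc_disable_clocksource_watchdog();
```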
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/all/20240729021202.180955-1-feng.tang@intel.com
Closes: https://lore.kernel.org/lkml/a4860054-0f16-6513-f121-501048431086@intel.com/
Add some new Zen5 models for the 0x1A family.
[ bp: Merge the 0x60 and 0x70 ranges. ]
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240729064626.24297-1-bp@kernel.org
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some
mitigations have Kconfig entries and can be modified, while other
mitigations have no Kconfig entries and cannot be controlled at build
time.
Create a new kernel config that allows GDS to be completely disabled,
similarly to the "gather_data_sampling=off" or "mitigations=off" kernel
command-line options.
Now, there are two options for GDS mitigation:
* CONFIG_MITIGATION_GDS=n -> Mitigation disabled (New)
* CONFIG_MITIGATION_GDS=y -> Mitigation enabled (GDS_MITIGATION_FULL)
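On the C side, such a build-time default is typically consumed along
these lines (a sketch; the enum and variable follow the usual bugs.c
pattern and are assumptions here):
```
static enum gds_mitigations gds_mitigation __ro_after_init =
	IS_ENABLED(CONFIG_MITIGATION_GDS) ? GDS_MITIGATION_FULL
					  : GDS_MITIGATION_OFF;
```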
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-12-leitao@debian.org
Remove the MITIGATION_GDS_FORCE Kconfig option, which aggressively disables
AVX as a mitigation for Gather Data Sampling (GDS) vulnerabilities. This
option is not widely used by distros.
While removing the Kconfig option, retain the runtime configuration ability
through the `gather_data_sampling=force` kernel parameter. This allows users
to still enable this aggressive mitigation if needed, without baking it into
the kernel configuration.
Simplify the kernel configuration while maintaining flexibility for runtime
mitigation choices.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Link: https://lore.kernel.org/r/20240729164105.554296-11-leitao@debian.org
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some
mitigations have Kconfig entries and can be modified, while other
mitigations have no Kconfig entries and cannot be controlled at build
time.
Create an entry for the SSB CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable
it at compilation time.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-10-leitao@debian.org
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some
mitigations have Kconfig entries and can be modified, while other
mitigations have no Kconfig entries and cannot be controlled at build
time.
Create an entry for the Spectre V2 CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable
it at compilation time.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-9-leitao@debian.org
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some
mitigations have Kconfig entries and can be modified, while other
mitigations have no Kconfig entries and cannot be controlled at build
time.
Create an entry for the SRBDS CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable
it at compilation time.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-8-leitao@debian.org
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some
mitigations have Kconfig entries and can be modified, while other
mitigations have no Kconfig entries and cannot be controlled at build
time.
Create an entry for the Spectre v1 CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable
it at compilation time.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-7-leitao@debian.org
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some
mitigations have Kconfig entries and can be modified, while other
mitigations have no Kconfig entries and cannot be controlled at build
time.
Create an entry for the RETBLEED CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable
it at compilation time.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-6-leitao@debian.org
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some
mitigations have Kconfig entries and can be modified, while other
mitigations have no Kconfig entries and cannot be controlled at build
time.
Create an entry for the L1TF CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable
it at compilation time.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-5-leitao@debian.org
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some
mitigations have Kconfig entries and can be modified, while other
mitigations have no Kconfig entries and cannot be controlled at build
time.
Create an entry for the MMIO Stale Data CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable
it at compilation time.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-4-leitao@debian.org
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some
mitigations have Kconfig entries and can be modified, while other
mitigations have no Kconfig entries and cannot be controlled at build
time.
Create an entry for the TAA CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable
it at compilation time.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-3-leitao@debian.org
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some
mitigations have Kconfig entries and can be modified, while other
mitigations have no Kconfig entries and cannot be controlled at build
time.
Create an entry for the MDS CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable
it at compilation time.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-2-leitao@debian.org
Initialize equiv_id in order to shut up:
arch/x86/kernel/cpu/microcode/amd.c:714:6: warning: variable 'equiv_id' is \
used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
if (x86_family(bsp_cpuid_1_eax) < 0x17) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
because clang doesn't do interprocedural analysis for warnings to see
that this variable won't be used uninitialized.
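The fix itself is just a zero-initialization at declaration time, along
these lines (the variable's type is whatever the driver already uses):
```
	u32 equiv_id = 0;	/* silence clang's -Wsometimes-uninitialized */
```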
Fixes: 94838d230a ("x86/microcode/AMD: Use the family,model,stepping encoded in the patch ID")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407291815.gJBST0P3-lkp@intel.com/
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
The patch in Fixes results in a call to init_freq_invariance_cppc() in a
CPU hotplug handler, both in the path for initially present CPUs and for
those hotplugged later. That function includes a one-time call to
amd_set_max_freq_ratio(), which in turn calls freq_invariance_enable(),
which has a static_branch_enable() that takes the cpu_hotplug_lock, which
is already held.
Avoid the deadlock by using static_branch_enable_cpuslocked(), as the lock
will always already be held. The equivalent path on Intel does not already
hold this lock, so take it around the call to freq_invariance_enable(),
which results in it being held over the call to register_syscall_ops, which
looks to be safe to do.
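In sketch form, the two paths described above; the static key name is an
assumption here, while the helper names follow the commit text:
```
	/* AMD: called from a hotplug handler, cpu_hotplug_lock already held: */
	static_branch_enable_cpuslocked(&arch_scale_freq_key);

	/* Intel: the lock is not held here, so take it around the enable: */
	cpus_read_lock();
	freq_invariance_enable();
	cpus_read_unlock();
```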
Fixes: c1385c1f0b ("ACPI: processor: Simplify initial onlining to use same path for cold and hotplug")
Closes: https://lore.kernel.org/all/CABXGCsPvqBfL5hQDOARwfqasLRJ_eNPBbCngZ257HOe=xbWDkA@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240729105504.2170-1-Jonathan.Cameron@huawei.com
Add a new .note section containing type, size, offset and flags of every
xfeature that is present.
This information will be used by debuggers to understand the XSAVE layout of
the machine where the core file has been dumped, and to read XSAVE registers,
especially during cross-platform debugging.
The XSAVE layouts of modern AMD and Intel CPUs differ, especially since
Memory Protection Keys and the AVX-512 features have been incorporated
into AMD CPUs.
Since AMD never adopted (and hence never left room in the XSAVE layout for)
the Intel MPX feature, tools like GDB had assumed a fixed XSAVE layout
matching that of Intel (based on the XCR0 mask).
Hence, core dumps from AMD CPUs didn't match the known size for the XCR0 mask.
This resulted in GDB and other tools not being able to access the values of
the AVX-512 and PKRU registers on AMD CPUs.
To solve this, an interim solution has been accepted into GDB, and is already
a part of GDB 14, see
https://sourceware.org/pipermail/gdb-patches/2023-March/198081.html.
But it depends on heuristics based on the total XSAVE register set size
and the XCR0 mask to infer the layouts of the various register blocks
for core dumps, and hence, is not a foolproof mechanism to determine the
layout of the XSAVE area.
Therefore, add a new core dump note in order to allow GDB/LLDB and other
relevant tools to determine the layout of the XSAVE area of the machine where
the corefile was dumped.
The new core dump note, NT_X86_XSAVE_LAYOUT (0x205), which is being
proposed as a per-process .note section, contains an array of structures.
Each structure describes an individual extended feature containing
offset, size and flags in this format:
struct x86_xfeat_component {
	u32 type;
	u32 size;
	u32 offset;
	u32 flags;
};
It does so in an architecture-independent manner, allowing for future
extensions without depending on hw arch specifics like CPUID etc.
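A debugger-side sketch of walking such a note's payload, assuming buf and
len hold the raw note descriptor contents and the struct above is visible
with fixed-width types (purely illustrative):
```
	struct x86_xfeat_component *comp = (struct x86_xfeat_component *)buf;
	size_t i, n = len / sizeof(*comp);

	for (i = 0; i < n; i++)
		printf("xfeature %u: offset %u size %u flags %#x\n",
		       comp[i].type, comp[i].offset, comp[i].size,
		       comp[i].flags);
```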
[ bp: Massage commit message, zap trailing whitespace. ]
Co-developed-by: Jini Susan George <jinisusan.george@amd.com>
Signed-off-by: Jini Susan George <jinisusan.george@amd.com>
Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Vignesh Balasubramanian <vigbalas@amd.com>
Link: https://lore.kernel.org/r/20240725161017.112111-2-vigbalas@amd.com
On Zen and newer, the family, model and stepping are part of the
microcode patch ID, so the equivalence table the driver has been using
is not needed anymore.
Switch the driver to use that from now on.
The equivalence table in the microcode blob should still remain in case
there's a need to pass some additional information to the kernel loader.
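For illustration, the patch ID can be overlaid with a bitfield along
these lines; the exact layout below is an assumption, not a quote from
the driver:
```
union zen_patch_rev {
	struct {
		u32 rev		: 8,	/* assumed patch-ID layout */
		    stepping	: 4,
		    model	: 4,
		    __reserved	: 4,
		    ext_model	: 4,
		    ext_fam	: 8;
	};
	u32 ucode_rev;
};
```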
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240725112037.GBZqI1BbUk1KMlOJ_D@fat_crate.local
Add new PCI device IDs into the root IDs and miscellaneous IDs lists to
provide support for the latest generation of AMD 1Ah family 60h processor
models.
Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20240722092801.3480266-1-Shyam-sundar.S-k@amd.com
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata, which will ensure that proc_handler function
pointers cannot be modified.
This patch has been generated by the following coccinelle script:
```
virtual patch
@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos);
@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }
@r3@
identifier func;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int , void *, size_t *, loff_t *);
@r4@
identifier func, ctl;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int , void *, size_t *, loff_t *);
@r5@
identifier func, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code
conventions. The xfs_stats_clear_proc_handler,
xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax were
adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified.
This is called from a proc_handler itself and is calling back into
another proc_handler, making it necessary to change it as part of the
proc_handler migration.
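The net effect on an individual handler is just the const qualifier on
the first parameter, e.g. (hypothetical handler name):
```
-static int example_handler(struct ctl_table *ctl, int write,
+static int example_handler(const struct ctl_table *ctl, int write,
 			   void *buffer, size_t *lenp, loff_t *ppos)
```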
Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
Uprobes:
- x86/shstk: Make return uprobe work with shadow stack.
- Add a uretprobe syscall which makes the uretprobe 10-30% faster. This
syscall is automatically used from user-space trampolines which are
generated by the uretprobe. If this syscall is used by a normal
user program, it will cause SIGILL. Note that this is currently only
implemented on x86_64.
(This also has 2 fixes for adjusting the syscall number to avoid conflict
with new *attrat syscalls.)
- uprobes/perf: fix user stack traces in the presence of pending
uretprobes. This replaces the uretprobe trampoline address in the
stacktrace with the correct return address.
- selftests/x86: Add a return uprobe with shadow stack test.
- selftests/bpf: Add uretprobe syscall related tests.
. test case for register integrity check.
. test case with register changing case.
. test case for uretprobe syscall without uprobes (expected to fail).
. test case for uretprobe with shadow stack.
- selftests/bpf: add test validating uprobe/uretprobe stack traces
- MAINTAINERS: Add uprobes entry. This does not specify the tree, but
clarifies who maintains and reviews uprobes.
Kprobes:
- tracing/kprobes: Test case cleanups. Replace redundant WARN_ON_ONCE() +
pr_warn() with WARN_ONCE() and remove unnecessary code from selftest.
- tracing/kprobes: Add symbol counting check when module loads. This
checks the uniqueness of the probed symbol on modules. The same check
has already been done for kernel symbols.
(This also has a fix for build error with CONFIG_MODULES=n)
Cleanup:
- Add MODULE_DESCRIPTION() macros for fprobe and kprobe examples.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmaWYxwbHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bsUgH/3JcSzDZujQWCZ1f4fJn
QecvTFSYcCl6ck8+/3wm4EsgeCXIFOyPnoPc7k2Gm+l6Dlk1DKGV6wV4tuKFUq9X
9mplcwoVA0Ln+EX9zv9v4s99yUGxcU9xjgC9XT7J52SvqYncPIi6dR0Z9wlJBmyd
Bx3cZk+wSzCYaoqYngI2fKlzsEcYgDIP999fQPRi0HGzNZujc4xeJyjCTC/48yWO
9kreRQq6wFdgRQTwMcR/fKPDKIGZQCU8jkXv5crVV5K3rNaBcwBmCJJMP8PzPU0V
UQ0+8RZK+Qk8SBwXcMNVRqm/efTderob4IYxP8OBe5wjAIE7+vu8r6sqwxRIS54M
Cyg=
=DRSr
-----END PGP SIGNATURE-----
Merge tag 'probes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes updates from Masami Hiramatsu:
"Uprobes:
- x86/shstk: Make return uprobe work with shadow stack
- Add a uretprobe syscall which makes the uretprobe 10-30% faster.
This syscall is automatically used from user-space trampolines
which are generated by the uretprobe. If this syscall is used by a
normal user program, it will cause SIGILL. Note that this is
currently only implemented on x86_64.
(This also has two fixes for adjusting the syscall number to avoid
conflict with new *attrat syscalls.)
- uprobes/perf: fix user stack traces in the presence of pending
uretprobes. This replaces the uretprobe trampoline address in the
stacktrace with the correct return address
- selftests/x86: Add a return uprobe with shadow stack test
- selftests/bpf: Add uretprobe syscall related tests.
- test case for register integrity check
- test case with register changing case
- test case for uretprobe syscall without uprobes (expected to fail)
- test case for uretprobe with shadow stack
- selftests/bpf: add test validating uprobe/uretprobe stack traces
- MAINTAINERS: Add uprobes entry. This does not specify the tree, but
clarifies who maintains and reviews uprobes
Kprobes:
- tracing/kprobes: Test case cleanups.
Replace redundant WARN_ON_ONCE() + pr_warn() with WARN_ONCE() and
remove unnecessary code from selftest
- tracing/kprobes: Add symbol counting check when module loads.
This checks the uniqueness of the probed symbol on modules. The
same check has already been done for kernel symbols
(This also has a fix for build error with CONFIG_MODULES=n)
Cleanup:
- Add MODULE_DESCRIPTION() macros for fprobe and kprobe examples"
* tag 'probes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
MAINTAINERS: Add uprobes entry
selftests/bpf: Change uretprobe syscall number in uprobe_syscall test
uprobe: Change uretprobe syscall scope and number
tracing/kprobes: Fix build error when find_module() is not available
tracing/kprobes: Add symbol counting check when module loads
selftests/bpf: add test validating uprobe/uretprobe stack traces
perf,uprobes: fix user stack traces in the presence of pending uretprobes
tracing/kprobe: Remove cleanup code unrelated to selftest
tracing/kprobe: Integrate test warnings into WARN_ONCE
selftests/bpf: Add uretprobe shadow stack test
selftests/bpf: Add uretprobe syscall call from user space test
selftests/bpf: Add uretprobe syscall test for regs changes
selftests/bpf: Add uretprobe syscall test for regs integrity
selftests/x86: Add return uprobe shadow stack test
uprobe: Add uretprobe syscall to speed up return probe
uprobe: Wire up uretprobe system call
x86/shstk: Make return uprobe work with shadow stack
samples: kprobes: add missing MODULE_DESCRIPTION() macros
fprobe: add missing MODULE_DESCRIPTION() macro
core:
- deprecate DRM data and return 0 date
- connector: Create a set of helpers to help with HDMI support
- Remove driver owner assignments
- Allow more drivers to compile with COMPILE_TEST
- Conversions to drm_edid
- Sprinkle MODULE_DESCRIPTIONS everywhere they are missing
- Remove drm_mm_replace_node
- print: Add a drm prefix to warn level messages too, remove
___drm_dbg, consolidate prefix handling
- New monochrome TV mode variant
ttm:
- improve number of page faults on some platforms
- fix test builds under PREEMPT_RT
- more test coverage
ci:
- Require a more recent version of mesa,
- improve farm setup and test generation
dma-buf:
- warn if reserving 0 fence slots
- internal API heap enhancements
fbdev:
- Create memory manager optimized fbdev emulation
panic:
- Allow to select fonts,
- improve drm_fb_dma_get_scanout_buffer
- Allow to dump kmsg to the screen
bridge:
- Remove redundant checks on bridge->encoder
- Remove drm_bridge_chain_mode_fixup
- bridge-connector: Plumb in the new HDMI helper
- analogix_dp: Various improvements, handle AUX transfers timeout
- samsung-dsim: Fix timings calculation
- tc358767: Plenty of small fixes, fix no connector attach, fix clocks
- sii902x: state validation improvements
panels:
- Switch panels from register table initialization to proper code
- Now that the panel code tracks the panel state, remove every
ad-hoc implementation in the panel drivers
- More cleanup of prepare / enable state tracking in drivers
- edp: Drop legacy panel compatibles
- simple-bridge: Switch to devm_drm_bridge_add
- New panels: Lincoln Tech Sol LCD185-101CT, Microtips Technology
13-101HIEBCAF0-C, Microtips Technology MF-103HIEB0GA0, BOE
nv110wum-l60, IVO t109nw41, WL-355608-A8, PrimeView PM070WL4,
Lincoln Technologies LCD197, Ortustech COM35H3P70ULC,
AUO G104STN01, K&d kd101ne3-40ti
amdgpu:
- DCN 4.0.x support
- GC 12.0 support
- GMC 12.0 support
- SDMA 7.0 support
- MES12 support
- MMHUB 4.1 support
- GFX12 modifier and DCC support
- lots of IP fixes/updates
amdkfd:
- Contiguous VRAM allocations
- GC 12.0 support
- SDMA 7.0 support
- SR-IOV fixes
- KFD GFX ALU exceptions
i915:
- Battlemage Xe2 HPD display enablement
- Panel Replay enabling
- DP AUX-less ALPM/LOBF
- Enable link training failure fallback for DP MST links
- CMRR (Content Match Refresh Rate) enabling
- Increase ADL-S/ADL-P/DG2+ max TMDS bitrate to 6 Gbps
- Enable eDP AUX based HDR backlight
- Support replaying GPU hangs with captured context image
- Automate CCS Mode setting during engine resets
- lots of refactoring
- Increase FLR timeout from 3s to 9s
- Enable w/a 16021333562 for DG2, MTL and ARL [guc]
xe:
- update MAINTAINERS
- New uapi adding OA functionality to Xe
- expose l3 bank mask
- fix display detect on ADL-N
- runtime PM Fixes
- Fix silent backmerge issues
- More prep for SR-IOV
- HWmon additions
- per client usage info
- Rework GPU page fault handling
- Drop EXEC_QUEUE_FLAG_BANNED
- Add BMG PCI IDs
- Scheduler fixes and improvements
- Rename xe_exec_queue::compute to xe_exec_queue::lr
- Use ttm_uncached for BO with NEEDS_UC flag
- Rename xe perf layer as xe observation layer
- lots of refactoring
radeon:
- Backlight workaround for iMac
- Silence UBSAN flex array warnings
msm:
- Validate registers XML description against schema in CI
- core/dpu: SM7150 support
- mdp5: Add support for MSM8937
- gpu: Add param for userspace to know if raytracing is supported
- gpu: X185 support (aka gpu in X1 laptop chips)
- gpu: a505 support
ivpu:
- hardware scheduler support
- profiling support
- improvements to the platform support layer
- firmware handling improvements
- clocks/power mgmt improvements
- scheduler/logging improvements
habanalabs:
- Gradual sleep in polling memory macro.
- Reduce Gaudi2 MSI-X interrupt count to 128.
- Add Gaudi2-D revision support.
- Add timestamp to CPLD info.
- Gaudi2: Assume hard-reset by firmware upon MC SEI severe error.
- Align Gaudi2 interrupt names.
- Check for errors after preboot is ready.
- Change habanalabs maintainer and git repo path.
mgag200:
- refactoring and improvements
- Add BMC output
- enable polling
nouveau:
- add registry command line
v3d:
- perf counters improvements
zynqmp:
- irq and debugfs improvements
atmel-hlcdc:
- Support XLCDC in sam9x7
mipi-dbi:
- Remove mipi_dbi_machine_little_endian
- make SPI bits per word configurable
- support RGB888
- allow pixel formats to be specified in the DT
sun4i:
- Rework the blender setup for DE2
panfrost:
- Enable MT8188 support
vc4:
- Monochrome TV support
exynos:
- fix fallback mode regression
- fix memory leak
- Use drm_edid_duplicate() instead of kmemdup()
etnaviv:
- fix i.MX8MP NPU clock gating
- workaround FE register cdc issues on some cores
- fix DMA sync handling for cached buffers
- fix job timeout handling
- keep TS enabled on MMUv2 cores for improved performance
mediatek:
- Convert to platform remove callback returning void
- Drop chain_mode_fixup call in mode_valid()
- Fix errors in the MediaTek display driver found by IGT.
- Add display support for the MT8365-EVK board
- Fix bit depth overwritten for mtk_ovl_set bit_depth()
- Fix possible_crtcs calculation
- Fix spurious kfree()
ast:
- refactor mode setting code
stm:
- Add LVDS support
- DSI PHY updates
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmaYqVEACgkQDHTzWXnE
hr5p3Q/+OOxTHKJ/8WMwfV1Tuep5otkCZdBgNdcuu9zqzpEMEDUDwmV1iboIvT9x
qJsDwSAJomwbZAnVjDKsbZuycSHUBV6HQdf+5+rtq6be1EfFRwJVzOq0u5+D3KGt
7f2vy6sM9tw4tR6EikiuP7vCvnSz4iGrWERvEJDEtXECbALhju8sulht8ZMnr6GW
/MfUetULLSDjq0L1x3TWAq2MPGnJ5UxIkIeOBUP6n4etAUX1BPTNA6N76eN/xMvn
a40JhtM+pCjjkHxvloIZ+KTYN3S+hskIRksczPHh9HtNX7y/A437wyhOHJZ1NvZb
yc5ke9GjXxGcxyZH+PY5aCS7O/XElzSSkR1jFZ2s3/MX7PVKgCahGK7+yWjPsiK2
R5oXebdObshUa8LHDE/3WgBUmTchkvKRTXV9cvGqzxEPhC2zrxArvwP5v6B4mhCn
Vqo3Pv0Cyr+n65Z5Dzqz/9+m999LJjFTsTrug0p5b/qBJQKu2rQONe4lpZ0NFwwY
ExyjdxILj7mqrQpKcA6V5Bel5ZCnlVsGfTshFL6Iux54VFlJyRMzKWZ+Gdv4av5k
dbjz+re+CojKabn3ML/7pAQujK6Rqe58vPuHV78zkvAGJnQgJOOTrmYNYtn3oBqe
ogdCN+/PREb/9U7i6mQv5hhdHs4tT9ROXaT9jyb8XSHXW+t9lBM=
=g+Ad
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2024-07-18' of https://gitlab.freedesktop.org/drm/kernel
Pull drm updates from Dave Airlie:
"There's a lot of stuff in here, amd, i915 and xe have new platform
work, lots of core rework around EDID handling, some new COMPILE_TEST
options, maintainer changes and a lots of other stuff. Summary:
core:
- deprecate DRM data and return 0 date
- connector: Create a set of helpers to help with HDMI support
- Remove driver owner assignments
- Allow more drivers to compile with COMPILE_TEST
- Conversions to drm_edid
- Sprinkle MODULE_DESCRIPTIONS everywhere they are missing
- Remove drm_mm_replace_node
- print: Add a drm prefix to warn level messages too, remove
___drm_dbg, consolidate prefix handling
- New monochrome TV mode variant
ttm:
- improve number of page faults on some platforms
- fix test builds under PREEMPT_RT
- more test coverage
ci:
- Require a more recent version of mesa
- improve farm setup and test generation
dma-buf:
- warn if reserving 0 fence slots
- internal API heap enhancements
fbdev:
- Create memory manager optimized fbdev emulation
panic:
- Allow to select fonts
- improve drm_fb_dma_get_scanout_buffer
- Allow to dump kmsg to the screen
bridge:
- Remove redundant checks on bridge->encoder
- Remove drm_bridge_chain_mode_fixup
- bridge-connector: Plumb in the new HDMI helper
- analogix_dp: Various improvements, handle AUX transfers timeout
- samsung-dsim: Fix timings calculation
- tc358767: Plenty of small fixes, fix no connector attach, fix
clocks
- sii902x: state validation improvements
panels:
- Switch panels from register table initialization to proper code
- Now that the panel code tracks the panel state, remove every ad-hoc
implementation in the panel drivers
- More cleanup of prepare / enable state tracking in drivers
- edp: Drop legacy panel compatibles
- simple-bridge: Switch to devm_drm_bridge_add
- New panels: Lincoln Tech Sol LCD185-101CT, Microtips Technology
13-101HIEBCAF0-C, Microtips Technology MF-103HIEB0GA0,
BOE nv110wum-l60, IVO t109nw41, WL-355608-A8, PrimeView
PM070WL4, Lincoln Technologies LCD197, Ortustech
COM35H3P70ULC, AUO G104STN01, K&d kd101ne3-40ti
amdgpu:
- DCN 4.0.x support
- GC 12.0 support
- GMC 12.0 support
- SDMA 7.0 support
- MES12 support
- MMHUB 4.1 support
- GFX12 modifier and DCC support
- lots of IP fixes/updates
amdkfd:
- Contiguous VRAM allocations
- GC 12.0 support
- SDMA 7.0 support
- SR-IOV fixes
- KFD GFX ALU exceptions
i915:
- Battlemage Xe2 HPD display enablement
- Panel Replay enabling
- DP AUX-less ALPM/LOBF
- Enable link training failure fallback for DP MST links
- CMRR (Content Match Refresh Rate) enabling
- Increase ADL-S/ADL-P/DG2+ max TMDS bitrate to 6 Gbps
- Enable eDP AUX based HDR backlight
- Support replaying GPU hangs with captured context image
- Automate CCS Mode setting during engine resets
- lots of refactoring
- Increase FLR timeout from 3s to 9s
- Enable w/a 16021333562 for DG2, MTL and ARL [guc]
xe:
- update MAINTAINERS
- New uapi adding OA functionality to Xe
- expose l3 bank mask
- fix display detect on ADL-N
- runtime PM Fixes
- Fix silent backmerge issues
- More prep for SR-IOV
- HWmon additions
- per client usage info
- Rework GPU page fault handling
- Drop EXEC_QUEUE_FLAG_BANNED
- Add BMG PCI IDs
- Scheduler fixes and improvements
- Rename xe_exec_queue::compute to xe_exec_queue::lr
- Use ttm_uncached for BO with NEEDS_UC flag
- Rename xe perf layer as xe observation layer
- lots of refactoring
radeon:
- Backlight workaround for iMac
- Silence UBSAN flex array warnings
msm:
- Validate registers XML description against schema in CI
- core/dpu: SM7150 support
- mdp5: Add support for MSM8937
- gpu: Add param for userspace to know if raytracing is supported
- gpu: X185 support (aka gpu in X1 laptop chips)
- gpu: a505 support
ivpu:
- hardware scheduler support
- profiling support
- improvements to the platform support layer
- firmware handling improvements
- clocks/power mgmt improvements
- scheduler/logging improvements
habanalabs:
- Gradual sleep in polling memory macro
- Reduce Gaudi2 MSI-X interrupt count to 128
- Add Gaudi2-D revision support
- Add timestamp to CPLD info
- Gaudi2: Assume hard-reset by firmware upon MC SEI severe error
- Align Gaudi2 interrupt names
- Check for errors after preboot is ready
- Change habanalabs maintainer and git repo path
mgag200:
- refactoring and improvements
- Add BMC output
- enable polling
nouveau:
- add registry command line
v3d:
- perf counters improvements
zynqmp:
- irq and debugfs improvements
atmel-hlcdc:
- Support XLCDC in sam9x7
mipi-dbi:
- Remove mipi_dbi_machine_little_endian
- make SPI bits per word configurable
- support RGB888
- allow pixel formats to be specified in the DT
sun4i:
- Rework the blender setup for DE2
panfrost:
- Enable MT8188 support
vc4:
- Monochrome TV support
exynos:
- fix fallback mode regression
- fix memory leak
- Use drm_edid_duplicate() instead of kmemdup()
etnaviv:
- fix i.MX8MP NPU clock gating
- workaround FE register cdc issues on some cores
- fix DMA sync handling for cached buffers
- fix job timeout handling
- keep TS enabled on MMUv2 cores for improved performance
mediatek:
- Convert to platform remove callback returning void
- Drop chain_mode_fixup call in mode_valid()
- Fix errors in the MediaTek display driver found by IGT
- Add display support for the MT8365-EVK board
- Fix bit depth overwritten for mtk_ovl_set bit_depth()
- Fix possible_crtcs calculation
- Fix spurious kfree()
ast:
- refactor mode setting code
stm:
- Add LVDS support
- DSI PHY updates"
* tag 'drm-next-2024-07-18' of https://gitlab.freedesktop.org/drm/kernel: (2501 commits)
drm/amdgpu/mes12: add missing opcode string
drm/amdgpu/mes11: update opcode strings
Revert "drm/amd/display: Reset freesync config before update new state"
drm/omap: Restrict compile testing to PAGE_SIZE less than 64KB
drm/xe: Drop trace_xe_hw_fence_free
drm/xe/uapi: Rename xe perf layer as xe observation layer
drm/amdgpu: remove exp hw support check for gfx12
drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed
drm/amdgpu: flush all cached ras bad pages to eeprom
drm/amdgpu: select compute ME engines dynamically
drm/amd/display: Allow display DCC for DCN401
drm/amdgpu: select compute ME engines dynamically
drm/amdgpu/job: Replace DRM_INFO/ERROR logging
drm/amdgpu: select compute ME engines dynamically
drm/amd/pm: Ignore initial value in smu response register
drm/amdgpu: Initialize VF partition mode
drm/amd/amdgpu: fix SDMA IRQ client ID <-> req mapping
MAINTAINERS: fix Xinhui's name
MAINTAINERS: update powerplay and swsmu
drm/qxl: Pin buffer objects for internal mappings
...
- Add Loongson-3 CPUFreq driver support (Huacai Chen).
- Add support for the Arrow Lake and Lunar Lake platforms and
the out-of-band (OOB) mode on Emerald Rapids to the intel_pstate
cpufreq driver, make it support the highest performance change
interrupt and clean it up (Srinivas Pandruvada).
- Switch cpufreq to new Intel CPU model defines (Tony Luck).
- Simplify the cpufreq driver interface by switching the .exit() driver
callback to the void return data type (Lizhe, Viresh Kumar).
- Make cpufreq_boost_enabled() return bool (Dhruva Gole).
- Add fast CPPC support to the amd-pstate cpufreq driver, address
multiple assorted issues in it and clean it up (Perry Yuan, Mario
Limonciello, Dhananjay Ugwekar, Meng Li, Xiaojian Du).
- Add Allwinner H700 speed bin to the sun50i cpufreq driver (Ryan
Walklin).
- Fix memory leaks and of_node_put() usage in the sun50i and qcom-nvmem
cpufreq drivers (Javier Carrasco).
- Clean up the sti and dt-platdev cpufreq drivers (Jeff Johnson,
Raphael Gallais-Pou).
- Fix deferred probe handling in the TI cpufreq driver and wrong return
values of ti_opp_supply_probe(), and add OPP tables for the AM62Ax and
AM62Px SoCs to it (Bryan Brattlof, Primoz Fiser).
- Avoid overflow of target_freq in .fast_switch() in the SCMI cpufreq
driver (Jagadeesh Kona).
- Use dev_err_probe() in every error path in probe in the Mediatek
cpufreq driver (Nícolas Prado).
- Fix kernel-doc param for longhaul_setstate in the longhaul cpufreq
driver (Yang Li).
- Fix system resume handling in the CPPC cpufreq driver (Riwen Lu).
- Improve the teo cpuidle governor and clean up leftover comments from
the menu cpuidle governor (Christian Loehle).
- Clean up a comment typo in the teo cpuidle governor (Atul Kumar
Pant).
- Add missing MODULE_DESCRIPTION() macro to cpuidle haltpoll (Jeff
Johnson).
- Switch the intel_idle driver to new Intel CPU model defines (Tony
Luck).
- Switch the Intel RAPL driver to new Intel CPU model defines (Tony Luck).
- Simplify if condition in the idle_inject driver (Thorsten Blum).
- Fix missing cleanup on error in _opp_attach_genpd() (Viresh Kumar).
- Introduce an OF helper function to inform if required-opps is used
and drop a redundant in-parameter to _set_opp_level() (Ulf Hansson).
- Update pm-graph to v5.12 which includes fixes and major code revamp
for python3.12 (Todd Brandt).
- Address several assorted issues in the cpupower utility (Roman
Storozhenko).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmaVb+8SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxXIUQALFhNTO+wo8uPWUmsp0SV81Sbf17zM0f
9IDpzJTUZLK0stTdLtxY4khcClPE4MrwS/LjSJlvkEVZChHpUw6vFezHmx0O42Ti
Tmv3ezABSAmx6QVRSpyVhE3Hb0BmXW9V+3dtoefofV0JWenN7mqk4Hbb2Jx1Cvbh
zyerUeWWl97yqVMM2l5owKHSvk7SYO6cfML73XcdXQ6pBfQePfekG87i1+r40l+d
qEzdyh6JjqGbdkvZKtI4zO1Hdai9FdlLWSqYmVZGS5XRN8RVvDaHDIDlSijNXAei
DFPFoBVAvl8CymBXXnzDyJJhCCkEb2aX3xD6WzthoCygZt5W+tqfGxyZfViBfb55
kvpyiWZUVaDyX4Hfz1PLnJ7Xg9kPUKUcDDrsV5vKA7W0Sq2T0RbORsVkaP2nIhlY
4Xspp9nEv+78DG0UjT7jT0Py2Oq9I6BTG+pmMTxcgA7G/U5H2uAvvIM/kwQ+30vi
yUxO3W5o9TQmvJF1klHgp3YsCNWZG3IYacHZzUIoPbPusEbevYrCuUNriT+zlANc
Pv/FMfBfHDmU2lHWyLzuoKhlzQosNi9NajMANBJgd55zACWKzgNzFV4P5gIMd1KR
moJYfosbT2RWetEH8Zrh7xA5dewUphe6tibshElbKJHilnP0iFjYhhdb6aQRcuPd
q/RECFYT7z0r
=imBx
-----END PGP SIGNATURE-----
Merge tag 'pm-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These add a new cpufreq driver for Loongson-3, add support for new
features in the intel_pstate (Lunar Lake and Arrow Lake platforms, OOB
mode for Emerald Rapids, highest performance change interrupt),
amd-pstate (fast CPPC) and sun50i (Allwinner H700 speed bin) cpufreq
drivers, simplify the cpufreq driver interface, simplify the teo
cpuidle governor, adjust the pm-graph utility for a new version of
Python, address issues and clean up code.
Specifics:
- Add Loongson-3 CPUFreq driver support (Huacai Chen)
- Add support for the Arrow Lake and Lunar Lake platforms and the
out-of-band (OOB) mode on Emerald Rapids to the intel_pstate
cpufreq driver, make it support the highest performance change
interrupt and clean it up (Srinivas Pandruvada)
- Switch cpufreq to new Intel CPU model defines (Tony Luck)
- Simplify the cpufreq driver interface by switching the .exit()
driver callback to the void return data type (Lizhe, Viresh Kumar)
- Make cpufreq_boost_enabled() return bool (Dhruva Gole)
- Add fast CPPC support to the amd-pstate cpufreq driver, address
multiple assorted issues in it and clean it up (Perry Yuan, Mario
Limonciello, Dhananjay Ugwekar, Meng Li, Xiaojian Du)
- Add Allwinner H700 speed bin to the sun50i cpufreq driver (Ryan
Walklin)
- Fix memory leaks and of_node_put() usage in the sun50i and
qcom-nvmem cpufreq drivers (Javier Carrasco)
- Clean up the sti and dt-platdev cpufreq drivers (Jeff Johnson,
Raphael Gallais-Pou)
- Fix deferred probe handling in the TI cpufreq driver and wrong
return values of ti_opp_supply_probe(), and add OPP tables for the
AM62Ax and AM62Px SoCs to it (Bryan Brattlof, Primoz Fiser)
- Avoid overflow of target_freq in .fast_switch() in the SCMI cpufreq
driver (Jagadeesh Kona)
- Use dev_err_probe() in every error path in probe in the Mediatek
cpufreq driver (Nícolas Prado)
- Fix kernel-doc param for longhaul_setstate in the longhaul cpufreq
driver (Yang Li)
- Fix system resume handling in the CPPC cpufreq driver (Riwen Lu)
- Improve the teo cpuidle governor and clean up leftover comments
from the menu cpuidle governor (Christian Loehle)
- Clean up a comment typo in the teo cpuidle governor (Atul Kumar
Pant)
- Add missing MODULE_DESCRIPTION() macro to cpuidle haltpoll (Jeff
Johnson)
- Switch the intel_idle driver to new Intel CPU model defines (Tony
Luck)
- Switch the Intel RAPL driver to new Intel CPU model defines (Tony
Luck)
- Simplify if condition in the idle_inject driver (Thorsten Blum)
- Fix missing cleanup on error in _opp_attach_genpd() (Viresh Kumar)
- Introduce an OF helper function to inform if required-opps is used
and drop a redundant in-parameter to _set_opp_level() (Ulf Hansson)
- Update pm-graph to v5.12 which includes fixes and major code revamp
for python3.12 (Todd Brandt)
- Address several assorted issues in the cpupower utility (Roman
Storozhenko)"
* tag 'pm-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (77 commits)
cpufreq: sti: fix build warning
cpufreq: mediatek: Use dev_err_probe in every error path in probe
cpufreq: Add Loongson-3 CPUFreq driver support
cpufreq: Make cpufreq_driver->exit() return void
cpufreq/amd-pstate: Fix the scaling_max_freq setting on shared memory CPPC systems
cpufreq/amd-pstate-ut: Convert nominal_freq to khz during comparisons
cpufreq: pcc: Remove empty exit() callback
cpufreq: loongson2: Remove empty exit() callback
cpufreq: nforce2: Remove empty exit() callback
cpupower: fix lib default installation path
cpufreq: docs: Add missing scaling_available_frequencies description
cpuidle: teo: Don't count non-existent intercepts
cpupower: Disable direct build of the 'bench' subproject
cpuidle: teo: Remove recent intercepts metric
Revert: "cpuidle: teo: Introduce util-awareness"
cpufreq: make cpufreq_boost_enabled() return bool
cpufreq: intel_pstate: Support highest performance change interrupt
x86/cpufeatures: Add HWP highest perf change feature flag
Documentation: cpufreq: amd-pstate: update doc for Per CPU boost control method
cpufreq: amd-pstate: Cap the CPPC.max_perf to nominal_perf if CPB is off
...
- Drop support for the 'fake' EFI memory map on x86
- Add an SMBIOS based tweak to the EFI stub instructing the firmware on
x86 Macbook Pros to keep both GPUs enabled
- Replace 0-sized array with flexible array in EFI memory attributes
table handling
- Drop redundant BSS clearing when booting via the native PE entrypoint
on x86
- Avoid returning EFI_SUCCESS when aborting on an out-of-memory
condition
- Cosmetic tweak for arm64 KASLR loading logic
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCZpTg5gAKCRAwbglWLn0t
XOrOAQCpZjtjkPRPCBY+t3wUl84rOKiPr1SMHyL50Zl8udJKegD/bnwWSgX3FzLQ
TN+xjnK7IAxEoKAEWt8lnt04cH5r3As=
=7VWO
-----END PGP SIGNATURE-----
Merge tag 'efi-next-for-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI updates from Ard Biesheuvel:
"Note the removal of the EFI fake memory map support - this is believed
to be unused and no longer worth supporting. However, we could easily
bring it back if needed.
With recent developments regarding confidential VMs and unaccepted
memory, combined with kexec, creating a known inaccurate view of the
firmware's memory map and handing it to the OS is a feature we can
live without, hence the removal. Alternatively, I could imagine making
this feature mutually exclusive with those confidential VM related
features, but let's try simply removing it first.
Summary:
- Drop support for the 'fake' EFI memory map on x86
- Add an SMBIOS based tweak to the EFI stub instructing the firmware
on x86 Macbook Pros to keep both GPUs enabled
- Replace 0-sized array with flexible array in EFI memory attributes
table handling
- Drop redundant BSS clearing when booting via the native PE
entrypoint on x86
- Avoid returning EFI_SUCCESS when aborting on an out-of-memory
condition
- Cosmetic tweak for arm64 KASLR loading logic"
* tag 'efi-next-for-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: Replace efi_memory_attributes_table_t 0-sized array with flexible array
efi: Rename efi_early_memdesc_ptr() to efi_memdesc_ptr()
arm64/efistub: Clean up KASLR logic
x86/efistub: Drop redundant clearing of BSS
x86/efistub: Avoid returning EFI_SUCCESS on error
x86/efistub: Call Apple set_os protocol on dual GPU Intel Macs
x86/efistub: Enable SMBIOS protocol handling for x86
efistub/smbios: Simplify SMBIOS enumeration API
x86/efi: Drop support for fake EFI memory maps
- Add support for running the kernel in a SEV-SNP guest, over a Secure
VM Service Module (SVSM).
When running over a SVSM, different services can run at different
protection levels, apart from the guest OS but still within the
secure SNP environment. They can provide services to the guest, like
a vTPM, for example.
This series adds the required facilities to interface with such a SVSM
module.
- The usual fixlets, refactoring and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaWQuoACgkQEsHwGGHe
VUrmEw/+KqM5DK5cfpue3gn0RfH6OYUoFxOdYhGkG53qUMc3c3ka5zPVqLoHPkzp
WPXha0Z5pVdrcD9mKtVUW9RIuLjInCM/mnoNc3tIUL+09xxemAjyG1+O+4kodiU7
sZ5+HuKUM2ihoC4Rrm+ApRrZfH4+WcgQNvFky77iObWVBo4yIscS7Pet/MYFvuuz
zNaGp2SGGExDeoX/pMQNI3S9FKYD26HR17AUI3DHpS0teUl2npVi4xDjFVYZh0dQ
yAhTKbSX3Q6ekDDkvAQUbxvWTJw9qoIsvLO9dvZdx6SSWmzF9IbuECpQKGQwYcp+
pVtcHb+3MwfB+nh5/fHyssRTOZp1UuI5GcmLHIQhmhQwCqPgzDH6te4Ud1ovkxOu
3GoBre7KydnQIyv12I+56/ZxyPbjHWmn8Fg106nAwGTdGbBJhfcVYfPmPvwpI4ib
nXpjypvM8FkLzLAzDK6GE9QiXqJJlxOn7t66JiH/FkXR4gnY3eI8JLMfnm5blAb+
97LC7oyeqtstWth9/4tpCILgPR2tirrMQGjUXttgt+2VMzqnEamnFozsKvR95xok
4j6ulKglZjdpn0ixHb2vAzAcOJvD7NP147jtCmXH7M6/f9H1Lih3MKdxX98MVhWB
wSp16udXHzu5lF45J0BJG8uejSgBI2y51jc92HLX7kRULOGyaEo=
=u15r
-----END PGP SIGNATURE-----
Merge tag 'x86_sev_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Add support for running the kernel in a SEV-SNP guest, over a Secure
VM Service Module (SVSM).
When running over a SVSM, different services can run at different
protection levels, apart from the guest OS but still within the
secure SNP environment. They can provide services to the guest, like
a vTPM, for example.
This series adds the required facilities to interface with such a
SVSM module.
- The usual fixlets, refactoring and cleanups
[ And as always: "SEV" is AMD's "Secure Encrypted Virtualization".
I can't be the only one who gets all the newer x86 TLA's confused,
can I?
- Linus ]
* tag 'x86_sev_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/ABI/configfs-tsm: Fix an unexpected indentation silly
x86/sev: Do RMP memory coverage check after max_pfn has been set
x86/sev: Move SEV compilation units
virt: sev-guest: Mark driver struct with __refdata to prevent section mismatch
x86/sev: Allow non-VMPL0 execution when an SVSM is present
x86/sev: Extend the config-fs attestation support for an SVSM
x86/sev: Take advantage of configfs visibility support in TSM
fs/configfs: Add a callback to determine attribute visibility
sev-guest: configfs-tsm: Allow the privlevel_floor attribute to be updated
virt: sev-guest: Choose the VMPCK key based on executing VMPL
x86/sev: Provide guest VMPL level to userspace
x86/sev: Provide SVSM discovery support
x86/sev: Use the SVSM to create a vCPU when not in VMPL0
x86/sev: Perform PVALIDATE using the SVSM when not at VMPL0
x86/sev: Use kernel provided SVSM Calling Areas
x86/sev: Check for the presence of an SVSM in the SNP secrets page
x86/irqflags: Provide native versions of the local_irq_save()/restore()
Merge tag 'x86_cache_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
- Enable Sub-NUMA clustering to work with resource control on Intel by
teaching resctrl to handle scopes due to the clustering which
partitions the L3 cache into sets. Modify and extend the subsystem to
handle such scopes properly
* tag 'x86_cache_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Update documentation with Sub-NUMA cluster changes
x86/resctrl: Detect Sub-NUMA Cluster (SNC) mode
x86/resctrl: Enable shared RMID mode on Sub-NUMA Cluster (SNC) systems
x86/resctrl: Make __mon_event_count() handle sum domains
x86/resctrl: Fill out rmid_read structure for smp_call*() to read a counter
x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC) mode
x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor files
x86/resctrl: Allocate a new field in union mon_data_bits
x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function
x86/resctrl: Initialize on-stack struct rmid_read instances
x86/resctrl: Add a new field to struct rmid_read for summation of domains
x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files
x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems
x86/resctrl: Introduce snc_nodes_per_l3_cache
x86/resctrl: Add node-scope to the options for feature scope
x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
x86/resctrl: Prepare for different scope for control/monitor operations
x86/resctrl: Prepare to split rdt_domain structure
x86/resctrl: Prepare for new domain scope
Merge tag 'x86_cpu_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu model updates from Borislav Petkov:
- Flip the logic to add feature names to /proc/cpuinfo to having to
explicitly specify the flag if there's a valid reason to show it in
/proc/cpuinfo
- Switch a bunch of Intel x86 model checking code to the new CPU model
defines
- Fixes and cleanups
* tag 'x86_cpu_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/intel: Drop stray FAM6 check with new Intel CPU model defines
x86/cpufeatures: Flip the /proc/cpuinfo appearance logic
x86/CPU/AMD: Always inline amd_clear_divider()
x86/mce/inject: Add missing MODULE_DESCRIPTION() line
perf/x86/rapl: Switch to new Intel CPU model defines
x86/boot: Switch to new Intel CPU model defines
x86/cpu: Switch to new Intel CPU model defines
perf/x86/intel: Switch to new Intel CPU model defines
x86/virt/tdx: Switch to new Intel CPU model defines
x86/PCI: Switch to new Intel CPU model defines
x86/cpu/intel: Switch to new Intel CPU model defines
x86/platform/intel-mid: Switch to new Intel CPU model defines
x86/pconfig: Remove unused MKTME pconfig code
x86/cpu: Remove useless work in detect_tme_early()
Merge tag 'x86_bugs_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu mitigation updates from Borislav Petkov:
- Add a spectre_bhi=vmexit mitigation option aimed at cloud
environments
- Remove duplicated Spectre cmdline option documentation
- Add separate macro definitions for syscall handlers which do not
return in order to address objtool warnings
* tag 'x86_bugs_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Add 'spectre_bhi=vmexit' cmdline option
x86/bugs: Remove duplicate Spectre cmdline option descriptions
x86/syscall: Mark exit[_group] syscall handlers __noreturn
Merge tag 'x86_vmware_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vmware updates from Borislav Petkov:
- Add a unified VMware hypercall API layer which should be used by all
callers instead of them doing homegrown solutions. This will provide
for adding API support for confidential computing solutions like TDX
* tag 'x86_vmware_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vmware: Add TDX hypercall support
x86/vmware: Remove legacy VMWARE_HYPERCALL* macros
x86/vmware: Correct macro names
x86/vmware: Use VMware hypercall API
drm/vmwgfx: Use VMware hypercall API
input/vmmouse: Use VMware hypercall API
ptp/vmware: Use VMware hypercall API
x86/vmware: Introduce VMware hypercall API
Merge tag 'x86_misc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Borislav Petkov:
- Make error checking of AMD SMN accesses more robust in the callers as
they're the only ones who can interpret the results properly
- The usual cleanups and fixes, left and right
* tag 'x86_misc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kmsan: Fix hook for unaligned accesses
x86/platform/iosf_mbi: Convert PCIBIOS_* return codes to errnos
x86/pci/xen: Fix PCIBIOS_* return code handling
x86/pci/intel_mid_pci: Fix PCIBIOS_* return code handling
x86/of: Return consistent error type from x86_of_pci_irq_enable()
hwmon: (k10temp) Rename _data variable
hwmon: (k10temp) Remove unused HAVE_TDIE() macro
hwmon: (k10temp) Reduce k10temp_get_ccd_support() parameters
hwmon: (k10temp) Define a helper function to read CCD temperature
x86/amd_nb: Enhance SMN access error checking
hwmon: (k10temp) Check return value of amd_smn_read()
EDAC/amd64: Check return value of amd_smn_read()
EDAC/amd64: Remove unused register accesses
tools/x86/kcpuid: Add missing dir via Makefile
x86, arm: Add missing license tag to syscall tables files
Merge tag 'x86_cc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 confidential computing updates from Borislav Petkov:
"Unrelated x86/cc changes queued here to avoid ugly cross-merges and
conflicts:
- Carve out CPU hotplug function declarations into a separate header
with the goal to be able to use the lockdep assertions in a more
flexible manner
- As a result, refactor cacheinfo code after carving out a function
to return the cache ID associated with a given cache level
- Cleanups
Add support to be able to kexec TDX guests:
- Expand ACPI MADT CPU offlining support
- Add machinery to prepare CoCo guests memory before kexec-ing into a
new kernel
- Cleanup, readjust and massage related code"
* tag 'x86_cc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
ACPI: tables: Print MULTIPROC_WAKEUP when MADT is parsed
x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
x86/mm: Introduce kernel_ident_mapping_free()
x86/smp: Add smp_ops.stop_this_cpu() callback
x86/acpi: Do not attempt to bring up secondary CPUs in the kexec case
x86/acpi: Rename fields in the acpi_madt_multiproc_wakeup structure
x86/mm: Do not zap page table entries mapping unaccepted memory table during kdump
x86/mm: Make e820__end_ram_pfn() cover E820_TYPE_ACPI ranges
x86/tdx: Convert shared memory back to private on kexec
x86/mm: Add callbacks to prepare encrypted memory for kexec
x86/tdx: Account shared memory
x86/mm: Return correct level from lookup_address() if pte is none
x86/mm: Make x86_platform.guest.enc_status_change_*() return an error
x86/kexec: Keep CR4.MCE set during kexec for TDX guest
x86/relocate_kernel: Use named labels for less confusion
cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup
cpu/hotplug: Add support for declaring CPU offlining not supported
x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init
x86/acpi: Extract ACPI MADT wakeup code into a separate file
x86/kexec: Remove spurious unconditional JMP from identity_mapped()
...
Merge tag 'x86_boot_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Borislav Petkov:
- Add a check to warn when cmdline parsing happens before the final
cmdline string has been built and thus arguments can get lost
- Code cleanups and simplifications
* tag 'x86_boot_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/setup: Warn when option parsing is done too early
x86/boot: Clean up the arch/x86/boot/main.c code a bit
x86/boot: Use current_stack_pointer to avoid asm() in init_heap()
need to "spell out" the number of alternates in an ALTERNATIVE_n() macro and
thus have an ever-increasing complexity in those definitions.
For ease of bisection, the old macros are converted to the new, nested
variants in a step-by-step manner so that in case an issue is encountered
during testing, one can pinpoint the place where it fails easier. Because
debugging alternatives is a serious pain.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaU/+MACgkQEsHwGGHe
VUpMGhAAqVbB3DZohv0Oa4BRRvaKFuQ3L7H0NTjK/pbT3EG+phol0zrHby2MnGjD
HWXskps86n91QBB/06vAyZRimV/dvAPvlSKllsRx6ie3VCE4FJPzA4nTWQn/41dC
HWamj78mQuSMgioLzIYdTY79KObtJcUw/X/xz+TTMemfkzkQxukKY7+Y71nZbuKi
rUuCSrfAWNHQaIaoGs2JowGw7te7yNOtKQMCW5TdNLwvJfOAECuoLIFeiEcWHvoO
uGl6FTABNLp26wmaeceUxdjBbTJcM3iV3joZQYED7B+mbJcU/a7tZw7I+mavPrbh
Y6+EOn7rzR0wbcmj0iJ74TKr+uKDme/Qzm3YEKgGvJPj9tRjTDwxWRBnyTeCMbav
NkKVwWTep8K+1qJtGVBwACY6iz89u3P8V5owD8O++KIPQa8rA0m8pN5gaU3PVYYQ
D2UUdqXWIPIFoD4Sveb/WFU8OJKY+Nx7IK8KD03h5tiXW8MmGSa2e5b57gIfCLP7
DbSHyCkTiqEdBrSM4/RaVVckD6NZ39M87H+iV51vYUkCYmODa/riMj0M7SVMi5Jo
S/30jvdHEzWnmDBbOsn9d1XbvB5I+zz3BrcZQ2VSyBx+Y9m+SZ9qsyMEkOWw4uKM
kfFp+XlYVWSnQFo7jY3UUIgVrU9dmdX1WxNX7/2HABDjg3MtWog=
=tGap
-----END PGP SIGNATURE-----
Merge tag 'x86_alternatives_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 alternatives updates from Borislav Petkov:
"This is basically PeterZ's idea to nest the alternative macros to
avoid the need to "spell out" the number of alternates in an
ALTERNATIVE_n() macro and thus have an ever-increasing complexity in
those definitions.
For ease of bisection, the old macros are converted to the new, nested
variants in a step-by-step manner so that in case an issue is
encountered during testing, one can pinpoint the place where it fails
more easily.
Because debugging alternatives is a serious pain"
* tag 'x86_alternatives_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternatives, kvm: Fix a couple of CALLs without a frame pointer
x86/alternative: Replace the old macros
x86/alternative: Convert the asm ALTERNATIVE_3() macro
x86/alternative: Convert the asm ALTERNATIVE_2() macro
x86/alternative: Convert the asm ALTERNATIVE() macro
x86/alternative: Convert ALTERNATIVE_3()
x86/alternative: Convert ALTERNATIVE_TERNARY()
x86/alternative: Convert alternative_call_2()
x86/alternative: Convert alternative_call()
x86/alternative: Convert alternative_io()
x86/alternative: Convert alternative_input()
x86/alternative: Convert alternative_2()
x86/alternative: Convert alternative()
x86/alternatives: Add nested alternatives macros
x86/alternative: Zap alternative_ternary()
Merge tag 'ras_core_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:
- A cleanup and a correction to the error injection driver to inject a
MCA_MISC value only when one has actually been supplied by the user
* tag 'ras_core_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Remove unused variable and return value in machine_check_poll()
x86/mce/inject: Only write MCA_MISC when a value has been supplied
Merge tag 'timers-core-2024-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Updates for timers, timekeeping and related functionality:
Core:
- Make the takeover of a hrtimer based broadcast timer reliable
during CPU hot-unplug. The current implementation suffers from a
race which can lead to broadcast timer starvation in the worst
case.
- VDSO related cleanups and simplifications
- Small cleanups and enhancements all over the place
PTP:
- Replace the architecture specific base clock to clocksource, e.g.
ART to TSC, conversion function with generic functionality to avoid
exposing such internals to drivers and convert all existing drivers
over. This also allows to provide functionality which converts the
other way round in the core code based on the same parameter set.
- Provide a function to convert CLOCK_REALTIME to the base clock to
support the upcoming PPS output driver on Intel platforms.
Drivers:
- A set of Device Tree bindings for new hardware
- Cleanups and enhancements all over the place"
* tag 'timers-core-2024-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
clocksource/drivers/realtek: Add timer driver for rtl-otto platforms
dt-bindings: timer: Add schema for realtek,otto-timer
dt-bindings: timer: Add SOPHGO SG2002 clint
dt-bindings: timer: renesas,tmu: Add R-Car Gen2 support
dt-bindings: timer: renesas,tmu: Add RZ/G1 support
dt-bindings: timer: renesas,tmu: Add R-Mobile APE6 support
clocksource/drivers/mips-gic-timer: Correct sched_clock width
clocksource/drivers/mips-gic-timer: Refine rating computation
clocksource/drivers/sh_cmt: Address race condition for clock events
clocksource/driver/arm_global_timer: Remove unnecessary ‘0’ values from err
clocksource/drivers/arm_arch_timer: Remove unnecessary ‘0’ values from irq
tick/broadcast: Make takeover of broadcast hrtimer reliable
tick/sched: Combine WARN_ON_ONCE and print_once
x86/vdso: Remove unused include
x86/vgtod: Remove unused typedef gtod_long_t
x86/vdso: Fix function reference in comment
vdso: Add comment about reason for vdso struct ordering
vdso/gettimeofday: Clarify comment about open coded function
timekeeping: Add missing kernel-doc function comments
tick: Remove unused tick_nohz_get_idle_calls()
...
Merge runtime constants infrastructure with implementations for x86 and
arm64.
This is one of four branches that came out of me looking at profiles of
my kernel build filesystem load on my 128-core Altra arm64 system, where
pathname walking and the user copies (particularly strncpy_from_user()
for fetching the pathname from user space) is very hot.
This is a very specialized "instruction alternatives" model where the
dentry hash pointer and hash count will be constants for the lifetime of
the kernel, but the allocations are not static; they are done early during
kernel boot. In order to avoid the pointer load and dynamic shift, we
just rewrite the constants in the instructions in place.
We can't use the "generic" alternative instructions infrastructure,
because different architectures do it very differently, and it's
actually simpler to just have very specific helpers, with a fallback to
the generic ("old") model of just using variables for architectures that
do not implement the runtime constant patching infrastructure.
Link: https://lore.kernel.org/all/CAHk-=widPe38fUNjUOmX11ByDckaeEo9tN4Eiyke9u1SAtu9sA@mail.gmail.com/
* runtime-constants:
arm64: add 'runtime constant' support
runtime constants: add x86 architecture support
runtime constants: add default dummy infrastructure
vfs: dcache: move hashlen_hash() from callers into d_hash()
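For architectures without the patching support, the fallback really is the
plain-variables model described above. A minimal user-space sketch of that
fallback, with illustrative names that only approximate the kernel's macros:

  #include <stdint.h>
  #include <stdio.h>

  /* Fallback model: the "constant" is an ordinary variable load. An arch
   * implementing runtime-constant support patches the boot-time value into
   * the instruction stream as an immediate instead. */
  #define runtime_const_ptr(sym)                  (sym)
  #define runtime_const_shift_right_32(val, sym)  ((uint32_t)(val) >> (sym))

  static uintptr_t dentry_hashtable;  /* stands in for the boot-time allocation */
  static unsigned int d_hash_shift;

  static uintptr_t d_hash(uint32_t hashlen)
  {
          /* pointer load plus dynamic shift; patching turns both into immediates */
          return runtime_const_ptr(dentry_hashtable) +
                 runtime_const_shift_right_32(hashlen, d_hash_shift);
  }

  int main(void)
  {
          dentry_hashtable = 0x1000;  /* pretend early-boot allocation */
          d_hash_shift = 18;
          printf("bucket: %#lx\n", (unsigned long)d_hash(0xdeadbeefu));
          return 0;
  }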
Merge tag 'timers-v6.11-rc1' of https://git.linaro.org/people/daniel.lezcano/linux into timers/core
Pull clocksource/event driver updates from Daniel Lezcano:
- Remove unnecessary local variables initialization as they will be
initialized in the code path anyway right after on the ARM arch
timer and the ARM global timer (Li kunyu)
- Fix a race condition in the interrupt leading to a deadlock on the
SH CMT driver. Note that this fix was not tested on the platform
using this timer but the fix seems reasonable enough to be picked
confidently (Niklas Söderlund)
- Increase the rating of the gic-timer and use the configured width
clocksource register on the MIPS architecture (Jiaxun Yang)
- Add the DT bindings for the TMU on the Renesas platforms (Geert
Uytterhoeven)
- Add the DT bindings for the SOPHGO SG2002 clint on RiscV (Thomas
Bonnefille)
- Add the rtl-otto timer driver along with the DT bindings for the
Realtek platform (Chris Packham)
Link: https://lore.kernel.org/all/91cd05de-4c5d-4242-a381-3b8a4fe6a2a2@linaro.org
Merge v6.10-rc6 into drm-next
The exynos-next pull is based on a newer -rc than drm-next, hence
backmerge first to make sure the unrelated conflicts we accumulated
don't end up randomly in the exynos merge pull, but are separated out.
Conflicts are all benign: adjacent changes in amdgpu and fbdev-dma
code, and a cherry-pick conflict in xe.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
There are two separate checks in prctl_enable_tagged_addr() that nr_bits
is in the correct range. The checks are arranged such that the correct case
is sandwiched between both error cases, which do exactly the same thing.
Simplify the if condition and pull the correct case outside with the
rest of the success code path.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20240702132139.3332013-4-yosryahmed%40google.com
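A condensed sketch of the control-flow change (LAM_U57_BITS is the existing
limit; locking and the actual LAM enablement are elided here):

  /* Before: the success case is sandwiched between two identical error paths */
  if (!nr_bits) {
          ret = -EINVAL;
  } else if (nr_bits <= LAM_U57_BITS) {
          /* set up LAM masks */
  } else {
          ret = -EINVAL;
  }

  /* After: a single range check; the success path joins the rest of the code */
  if (!nr_bits || nr_bits > LAM_U57_BITS)
          return -EINVAL;
  /* set up LAM masks */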
LAM can only be enabled when a process is single-threaded. But _kernel_
threads can temporarily use a single-threaded process's mm. That means
that a context-switching kernel thread can race and observe the mm's LAM
metadata (mm->context.lam_cr3_mask) change.
The context switch code does two logical things with that metadata:
populate CR3 and populate 'cpu_tlbstate.lam'. If it hits this race,
'cpu_tlbstate.lam' and CR3 can end up out of sync.
This de-synchronization is currently harmless. But it is confusing and
might lead to warnings or real bugs.
Update set_tlbstate_lam_mode() to take in the LAM mask and untag mask
instead of an mm_struct pointer, and while we are at it, rename it to
cpu_tlbstate_update_lam(). This should also make it clearer that we are
updating cpu_tlbstate. In switch_mm_irqs_off(), read the LAM mask once
and use it for both the cpu_tlbstate update and the CR3 update.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20240702132139.3332013-3-yosryahmed%40google.com
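A condensed sketch of the resulting interface, per the description above (the
kernel's exact bodies differ in detail):

  static inline void cpu_tlbstate_update_lam(unsigned long lam, u64 untag_mask)
  {
          this_cpu_write(cpu_tlbstate.lam, lam >> X86_CR3_LAM_U57_BIT);
          this_cpu_write(tlbstate_untag_mask, untag_mask);
  }

  /* In switch_mm_irqs_off(): read the LAM mask once, use it for both updates */
  unsigned long new_lam = mm_lam_cr3_mask(next);

  cpu_tlbstate_update_lam(new_lam, mm_untag_mask(next));
  write_cr3(build_cr3(next->pgd, new_asid, new_lam));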
LAM can only be enabled when a process is single-threaded. But _kernel_
threads can temporarily use a single-threaded process's mm.
If LAM is enabled by a userspace process while a kthread is using its
mm, the kthread will not observe LAM enablement (i.e. LAM will be
disabled in CR3). This could be fine for the kthread itself, as LAM only
affects userspace addresses. However, if the kthread context switches to
a thread in the same userspace process, CR3 may or may not be updated
because the mm_struct doesn't change (based on pending TLB flushes). If
CR3 is not updated, the userspace thread will run incorrectly with LAM
disabled, which may cause page faults when using tagged addresses.
Example scenario:
CPU 1                                   CPU 2
/* kthread */
kthread_use_mm()
                                        /* user thread */
                                        prctl_enable_tagged_addr()
                                        /* LAM enabled on CPU 2 */
/* LAM disabled on CPU 1 */
context_switch() /* to CPU 1 */
/* Switching to user thread */
switch_mm_irqs_off()
/* CR3 not updated */
/* LAM is still disabled on CPU 1 */
Synchronize LAM enablement by sending an IPI to all CPUs running with
the mm_struct to enable LAM. This makes sure LAM is enabled on CPU 1
in the above scenario before prctl_enable_tagged_addr() returns and
userspace starts using tagged addresses, and before it's possible to
run the userspace process on CPU 1.
In switch_mm_irqs_off(), delay reading the LAM mask until after
mm_cpumask() is updated. This ensures that if an outdated LAM mask is
written to CR3, an IPI is received to update it right after IRQs are
re-enabled.
[ dhansen: Add a LAM enabling helper and comment it ]
Fixes: 82721d8b25 ("x86/mm: Handle LAM on context switch")
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20240702132139.3332013-2-yosryahmed%40google.com
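A condensed sketch of the synchronization shape (helper names follow the
commit; error paths, locking and flag bookkeeping are elided):

  static void enable_lam_func(void *__mm)
  {
          struct mm_struct *mm = __mm;

          /* Runs on every CPU in mm_cpumask(mm), with IRQs disabled */
          if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm) {
                  write_cr3(__read_cr3() | mm->context.lam_cr3_mask);
                  cpu_tlbstate_update_lam(mm->context.lam_cr3_mask,
                                          mm->context.untag_mask);
          }
  }

  static void mm_enable_lam(struct mm_struct *mm)
  {
          mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
          mm->context.untag_mask  = ~GENMASK(62, 57);

          /* Kernel threads may be using this mm: IPI them and wait, so LAM
           * is live everywhere before the prctl() returns */
          on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);
  }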
There isn't a simple hardware bit that indicates whether a CPU is running in
Sub-NUMA Cluster (SNC) mode. Infer the state by comparing the number of CPUs
sharing the L3 cache with CPU0 to the number of CPUs in the same NUMA node as
CPU0.
Add the missing definition of pr_fmt() to monitor.c. This wasn't noticed
before as there are only "can't happen" console messages from this file.
[ bp: Massage commit message. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-19-tony.luck@intel.com
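The same inference can be reproduced from user space with the standard sysfs
cpulist files. A small model (minimal error handling; assumes CPU0 sits in
node0 and that cpulists look like "0-35,72-107"):

  #include <stdio.h>

  static int count_cpulist(const char *path)
  {
          FILE *f = fopen(path, "r");
          int count = 0, a, b, c;

          if (!f)
                  return -1;
          while (fscanf(f, "%d", &a) == 1) {
                  b = a;
                  c = fgetc(f);
                  if (c == '-') {
                          if (fscanf(f, "%d", &b) != 1)
                                  break;
                          c = fgetc(f);
                  }
                  count += b - a + 1;
                  if (c != ',')
                          break;
          }
          fclose(f);
          return count;
  }

  int main(void)
  {
          int l3 = count_cpulist("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list");
          int node = count_cpulist("/sys/devices/system/node/node0/cpulist");

          if (l3 > 0 && node > 0)
                  printf("SNC nodes per L3 cache: %d\n", l3 / node);
          return 0;
  }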
Hardware has two RMID configuration options for SNC systems. The default
mode divides RMID counters between SNC nodes. E.g. with 200 RMIDs and
two SNC nodes per L3 cache, RMIDs 0..99 are used on node 0 and 100..199
on node 1. This isn't compatible with Linux resctrl usage. On this
example system a process using RMID 5 would only update monitor counters
while running on SNC node 0.
The other mode is "RMID Sharing Mode". This is enabled by clearing bit
0 of the RMID_SNC_CONFIG (0xCA0) model specific register. In this mode
the number of logical RMIDs is the number of physical RMIDs (from CPUID
leaf 0xF) divided by the number of SNC nodes per L3 cache instance. A
process can use the same RMID across different SNC nodes.
See the "Intel Resource Director Technology Architecture Specification"
for additional details.
When SNC is enabled, update the MSR when a monitor domain is marked
online. Technically this is overkill. It only needs to be done once
per L3 cache instance rather than per SNC domain. But there is no harm
in doing it more than once, and this is not in a critical path.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240702173820.90368-3-tony.luck@intel.com
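A condensed sketch of the hook (the MSR number and bit are from the text
above; msr_clear_bit() is the existing kernel helper, while the function name
here is illustrative):

  #define MSR_RMID_SNC_CONFIG     0xCA0

  /* Called when a monitor domain comes online */
  static void snc_rmid_sharing_enable(void)
  {
          if (snc_nodes_per_l3_cache > 1)
                  msr_clear_bit(MSR_RMID_SNC_CONFIG, 0);  /* RMID sharing mode */
  }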
Legacy resctrl monitor files must provide the sum of event values across
all Sub-NUMA Cluster (SNC) domains that share an L3 cache instance.
There are now two cases:
1) A specific domain is provided in struct rmid_read
This is either a non-SNC system, or the request is to read data
from just one SNC node.
2) Domain pointer is NULL. In this case the cacheinfo field in struct
rmid_read indicates that all SNC nodes that share that L3 cache
instance should have the event read and return the sum of all
values.
Update the CPU sanity check. The existing check that an event is read
from a CPU in the requested domain still applies when reading a single
domain. But when summing across domains a more relaxed check that the
current CPU is in the scope of the L3 cache instance is appropriate
since the MSRs to read events are scoped at L3 cache level.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-17-tony.luck@intel.com
mon_event_read() fills out most fields of the struct rmid_read that is passed
via an smp_call*() function to a CPU that is part of the correct domain to
read the monitor counters.
With Sub-NUMA Cluster (SNC) mode there are now two cases to handle:
1) Reading a file that returns a value for a single domain.
+ Choose the CPU to execute from the domain cpu_mask
2) Reading a file that must sum across domains sharing an L3 cache
instance.
+ Indicate to called code that a sum is needed by passing a NULL
rdt_mon_domain pointer.
+ Choose the CPU from the L3 shared_cpu_map.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-16-tony.luck@intel.com
In SNC mode, there are multiple subdirectories in each L3 level monitor
directory (one for each SNC node). If all the CPUs in an SNC node are taken
offline, just remove the SNC directory for that node. In non-SNC mode, or when
the last SNC node directory is removed, remove the L3 monitor directory.
Add a helper function to avoid duplicated code.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240702173820.90368-2-tony.luck@intel.com
When SNC mode is enabled, create subdirectories and files to monitor at the SNC
node granularity. Legacy behavior is preserved by tagging the monitor files at
the L3 granularity with the "sum" attribute. When the user reads these files
the kernel will read monitor data from all SNC nodes that share the same L3
cache instance and return the aggregated value to the user.
Note that the "domid" field for files that must sum across SNC domains has the
L3 cache instance id, while non-summing files use the domain id.
The "sum" files do not need to make a call to mon_event_read() to initialize
the MBM counters. This will be handled by initializing the individual SNC nodes
that share the L3.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-14-tony.luck@intel.com
When Sub-NUMA Cluster (SNC) mode is enabled, the legacy monitor reporting files
must report the sum of the data from all of the SNC nodes that share the L3
cache that is referenced by the monitor file.
Resctrl squeezes all the attributes of these files into 32 bits so they can be
stored in the "priv" field of struct kernfs_node.
Currently, only three monitor events are defined by enum resctrl_event_id, so
reducing the event id field from 8 bits to 7 bits still provides more than enough space to
represent all the known event types.
But note that this choice was arbitrary. The "rid" field is also far wider
than needed for the current number of resource id types. This structure is
purely internal to resctrl, no ABI issues with modifying it. Subsequent changes
may rearrange the allocation of bits between each of the fields as needed.
Give the bit to a new "sum" field that indicates that reading this file must
sum across SNC nodes. This bit also indicates that the domid field is the id of
an L3 cache (instead of a domain id) to find which domains must be summed.
Fix up other issues in the kerneldoc description for mon_data_bits.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-13-tony.luck@intel.com
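An illustrative layout of the packed attributes (field widths are a sketch
based on the description above, not necessarily the exact kernel definition):

  union mon_data_bits {
          void *priv;                      /* as stored in kernfs_node->priv */
          struct {
                  unsigned int rid   : 10; /* resource id */
                  unsigned int evtid : 7;  /* event id, shrunk from 8 bits */
                  unsigned int sum   : 1;  /* new: sum across SNC domains */
                  unsigned int domid : 14; /* domain id, or L3 cache id when sum==1 */
          } u;
  };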
In Sub-NUMA Cluster (SNC) mode Linux must create the monitor
files in the original "mon_L3_XX" directories and also in each
of the "mon_sub_L3_YY" directories.
Refactor mkdir_mondata_subdir() to move the creation of monitoring files
into a helper function to avoid the need to duplicate code later.
No functional change.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-12-tony.luck@intel.com
New semantics rely on some struct rmid_read members having NULL values to
distinguish between the SNC and non-SNC scenarios. resctrl can therefore no
longer afford to leave these structures uninitialized.
Initialize all on-stack declarations of struct rmid_read:
rdtgroup_mondata_show()
mbm_update()
mkdir_mondata_subdir()
to ensure that garbage values from the stack are not passed down to other
functions.
[ bp: Massage commit message. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-11-tony.luck@intel.com
When a user reads a monitor file rdtgroup_mondata_show() calls mon_event_read()
to package up all the required details into an rmid_read structure which is
passed across the smp_call*() infrastructure to code that will read data from
hardware and return the value (or error status) in the rmid_read structure.
Sub-NUMA Cluster (SNC) mode adds files with new semantics. These require the
smp_call-ed code to sum event data from all domains that share an L3 cache.
Add a pointer to the L3 "cacheinfo" structure to struct rmid_read for the data
collection routines to use to pick the domains to be summed.
[ Reinette: the rmid_read structure has become complex enough so document each
of its fields and provide the kerneldoc documentation for struct rmid_read. ]
Co-developed-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-10-tony.luck@intel.com
When SNC is enabled, monitoring data is collected at the SNC node granularity,
but must be reported at L3-cache granularity for backwards compatibility in
addition to reporting at the node level.
Add a "ci" field to the rdt_mon_domain structure to save the cache information
about the enclosing L3 cache for the domain. This provides:
1) The cache id which is needed to compose the name of the legacy monitoring
directory, and to determine which domains should be summed to provide
L3-scoped data.
2) The shared_cpu_map which is needed to determine which CPUs can be used to
read the RMID counters with the MSR interface.
This is the first step to an eventual goal of monitor reporting files like this
(for a system with two SNC nodes per L3):
$ cd /sys/fs/resctrl/mon_data
$ tree mon_L3_00
mon_L3_00 <- 00 here is L3 cache id
├── llc_occupancy \ These files provide legacy support
├── mbm_local_bytes > for non-SNC aware monitor apps
├── mbm_total_bytes / that expect data at L3 cache level
├── mon_sub_L3_00 <- 00 here is SNC node id
│ ├── llc_occupancy \ These files are finer grained
│ ├── mbm_local_bytes > data from each SNC node
│ └── mbm_total_bytes /
└── mon_sub_L3_01
├── llc_occupancy \
├── mbm_local_bytes > As above, but for node 1.
└── mbm_total_bytes /
[ bp: Massage commit message. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-9-tony.luck@intel.com
When SNC is enabled there is a mismatch between the MBA control function
which operates at L3 cache scope and the MBM monitor functions which
measure memory bandwidth on each SNC node.
Block use of the mba_MBps when scopes for MBA/MBM do not match.
Improve user diagnostics by adding an invalfc() message when mba_MBps
is not supported.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-8-tony.luck@intel.com
Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.
This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.
Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.
The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket the lower number half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.
RMID sharing mode divides the physical RMIDs evenly between SNC nodes
but uses a logical RMID in the IA32_PQR_ASSOC MSR. For example a system
with 200 physical RMIDs (as enumerated by CPUID leaf 0xF) that has two
SNC nodes per L3 cache instance would have 100 logical RMIDs available
for Linux to use. A task running on SNC node 0 with RMID 5 would
accumulate LLC occupancy and MBM bandwidth data in physical RMID 5.
Another task using RMID 5, but running on SNC node 1 would accumulate
data in physical RMID 105.
Even with this renumbering SNC mode requires several changes in resctrl
behavior for correct operation.
Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate
how many SNC domains share an L3 cache instance. Initialize this to
"1". Runtime detection of SNC mode will adjust this value.
Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
nodes.
3) Add a function to convert from logical RMID values (assigned to
tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
to physical RMID values to load into IA32_QM_EVTSEL MSR when
reading counters on each SNC node.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-7-tony.luck@intel.com
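The renumbering in point 3 is plain arithmetic. A standalone model using the
example figures from the text (200 physical RMIDs, two SNC nodes per L3):

  #include <assert.h>
  #include <stdio.h>

  static int logical_to_physical_rmid(int lrmid, int snc_node,
                                      int logical_rmids, int snc_nodes_per_l3)
  {
          if (snc_nodes_per_l3 == 1)
                  return lrmid;
          return lrmid + snc_node * logical_rmids;
  }

  int main(void)
  {
          int physical_rmids = 200, snc_nodes = 2;
          int logical_rmids = physical_rmids / snc_nodes;  /* 100 visible to Linux */

          assert(logical_to_physical_rmid(5, 0, logical_rmids, snc_nodes) == 5);
          assert(logical_to_physical_rmid(5, 1, logical_rmids, snc_nodes) == 105);
          printf("logical RMID 5 on SNC node 1 -> physical RMID %d\n",
                 logical_to_physical_rmid(5, 1, logical_rmids, snc_nodes));
          return 0;
  }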
Currently supported resctrl features are all domain scoped the same as the
scope of the L2 or L3 caches.
Add RESCTRL_L3_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are divided between
nodes that share an L3 cache.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-6-tony.luck@intel.com
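The resulting scope options look roughly like this (RESCTRL_L3_NODE is named
in the commit; the cache enumerators mirror the existing L2/L3 cache levels):

  enum resctrl_scope {
          RESCTRL_L2_CACHE = 2,
          RESCTRL_L3_CACHE = 3,
          RESCTRL_L3_NODE,
  };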
The same rdt_domain structure is used for both control and monitor
functions. But this results in wasted memory as some of the fields are
only used by control functions, while most are only used for monitor
functions.
Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.
Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-5-tony.luck@intel.com
Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.
Prepare for systems that use different scope (specifically Intel needs
to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
and NODE scope for cache occupancy and memory bandwidth monitoring).
Create separate domain lists for control and monitor operations.
Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now that the domains are
allocated independently, it is no longer necessary to disable both control
and monitor operations if either fails.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-4-tony.luck@intel.com
The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.
To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-3-tony.luck@intel.com
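Taken together with the previous patch's split, the resulting shape is
roughly this sketch (remaining members elided):

  struct rdt_domain_hdr {
          struct list_head        list;
          int                     id;
          struct cpumask          cpu_mask;
  };

  struct rdt_ctrl_domain {
          struct rdt_domain_hdr   hdr;
          /* control-only state: staged configs, bandwidth values, ... */
  };

  struct rdt_mon_domain {
          struct rdt_domain_hdr   hdr;
          /* monitor-only state: MBM counters, enclosing L3 cacheinfo, ... */
  };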
Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.
In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.
Clean up the error handling when looking up domains. Report invalid ids
before calling rdt_find_domain() in preparation for better messages when
scope can be other than cache scope. This means that rdt_find_domain()
will never return an error, so remove the error checks from the call sites.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-2-tony.luck@intel.com
Between kexec and confidential VM support, handling the EFI memory maps
correctly on x86 is already proving to be rather difficult (as opposed
to other EFI architectures which manage to never modify the EFI memory
map to begin with).
EFI fake memory map support is essentially a development hack (for
testing new support for the 'special purpose' and 'more reliable' EFI
memory attributes) that leaked into production code. The regions marked
in this manner are not actually recognized as such by the firmware
itself or the EFI stub (and never have), and marking memory as 'more
reliable' seems rather futile if the underlying memory is just ordinary
RAM.
Marking memory as 'special purpose' in this way is also dubious, but may
be in use in production code nonetheless. However, the same should be
achievable by using the memmap= command line option with the ! operator.
EFI fake memmap support is not enabled by any of the major distros
(Debian, Fedora, SUSE, Ubuntu) and does not exist on other
architectures, so let's drop support for it.
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
objtool complains:
arch/x86/kvm/kvm.o: warning: objtool: .altinstr_replacement+0xc5: call without frame pointer save/setup
vmlinux.o: warning: objtool: .altinstr_replacement+0x2eb: call without frame pointer save/setup
Make sure %rsp is an output operand to the respective asm() statements.
The test_cc() hunk and ALT_OUTPUT_SP() are courtesy of peterz, who also
added some helpful debugging info to the documentation.
Now on to the explanations:
tl;dr: The alternatives macros are pretty fragile.
If I do ALT_OUTPUT_SP(output) in order to be able to package in a %rsp
reference for objtool so that a stack frame gets properly generated, the
inline asm input operand with positional argument 0 in clear_page():
"0" (page)
gets "renumbered" due to the added
: "+r" (current_stack_pointer), "=D" (page)
and then gcc says:
./arch/x86/include/asm/page_64.h:53:9: error: inconsistent operand constraints in an ‘asm’
The fix is to use an explicit "D" constraint which points to a singleton
register class (gcc terminology) which ends up doing what is expected
here: the page pointer - input and output - should be in the same %rdi
register.
Other register classes have more than one register in them - example:
"r" and "=r" or "A":
‘A’
The ‘a’ and ‘d’ registers. This class is used for
instructions that return double word results in the ‘ax:dx’
register pair. Single word values will be allocated either in
‘ax’ or ‘dx’.
so using "D" and "=D" just works in this particular case.
And yes, one would say, sure, why don't you do "+D" but then:
: "+r" (current_stack_pointer), "+D" (page)
: [old] "i" (clear_page_orig), [new1] "i" (clear_page_rep), [new2] "i" (clear_page_erms),
: "cc", "memory", "rax", "rcx")
now find the Waldo^Wcomma which throws a wrench into all this.
Because that silly macro has an "input..." consume-all last macro arg
and in it, one is supposed to supply input *and* clobbers, leading to
silly syntax snafus.
Yap, they need to be cleaned up, one fine day...
Closes: https://lore.kernel.org/oe-kbuild-all/202406141648.jO9qNGLa-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240625112056.GDZnqoGDXgYuWBDUwu@fat_crate.local
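For reference, the fixed call site ends up shaped like this (condensed; the
current_stack_pointer output operand is packaged inside the macro via
ALT_OUTPUT_SP()):

  static inline void clear_page(void *page)
  {
          alternative_call_2(clear_page_orig,
                             clear_page_rep, X86_FEATURE_REP_GOOD,
                             clear_page_erms, X86_FEATURE_ERMS,
                             "=D" (page),   /* output: singleton %rdi class */
                             "D" (page),    /* input: same register, no "0" tie */
                             "cc", "memory", "rax", "rcx");
  }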
The outer if () should have been dropped when switching to c->x86_vfm.
Fixes: 6568fc18c2 ("x86/cpu/intel: Switch to new Intel CPU model defines")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240529183605.17520-1-andrew.cooper3@citrix.com
The 'profile_pc()' function is used for timer-based profiling, which
isn't really all that relevant any more to begin with, but it also ends
up making assumptions based on the stack layout that aren't necessarily
valid.
Basically, the code tries to account the time spent in spinlocks to the
caller rather than the spinlock, and while I support that as a concept,
it's not worth the code complexity or the KASAN warnings when no serious
profiling is done using timers anyway these days.
And the code really does depend on stack layout that is only true in the
simplest of cases. We've lost the comment at some point (I think when
the 32-bit and 64-bit code was unified), but it used to say:
Assume the lock function has either no stack frame or a copy
of eflags from PUSHF.
which explains why it just blindly loads a word or two straight off the
stack pointer and then takes a minimal look at the values to just check
if they might be eflags or the return pc:
Eflags always has bits 22 and up cleared unlike kernel addresses
but that basic stack layout assumption assumes that there isn't any lock
debugging etc going on that would complicate the code and cause a stack
frame.
It causes KASAN unhappiness reported for years by syzkaller [1] and
others [2].
With no real practical reason for this any more, just remove the code.
Just for historical interest, here's some background commits relating to
this code from 2006:
0cb91a2293 ("i386: Account spinlocks to the caller during profiling for !FP kernels")
31679f38d8 ("Simplify profile_pc on x86-64")
and a code unification from 2009:
ef4512882d ("x86: time_32/64.c unify profile_pc")
but the basics of this thing actually goes back to before the git tree.
Link: https://syzkaller.appspot.com/bug?extid=84fe685c02cd112a2ac3 [1]
Link: https://lore.kernel.org/all/CAK55_s7Xyq=nh97=K=G1sxueOFrJDAvPOJAL4TPTCAYvmxO9_A@mail.gmail.com/ [2]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In cloud environments it can be useful to *only* enable the vmexit
mitigation and leave syscalls vulnerable. Add that as an option.
This is similar to the old spectre_bhi=auto option which was removed
with the following commit:
36d4fe147c ("x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto")
with the main difference being that this has a more descriptive name and
is disabled by default.
Mitigation switch requested by Maksim Davydov <davydov-max@yandex-team.ru>.
[ bp: Massage. ]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/2cbad706a6d5e1da2829e5e123d8d5c80330148c.1719381528.git.jpoimboe@kernel.org
VMware hypercalls use I/O port, VMCALL or VMMCALL instructions. Add a call to
__tdx_hypercall() in order to support TDX guests.
No change to high-bandwidth hypercalls, as only low-bandwidth ones are supported
for TDX guests.
[ bp: Massage, clear on-stack struct tdx_module_args variable. ]
Co-developed-by: Tim Merrifield <tim.merrifield@broadcom.com>
Signed-off-by: Tim Merrifield <tim.merrifield@broadcom.com>
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-9-alexey.makhalov@broadcom.com
VCPU_RESERVED and LEGACY_X2APIC are not VMware hypercall commands. These are
bits in the return value of the VMWARE_CMD_GETVCPU_INFO command. Change
the VMWARE_CMD_ prefix to GETVCPU_INFO_, and move the bit-shift
operation into the macro body.
Fixes: 4cca6ea04d ("x86/apic: Allow x2apic without IR on VMware platform")
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-7-alexey.makhalov@broadcom.com
Remove the VMWARE_CMD macro and move to the vmware_hypercall API.
No functional changes intended.
Use u32/u64 instead of uint32_t/uint64_t across the file.
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-6-alexey.makhalov@broadcom.com
Introduce a vmware_hypercall family of functions. It is a common implementation
to be used by the VMware guest code and virtual device drivers in an
architecture-independent manner.
The API consists of the vmware_hypercallX and vmware_hypercall_hb_{out,in}
sets of functions, analogous to KVM's hypercall API. The architecture-specific
implementation is hidden inside.
It will simplify future enhancements to VMware hypercalls, such as SEV-ES-
and TDX-related changes, without the need to modify callers in device driver
code.
The current implementation extends an idea from
bac7b4e843 ("x86/vmware: Update platform detection code for VMCALL/VMMCALL hypercalls")
to have a slow but safe path, vmware_hypercall_slow(), early during boot when
alternatives are not yet applied. The code inherits the VMWARE_CMD logic from
the commit mentioned above.
Move common macros from vmware.c to vmware.h.
[ bp: Fold in a fix:
https://lore.kernel.org/r/20240625083348.2299-1-alexey.makhalov@broadcom.com ]
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-2-alexey.makhalov@broadcom.com
x86_of_pci_irq_enable() returns PCIBIOS_* code received from
pci_read_config_byte() directly and also -EINVAL which are not
compatible error types. x86_of_pci_irq_enable() is used as the
(*pcibios_enable_irq) callback, which should not return PCIBIOS_* codes.
Convert the PCIBIOS_* return code from pci_read_config_byte() into
normal errno using pcibios_err_to_errno().
Fixes: 96e0a0797e ("x86: dtb: Add support for PCI devices backed by dtb nodes")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240527125538.13620-1-ilpo.jarvinen@linux.intel.com
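A minimal sketch of the conversion, with the interrupt-mapping tail simplified
(the real function parses the devicetree IRQ in more steps):

	static int x86_of_pci_irq_enable(struct pci_dev *dev)
	{
		int virq, ret;
		u8 pin;

		ret = pci_read_config_byte(dev, PCI_INTERRUPT_PIN, &pin);
		if (ret)
			return pcibios_err_to_errno(ret);	/* was: return ret; */
		if (!pin)
			return 0;

		virq = of_irq_parse_and_map_pci(dev, 0, 0);	/* simplified */
		if (virq == 0)
			return -EINVAL;

		dev->irq = virq;
		return 0;
	}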
In a TDX VM without paravisor, currently the default timer is the Hyper-V
timer, which depends on the slow VM Reference Counter MSR: the Hyper-V TSC
page is not enabled in such a VM because the VM uses Invariant TSC as a
better clocksource and it's challenging to mark the Hyper-V TSC page shared
in very early boot.
Lower the rating of the Hyper-V timer so the local APIC timer becomes the
default timer in such a VM, and print a warning in case Invariant TSC
is unavailable in such a VM. This change should cause no perceivable
performance difference.
Cc: stable@vger.kernel.org # 6.6+
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20240621061614.8339-1-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20240621061614.8339-1-decui@microsoft.com>
I'm getting tired of telling people to put a magic "" in the
#define X86_FEATURE /* "" ... */
comment to hide the new feature flag from the user-visible
/proc/cpuinfo.
Flip the logic to make it explicit: an explicit "<name>" in the comment
adds the flag to /proc/cpuinfo and otherwise not, by default.
Add the "<name>" of all the existing flags to keep backwards
compatibility with userspace.
There should be no functional changes resulting from this.
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240618113840.24163-1-bp@kernel.org
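For illustration, under the flipped logic two hypothetical flags (made-up
names and bit positions) would look like this:

	#define X86_FEATURE_FOO		(18*32+ 4) /* "foo" Shown as "foo" in /proc/cpuinfo */
	#define X86_FEATURE_BAR		(18*32+ 5) /* No name in the comment: hidden from /proc/cpuinfo */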
Since FineIBT performs checking at the destination, it is weaker against
attacks that can construct arbitrary executable memory contents. As such,
some system builders want to run with FineIBT disabled by default. Allow
the "cfi=kcfi" boot param mode to be selectable through Kconfig via the
newly introduced CONFIG_CFI_AUTO_DEFAULT.
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240501000218.work.998-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
This implements the runtime constant infrastructure for x86, allowing the
dcache d_hash() function to be generated using the hash table address as a
constant, followed by a shift of the hash index by another constant.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit
6791e0ea30 ("x86/resctrl: Access per-rmid structures by index")
adds logic to map individual monitoring groups into a global index space used
for tracking allocated RMIDs.
Attempts to free the default RMID are ignored in free_rmid(), and this works
fine on x86.
With arm64 MPAM, there is a latent bug here however: on platforms with no
monitors exposed through resctrl, each control group still gets a different
monitoring group ID as seen by the hardware, since the CLOSID always forms part
of the monitoring group ID.
This means that when removing a control group, the code may try to free this
group's default monitoring group RMID for real. If there are no monitors
however, the RMID tracking table rmid_ptrs[] would be a waste of memory and is
never allocated, leading to a splat when free_rmid() tries to dereference the
table.
One option would be to treat RMID 0 as special for every CLOSID, but this would
be ugly since bookkeeping still needs to be done for these monitoring group IDs
when there are monitors present in the hardware.
Instead, add a gating check of resctrl_arch_mon_capable() in free_rmid(), and
just do nothing if the hardware doesn't have monitors.
This fix mirrors the gating checks already present in
mkdir_rdt_prepare_rmid_alloc() and elsewhere.
No functional change on x86.
[ bp: Massage commit message. ]
Fixes: 6791e0ea30 ("x86/resctrl: Access per-rmid structures by index")
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240618140152.83154-1-Dave.Martin@arm.com
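A sketch of the gate in free_rmid(); the surrounding lookup code is
abbreviated and partly illustrative:

	void free_rmid(u32 closid, u32 rmid)
	{
		u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
		struct rmid_entry *entry;

		lockdep_assert_held(&rdtgroup_mutex);

		/*
		 * Do not allow the default RMID to be freed; and if the
		 * hardware has no monitors, rmid_ptrs[] was never allocated,
		 * so there is nothing to free.
		 */
		if (!resctrl_arch_mon_capable() ||
		    idx == resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID,
							RESCTRL_RESERVED_RMID))
			return;

		entry = __rmid_entry(idx);
		list_add_tail(&entry->list, &rmid_free_lru);	/* simplified */
	}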
To allow execution at a level other than VMPL0, an SVSM must be present.
Allow the SEV-SNP guest to continue booting if an SVSM is detected and
the hypervisor supports the SVSM feature as indicated in the GHCB
hypervisor features bitmap.
[ bp: Massage a bit. ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/2ce7cf281cce1d0cba88f3f576687ef75dc3c953.1717600736.git.thomas.lendacky@amd.com
When an SVSM is present, the guest can also request attestation reports
from it. These SVSM attestation reports can be used to attest the SVSM
and any services running within the SVSM.
Extend the config-fs attestation support to provide them. This involves
creating four new config-fs attributes:
- 'service-provider' (input)
This attribute is used to determine whether the attestation request
should be sent to the specified service provider or to the SEV
firmware. The SVSM service provider is represented by the value
'svsm'.
- 'service_guid' (input)
Used for requesting the attestation of a single service within the
service provider. A null GUID implies that the SVSM_ATTEST_SERVICES
call should be used to request the attestation report. A non-null
GUID implies that the SVSM_ATTEST_SINGLE_SERVICE call should be used.
- 'service_manifest_version' (input)
Used with the SVSM_ATTEST_SINGLE_SERVICE call, the service version
represents a specific service manifest version to be used for the
attestation report.
- 'manifestblob' (output)
Used to return the service manifest associated with the attestation
report.
Only display these new attributes when running under an SVSM.
[ bp: Massage.
- s/svsm_attestation_call/svsm_attest_call/g ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/965015dce3c76bb8724839d50c5dea4e4b5d598f.1717600736.git.thomas.lendacky@amd.com
Currently, the sev-guest driver uses the vmpck-0 key by default. When an
SVSM is present, the kernel is running at a VMPL other than 0 and the
vmpck-0 key is no longer available. If a specific vmpck key has not been
requested by the user via the vmpck_id module parameter, choose the
vmpck key based on the active VMPL level.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/b88081c5d88263176849df8ea93e90a404619cab.1717600736.git.thomas.lendacky@amd.com
Requesting an attestation report from userspace involves providing the VMPL
level for the report. Currently any value from 0-3 is valid because Linux
enforces running at VMPL0.
When an SVSM is present, though, Linux will not be running at VMPL0, and only
VMPL values from the level Linux is running at up to 3 are valid. In
order to allow userspace to determine the minimum VMPL value that can be
supplied to an attestation report, create a sysfs entry that can be used to
retrieve the current VMPL level of the kernel.
[ bp: Add CONFIG_SYSFS ifdeffery. ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/fff846da0d8d561f9fdaf297dcf8cd907545a25b.1717600736.git.thomas.lendacky@amd.com
The SVSM specification documents an alternative method of discovery for
the SVSM using a reserved CPUID bit and a reserved MSR. This is intended
for guest components that do not have access to the secrets page in
order to be able to call the SVSM (e.g. UEFI runtime services).
For the MSR support, a new reserved MSR 0xc001f000 has been defined. A #VC
should be generated when accessing this MSR. The #VC handler is expected
to ignore writes to this MSR and return the physical calling area address
(CAA) on reads of this MSR.
While the CPUID leaf is updated, allowing the creation of a CPU feature,
the code will continue to use the VMPL level as an indication of the
presence of an SVSM. This is because the SVSM can be called well before
the CPU feature is in place and a non-zero VMPL requires that an SVSM be
present.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/4f93f10a2ff3e9f368fd64a5920d51bf38d0c19e.1717600736.git.thomas.lendacky@amd.com
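The documented MSR semantics, sketched as a #VC intercept (handler and
variable names are illustrative, not necessarily the kernel's):

	#define MSR_SVSM_CAA		0xc001f000

	static u64 boot_svsm_caa_pa;	/* physical CAA, illustrative storage */

	/*
	 * Illustrative #VC MSR intercept per the documented semantics:
	 * writes to the reserved MSR are ignored, reads return the CAA PA.
	 */
	static bool vc_handle_svsm_caa_msr(struct pt_regs *regs, bool write)
	{
		if (write)
			return true;

		regs->ax = lower_32_bits(boot_svsm_caa_pa);
		regs->dx = upper_32_bits(boot_svsm_caa_pa);
		return true;
	}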
Using the RMPADJUST instruction, the VMSA attribute can only be changed
at VMPL0. An SVSM will be present when running at VMPL1 or a lower
privilege level.
In that case, use the SVSM_CORE_CREATE_VCPU call or the
SVSM_CORE_DESTROY_VCPU call to perform VMSA attribute changes. Use the
VMPL level supplied by the SVSM for the VMSA when starting the AP.
[ bp: Fix typo + touchups. ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/bcdd95ecabe9723673b9693c7f1533a2b8f17781.1717600736.git.thomas.lendacky@amd.com
The PVALIDATE instruction can only be performed at VMPL0. If an SVSM is
present, it will be running at VMPL0 while the guest itself is then
running at VMPL1 or a lower privilege level.
In that case, use the SVSM_CORE_PVALIDATE call to perform memory
validation instead of issuing the PVALIDATE instruction directly.
The validation of a single 4K page is now explicitly identified as such
in the function name, pvalidate_4k_page(). The pvalidate_pages()
function is used for validating 1 or more pages at either 4K or 2M in
size. Each function, however, determines whether it can issue the
PVALIDATE directly or whether the SVSM needs to be invoked.
[ bp: Touchups. ]
[ Tom: fold in a fix for Coconut SVSM:
https://lore.kernel.org/r/234bb23c-d295-76e5-a690-7ea68dc1118b@amd.com ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/4c4017d8b94512d565de9ccb555b1a9f8983c69c.1717600736.git.thomas.lendacky@amd.com
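A condensed sketch of the dispatch; svsm_pvalidate_4k_page() and the
snp_vmpl check are illustrative names for "an SVSM is present":

	static void pvalidate_4k_page(unsigned long vaddr, unsigned long paddr,
				      bool validate)
	{
		int ret;

		/*
		 * PVALIDATE is VMPL0-only. With an SVSM present (non-zero
		 * VMPL), ask the SVSM to do the validation instead.
		 */
		if (snp_vmpl)
			ret = svsm_pvalidate_4k_page(paddr, validate);
		else
			ret = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);

		if (ret)
			sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
	}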
MADT Multiprocessor Wakeup structure version 1 brings support for CPU offlining:
the BIOS provides a reset vector the CPU has to jump to in order to offline itself.
The new TEST mailbox command can be used to test whether the CPU offlined itself
which means the BIOS has control over the CPU and can online it again via the
ACPI MADT wakeup method.
Add CPU offlining support for the ACPI MADT wakeup method by implementing custom
cpu_die(), play_dead() and stop_this_cpu() SMP operations.
CPU offlining makes it possible to hand over secondary CPUs over kexec, not
limiting the second kernel to a single CPU.
The change conforms to the approved ACPI spec change proposal. See the Link.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher
Link: https://lore.kernel.org/r/20240614095904.1345461-19-kirill.shutemov@linux.intel.com
If the helper is defined, it is called instead of halt() to stop the CPU at the
end of stop_this_cpu() and on crash CPU shutdown.
ACPI MADT will use it to hand over the CPU to BIOS in order to be able to wake
it up again after kexec.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-17-kirill.shutemov@linux.intel.com
ACPI MADT doesn't allow offlining a CPU after it has been onlined. This limits
kexec: the second kernel won't be able to use more than one CPU.
To prevent a kexec kernel from onlining secondary CPUs, invalidate the mailbox
address in the ACPI MADT wakeup structure, which prevents the kexec'd kernel
from using it.
This is safe as the booting kernel has the mailbox address cached already and
acpi_wakeup_cpu() uses the cached value to bring up the secondary CPUs.
Note: This is a Linux specific convention and not covered by the ACPI
specification.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-16-kirill.shutemov@linux.intel.com
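A compact sketch of the invalidation; the helper name is illustrative, the
cached-address variable matches the one discussed below:

	static void __init acpi_mp_disable_offlining(struct acpi_madt_multiproc_wakeup *mp_wake)
	{
		cpu_hotplug_disable_offlining();

		/*
		 * Zap the mailbox address so a kexec'd kernel cannot use it.
		 * The booting kernel keeps using its cached copy. This is a
		 * Linux-only convention, not part of the ACPI spec.
		 */
		acpi_mp_wake_mailbox_paddr = 0;
		mp_wake->mailbox_address = 0;
	}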
In order to support MADT wakeup structure version 1, provide more appropriate
names for the fields in the structure.
Rename 'mailbox_version' to 'version'. This field signifies the version of the
structure and the related protocols, rather than the version of the mailbox.
This field has not been utilized in the code thus far.
Rename 'base_address' to 'mailbox_address' to clarify the kind of address it
represents. In version 1, the structure includes the reset vector address. Clear
and distinct naming helps to prevent any confusion.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-15-kirill.shutemov@linux.intel.com
e820__end_of_ram_pfn() is used to calculate max_pfn which, among other things,
guides where direct mapping ends. Any memory above max_pfn is not going to be
present in the direct mapping.
e820__end_of_ram_pfn() finds the end of the RAM based on the highest
E820_TYPE_RAM range. But it doesn't include E820_TYPE_ACPI ranges in the
calculation.
Despite the name, E820_TYPE_ACPI covers not only ACPI data but also EFI tables,
and might be required by the kernel to function properly.
Usually the problem is hidden because there is some E820_TYPE_RAM memory above
E820_TYPE_ACPI. But crashkernel only presents pre-allocated crash memory as
E820_TYPE_RAM on boot. If the pre-allocated range is small, it can fit under the
last E820_TYPE_ACPI range.
Modify e820__end_of_ram_pfn() and e820__end_of_low_ram_pfn() to cover
E820_TYPE_ACPI memory.
The problem was discovered during debugging kexec for TDX guest. TDX guest uses
E820_TYPE_ACPI to store the unaccepted memory bitmap and pass it between the
kernels on kexec.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-13-kirill.shutemov@linux.intel.com
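A simplified sketch of the resulting scan, assuming direct iteration over
e820_table (the real implementation goes through the e820__end_of_ram_pfn()
plumbing):

	/* Treat E820_TYPE_ACPI like RAM when computing the last page frame. */
	static unsigned long __init e820_end_ram_pfn(unsigned long limit_pfn)
	{
		unsigned long last_pfn = 0;
		int i;

		for (i = 0; i < e820_table->nr_entries; i++) {
			struct e820_entry *entry = &e820_table->entries[i];
			unsigned long end_pfn;

			if (entry->type != E820_TYPE_RAM &&
			    entry->type != E820_TYPE_ACPI)	/* now included */
				continue;

			end_pfn = (entry->addr + entry->size) >> PAGE_SHIFT;
			if (end_pfn > limit_pfn)
				end_pfn = limit_pfn;
			if (end_pfn > last_pfn)
				last_pfn = end_pfn;
		}
		return last_pfn;
	}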
AMD SEV and Intel TDX guests allocate shared buffers for performing I/O.
This is done by allocating pages normally from the buddy allocator and
then converting them to shared using set_memory_decrypted().
On kexec, the second kernel is unaware of which memory has been
converted in this manner. It only sees E820_TYPE_RAM. Accessing shared
memory as private is fatal.
Therefore, the memory state must be reset to its original state before
starting the new kernel with kexec.
The process of converting shared memory back to private occurs in two
steps:
- enc_kexec_begin() stops new conversions.
- enc_kexec_finish() unshares all existing shared memory, reverting it
back to private.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-11-kirill.shutemov@linux.intel.com
TDX is going to have more than one reason to fail enc_status_change_prepare().
Change the callback to return errno instead of assuming -EIO. Change
enc_status_change_finish() too to keep the interface symmetric.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-8-kirill.shutemov@linux.intel.com
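Sketched against the TDX backend, the prepare callback now propagates an
errno rather than a bool (a minimal sketch, not necessarily the exact patch):

	static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
						 bool enc)
	{
		/*
		 * Only the private->shared direction needs preparatory work
		 * here; report failure as an errno instead of letting the
		 * caller assume -EIO.
		 */
		if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
			return -EIO;

		return 0;
	}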
TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
that bit is cleared during CR4 register reprogramming during boot or kexec
flows, a #VE exception will be raised which the guest kernel cannot handle.
Therefore, make sure the CR4.MCE setting is preserved over kexec too and avoid
raising any #VEs.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240614095904.1345461-7-kirill.shutemov@linux.intel.com
That identity_mapped() function was loving that "1" label to the point of
completely confusing its readers.
Use named labels in each place for clarity.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-6-kirill.shutemov@linux.intel.com
ACPI MADT doesn't allow offlining a CPU after it has been woken up.
Currently, CPU hotplug is prevented based on the confidential computing
attribute which is set for Intel TDX. But TDX is not the only possible user of
the wakeup method. Any platform that uses the ACPI MADT wakeup method cannot
offline CPUs.
Disable CPU offlining on ACPI MADT wakeup enumeration.
This has no visible effects for users: currently, TDX guest is the only platform
that uses the ACPI MADT wakeup method.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-5-kirill.shutemov@linux.intel.com
acpi_mp_wake_mailbox_paddr and acpi_mp_wake_mailbox are initialized once during
ACPI MADT init and never changed.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-3-kirill.shutemov@linux.intel.com
In order to prepare for the expansion of support for the ACPI MADT
wakeup method, move the relevant code into a separate file.
Introduce a new configuration option to clearly indicate dependencies
without the use of ifdefs.
There have been no functional changes.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-2-kirill.shutemov@linux.intel.com
This seemingly straightforward JMP was introduced in the initial version
of the 64-bit kexec code without any explanation.
It turns out (check accompanying Link) it's likely a copy/paste artefact
from 32-bit code, where such a JMP could be used as a serializing
instruction for the 486's prefetch queue. On x86_64 that's not needed
because there's already a preceding write to cr4 which itself is
a serializing operation.
[ bp: Typos. Let's try this and see what cries out. If it does,
reverting it is trivial. ]
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/55bc0649-c017-49ab-905d-212f140a403f@citrix.com/
The routine is used on syscall exit and is guaranteed to be empty on
non-AMD CPUs.
It probably does not need to be a function call even on CPUs which do need the
mitigation.
[ bp: Make sure it is always inlined so that noinstr marking works. ]
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613082637.659133-1-mjguzik@gmail.com
AMD Zen-based systems use a System Management Network (SMN) that
provides access to implementation-specific registers.
SMN accesses are done indirectly through an index/data pair in PCI
config space. The accesses can fail for a variety of reasons.
Include code comments to describe some possible scenarios.
Require error checking for callers of amd_smn_read() and amd_smn_write().
This is needed because many error conditions cannot be checked by these
functions.
[ bp: Touchup comment. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20240606-fix-smn-bad-read-v4-4-ffde21931c3f@amd.com
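An illustrative caller-side pattern after this change; the SMN address below
is a placeholder:

	static int example_read_smn(u16 node_id)
	{
		u32 value;
		int err;

		err = amd_smn_read(node_id, 0x50d80 /* illustrative SMN address */, &value);
		if (err) {
			pr_warn("SMN read failed: %d\n", err);
			return err;
		}

		pr_info("SMN value: %#x\n", value);
		return 0;
	}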
Add a uretprobe syscall instead of a trap to speed up return probes.
At the moment the uretprobe setup/path is:
- install entry uprobe
- when the uprobe is hit, it overwrites probed function's return address
on stack with address of the trampoline that contains breakpoint
instruction
- the breakpoint trap code handles the uretprobe consumers execution and
jumps back to original return address
This patch replaces the above trampoline's breakpoint instruction with the
new uretprobe syscall. This syscall does exactly the same job as the trap,
with some extra work:
- the syscall trampoline must save the original values of the rax/r11/rcx
registers on the stack - rax is set to the syscall number, and r11/rcx are
changed and used by the syscall instruction
- the syscall code reads the original values of those registers and
restores them in the task's pt_regs area
- only calls from the trampoline exposed in '[uprobes]' are allowed;
the process receives a SIGILL signal otherwise
Even with the extra work, using the uretprobe syscall shows a speed
improvement (compared to using the standard breakpoint):
On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)
current:
uretprobe-nop : 1.498 ± 0.000M/s
uretprobe-push : 1.448 ± 0.001M/s
uretprobe-ret : 0.816 ± 0.001M/s
with the fix:
uretprobe-nop : 1.969 ± 0.002M/s < 31% speed up
uretprobe-push : 1.910 ± 0.000M/s < 31% speed up
uretprobe-ret : 0.934 ± 0.000M/s < 14% speed up
On AMD (AMD Ryzen 7 5700U)
current:
uretprobe-nop : 0.778 ± 0.001M/s
uretprobe-push : 0.744 ± 0.001M/s
uretprobe-ret : 0.540 ± 0.001M/s
with the fix:
uretprobe-nop : 0.860 ± 0.001M/s < 10% speed up
uretprobe-push : 0.818 ± 0.001M/s < 10% speed up
uretprobe-ret : 0.578 ± 0.000M/s < 7% speed up
The performance test spawns a thread that runs a loop which triggers a
uprobe with an attached bpf program that increments the counter printed
in the results above.
The uprobe (and uretprobe) kind is determined by which instruction is
being patched with the breakpoint instruction. That's also important for
uretprobes, because a uprobe is installed for each uretprobe.
The performance test is part of bpf selftests:
tools/testing/selftests/bpf/run_bench_uprobes.sh
Note that at the moment the uretprobe syscall is supported only for native
64-bit processes; compat processes still use the standard breakpoint.
Note that when shadow stack is enabled, the uretprobe syscall returns
via iret, which is slower than returning via sysret, but won't cause a
shadow stack violation.
Link: https://lore.kernel.org/all/20240611112158.40795-4-jolsa@kernel.org/
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Currently, an application with shadow stack enabled will crash if it
sets up a return uprobe. The reason is that the uretprobe kernel code
changes the user space task's stack, but does not update the shadow
stack accordingly.
Add new functions to update values on the shadow stack and use them in
the uprobe code to keep the shadow stack in sync with uretprobe changes
to the user stack.
Link: https://lore.kernel.org/all/20240611112158.40795-2-jolsa@kernel.org/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Fixes: 488af8ea71 ("x86/shstk: Wire in shadow stack interface")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Some AMD Zen 4 processors support a new feature, FAST CPPC, which allows
for a faster CPPC loop due to internal architectural enhancements. The
goal of this faster loop is higher performance at the same power
consumption.
Reference: see page 99 of the PPR for AMD Family 19h Model 61h Rev B1,
docID 56713.
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Xiaojian Du <Xiaojian.Du@amd.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Instead of making increasingly complicated ALTERNATIVE_n()
implementations, use a nested alternative expression.
The only difference between:
ALTERNATIVE_2(oldinst, newinst1, flag1, newinst2, flag2)
and
ALTERNATIVE(ALTERNATIVE(oldinst, newinst1, flag1),
newinst2, flag2)
is that the outer alternative can add additional padding when the inner
alternative is the shorter one, which then results in
alt_instr::instrlen being inconsistent.
However, this is easily remedied since the alt_instr entries will be
consecutive and it is trivial to compute the max(alt_instr::instrlen) at
runtime while patching.
Specifically, with this change the ALTERNATIVE_2 macro, after CPP expansion
(and manual layout), looks like this:
.macro ALTERNATIVE_2 oldinstr, newinstr1, ft_flags1, newinstr2, ft_flags2
740:
740: \oldinstr ;
741: .skip -(((744f-743f)-(741b-740b)) > 0) * ((744f-743f)-(741b-740b)),0x90 ;
742: .pushsection .altinstructions,"a" ;
altinstr_entry 740b,743f,\ft_flags1,742b-740b,744f-743f ;
.popsection ;
.pushsection .altinstr_replacement,"ax" ;
743: \newinstr1 ;
744: .popsection ; ;
741: .skip -(((744f-743f)-(741b-740b)) > 0) * ((744f-743f)-(741b-740b)),0x90 ;
742: .pushsection .altinstructions,"a" ;
altinstr_entry 740b,743f,\ft_flags2,742b-740b,744f-743f ;
.popsection ;
.pushsection .altinstr_replacement,"ax" ;
743: \newinstr2 ;
744: .popsection ;
.endm
The only label that is ambiguous is 740, however they all reference the
same spot, so that doesn't matter.
NOTE: obviously only @oldinstr may be an alternative; making @newinstr
an alternative would mean patching .altinstr_replacement which very
likely isn't what is intended, also the labels will be confused in that
case.
[ bp: Debug an issue where it would match the wrong two insns and
consider them nested due to the same signed offsets in the
.alternative section and use instr_va() to compare the full virtual
addresses instead.
- Use new labels to denote that the new, nested
alternatives are being used when staring at preprocessed output.
- Use the %c constraint everywhere instead of %P and document the
difference for future reference. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230628104952.GA2439977@hirez.programming.kicks-ass.net
The SVSM Calling Area (CA) is used to communicate between Linux and the
SVSM. Since the firmware-supplied CA for the BSP is likely to be in
reserved memory, switch from that CA to a kernel-provided CA so that the
CA can be accessed and used during boot. The CA switch is done using
the SVSM core protocol SVSM_CORE_REMAP_CA call.
An SVSM call is executed by filling out the SVSM CA and setting the proper
register state as documented by the SVSM protocol. The SVSM is invoked by
requesting the hypervisor to run VMPL0.
Once it is safe to allocate/reserve memory, allocate a CA for each CPU.
After allocating the new CAs, the BSP will switch from the boot CA to the
per-CPU CA. The CA for an AP is identified to the SVSM when creating the
VMSA in preparation for booting the AP.
[ bp: Heavily simplify svsm_issue_call() asm, other touchups. ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/fa8021130bcc3bcf14d722a25548cb0cdf325456.1717600736.git.thomas.lendacky@amd.com
During early boot phases, check for the presence of an SVSM when running
as an SEV-SNP guest.
An SVSM is present if not running at VMPL0 and the 64-bit value at offset
0x148 into the secrets page is non-zero. If an SVSM is present, save the
SVSM Calling Area address (CAA), located at offset 0x150 into the secrets
page, and set the VMPL level of the guest, which should be non-zero, to
indicate the presence of an SVSM.
[ bp: Touchups. ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/9d3fe161be93d4ea60f43c2a3f2c311fe708b63b.1717600736.git.thomas.lendacky@amd.com
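A condensed sketch of the detection; running_at_vmpl0() and the
snp_vmpl/boot_svsm_caa_pa names are illustrative, and the field comments
carry the offsets from the description above:

	static void __init detect_svsm(struct snp_secrets_page *secrets)
	{
		/* At VMPL0 there is no SVSM to talk to. */
		if (running_at_vmpl0())
			return;

		/* 64-bit value at offset 0x148: zero means no SVSM. */
		if (!secrets->svsm_size)
			return;

		/* Offset 0x150: the SVSM Calling Area address (CAA). */
		boot_svsm_caa_pa = secrets->svsm_caa;

		/* Non-zero VMPL indicates the presence of an SVSM. */
		snp_vmpl = secrets->svsm_guest_vmpl;
	}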
pseudo_lock_region_init() and rdtgroup_cbm_to_size() open code a search for
details of a particular cache level.
Replace with get_cpu_cacheinfo_level().
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240610003927.341707-5-tony.luck@intel.com
AMD Zen-based systems use a System Management Network (SMN) that
provides access to implementation-specific registers.
SMN accesses are done indirectly through an index/data pair in PCI
config space. The PCI config access may fail and return an error code.
This would prevent the "read" value from being updated.
However, the PCI config access may succeed, but the return value may be
invalid. This is similar to bad PCI reads, which return all bits set.
Most systems will return 0 for SMN addresses that are not accessible.
This is in line with AMD convention that unavailable registers are
Read-as-Zero/Writes-Ignored.
However, some systems will return a "PCI Error Response" instead. This
value, along with an error code of 0 from the PCI config access, will
confuse callers of the amd_smn_read() function.
Check for this condition, clear the return value, and set a proper error
code.
Fixes: ddfe43cdc0 ("x86/amd_nb: Add SMN and Indirect Data Fabric access for AMD Fam17h")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230403164244.471141-1-yazen.ghannam@amd.com
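A sketch of the added check; __amd_smn_rw() stands in for the existing
index/data access helper and is an illustrative name:

	/* Abbreviated sketch of amd_smn_read() with the added check. */
	int amd_smn_read(u16 node, u32 address, u32 *value)
	{
		int err = __amd_smn_rw(node, address, value, false);

		/* A successful read of the error-response pattern is an error. */
		if (!err && PCI_POSSIBLE_ERROR(*value)) {
			err = -ENODEV;
			*value = 0;
		}

		return err;
	}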
The call to cc_platform_has() triggers a fault and system crash if call depth
tracking is active because the GS segment has been reset by load_segments() and
GS_BASE is now 0 but call depth tracking uses per-CPU variables to operate.
Call cc_platform_has() earlier in the function when GS is still valid.
[ bp: Massage. ]
Fixes: 5d8213864a ("x86/retbleed: Add SKL return thunk")
Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20240603083036.637-1-bp@kernel.org
convert_art_to_tsc() and convert_art_ns_to_tsc() interfaces are no
longer required. The conversion is now handled by the core code.
Signed-off-by: Lakshmi Sowjanya D <lakshmi.sowjanya.d@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240513103813.5666-9-lakshmi.sowjanya.d@intel.com
The core code provides a new mechanism to allow conversion between ART and
TSC. This allows replacing the x86-specific ART/TSC conversion functions.
Prepare for removal by filling in the base clock conversion information for
ART and associating the base clock to the TSC clocksource.
The existing conversion functions will be removed once the usage sites are
converted over to the new model.
[ tglx: Massaged change log ]
Co-developed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Christopher S. Hall <christopher.s.hall@intel.com>
Signed-off-by: Christopher S. Hall <christopher.s.hall@intel.com>
Signed-off-by: Lakshmi Sowjanya D <lakshmi.sowjanya.d@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240513103813.5666-3-lakshmi.sowjanya.d@intel.com
Merge tag 'x86-urgent-2024-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Miscellaneous topology parsing fixes:
- Fix topology parsing regression on older CPUs in the new AMD/Hygon
parser
- Fix boot crash on odd Intel Quark and similar CPUs that do not fill
out cpuinfo_x86::x86_clflush_size and zero out
cpuinfo_x86::x86_cache_alignment as a result.
Provide 32 bytes as a general fallback value.
- Fix topology enumeration on certain rare CPUs where the BIOS locks
certain CPUID leaves and the kernel unlocked them late, which broke
with the new topology parsing code. Factor out this unlocking logic
and move it earlier in the parsing sequence"
* tag 'x86-urgent-2024-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/topology/intel: Unlock CPUID before evaluating anything
x86/cpu: Provide default cache line size if not enumerated
x86/topology/amd: Evaluate SMT in CPUID leaf 0x8000001e only on family 0x17 and greater
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'sched-urgent-2024-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Export a symbol to make life easier for instrumentation/debugging"
* tag 'sched-urgent-2024-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/x86: Export 'percpu arch_freq_scale'
make W=1 C=1 warns:
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kernel/cpu/mce/mce-inject.o
Add the missing MODULE_DESCRIPTION().
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240530-md-x86-mce-inject-v1-1-2a9dc998f709@quicinc.com
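The fix is the usual one-liner; the description string here is illustrative,
not the committed wording:

	#include <linux/module.h>

	MODULE_DESCRIPTION("Machine check injection support");	/* illustrative wording */
	MODULE_LICENSE("GPL");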
Intel CPUs have an MSR bit to limit CPUID enumeration to leaf two. If
this bit is set by the BIOS then CPUID evaluation including topology
enumeration does not work correctly as the evaluation code does not try
to analyze any leaf greater than two.
This went unnoticed before because the original topology code just
repeated evaluation several times and managed to overwrite the initial
limited information with the correct one later. The new evaluation code
does it once and therefore ends up with the limited and wrong
information.
Cure this by unlocking CPUID right before evaluating anything which
depends on the maximum CPUID leaf being greater than two instead of
rereading stuff after unlock.
Fixes: 22d63660c3 ("x86/cpu: Use common topology code for Intel")
Reported-by: Peter Schneider <pschneider1968@googlemail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Peter Schneider <pschneider1968@googlemail.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/fd3f73dc-a86f-4bcf-9c60-43556a21eb42@googlemail.com
Clean up the top level include/drm directory by grouping all the Intel
specific files under a common subdirectory.
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Lucas De Marchi <lucas.demarchi@intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/a19cebc0f03588b9627dcaaebe69a9fef28c27f0.1717075103.git.jani.nikula@intel.com