2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

20103 Commits

Author SHA1 Message Date
Ravi Bangoria
350afa8a11 x86/split_lock: Move Split and Bus lock code to a dedicated file
Bus Lock Detect functionality on AMD platforms works identical to Intel.

Move split_lock and bus_lock specific code from intel.c to a dedicated
file so that it can be compiled and supported on non-Intel platforms.

Also, introduce CONFIG_X86_BUS_LOCK_DETECT, make it dependent on
CONFIG_CPU_SUP_INTEL and add compilation dependency of the new bus_lock.c
file on CONFIG_X86_BUS_LOCK_DETECT.

Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/all/20240808062937.1149-2-ravi.bangoria@amd.com
2024-08-08 18:02:15 +02:00
Dmitry Vyukov
ae94b263f5 x86: Ignore stack unwinding in KCOV
Stack unwinding produces large amounts of uninteresting coverage.
It's called from KASAN kmalloc/kfree hooks, fault injection, etc.
It's not particularly useful and is not a function of system call args.
Ignore that code.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/eaf54b8634970b73552dcd38bf9be6ef55238c10.1718092070.git.dvyukov@google.com
2024-08-08 17:36:35 +02:00
Andi Kleen
919f18f961 x86/mtrr: Check if fixed MTRRs exist before saving them
MTRRs have an obsolete fixed variant for fine grained caching control
of the 640K-1MB region that uses separate MSRs. This fixed variant has
a separate capability bit in the MTRR capability MSR.

So far all x86 CPUs which support MTRR have this separate bit set, so it
went unnoticed that mtrr_save_state() does not check the capability bit
before accessing the fixed MTRR MSRs.

Though on a CPU that does not support the fixed MTRR capability this
results in a #GP.  The #GP itself is harmless because the RDMSR fault is
handled gracefully, but results in a WARN_ON().

Add the missing capability check to prevent this.

Fixes: 2b1f6278d7 ("[PATCH] x86: Save the MTRRs of the BSP before booting an AP")
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240808000244.946864-1-ak@linux.intel.com
2024-08-08 17:03:12 +02:00
Chen Yu
e639222a51 x86/paravirt: Fix incorrect virt spinlock setting on bare metal
The kernel can change spinlock behavior when running as a guest. But this
guest-friendly behavior causes performance problems on bare metal.

The kernel uses a static key to switch between the two modes.

In theory, the static key is enabled by default (run in guest mode) and
should be disabled for bare metal (and in some guests that want native
behavior or paravirt spinlock).

A performance drop is reported when running encode/decode workload and
BenchSEE cache sub-workload.

Bisect points to commit ce0a1b608b ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS is
disabled the virt_spin_lock_key is incorrectly set to true on bare
metal. The qspinlock degenerates to test-and-set spinlock, which decreases
the performance on bare metal.

Set the default value of virt_spin_lock_key to false. If booting in a VM,
enable this key. Later during the VM initialization, if other
high-efficient spinlock is preferred (e.g. paravirt-spinlock), or the user
wants the native qspinlock (via nopvspin boot commandline), the
virt_spin_lock_key is disabled accordingly.

This results in the following decision matrix:

X86_FEATURE_HYPERVISOR         Y    Y       Y     N
CONFIG_PARAVIRT_SPINLOCKS      Y    Y       N     Y/N
PV spinlock                    Y    N       N     Y/N

virt_spin_lock_key             N    Y/N     Y     N

Fixes: ce0a1b608b ("x86/paravirt: Silence unused native_pv_lock_init() function warning")
Reported-by: Prem Nath Dey <prem.nath.dey@intel.com>
Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Suggested-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240806112207.29792-1-yu.c.chen@intel.com
2024-08-07 20:04:38 +02:00
Zhiquan Li
ab84ba647f x86/acpi: Remove __ro_after_init from acpi_mp_wake_mailbox
On a platform using the "Multiprocessor Wakeup Structure"[1] to startup
secondary CPUs the control processor needs to memremap() the physical
address of the MP Wakeup Structure mailbox to the variable
acpi_mp_wake_mailbox, which holds the virtual address of mailbox.

To wake up the AP the control processor writes the APIC ID of AP, the
wakeup vector and the ACPI_MP_WAKE_COMMAND_WAKEUP command into the mailbox.

Current implementation doesn't consider the case which restricts boot time
CPU bringup to 1 with the kernel parameter "maxcpus=1" and brings other
CPUs online later from user space as it sets acpi_mp_wake_mailbox to
read-only after init.  So when the first AP is tried to brought online
after init, the attempt to update the variable results in a kernel panic.

The memremap() call that initializes the variable cannot be moved into
acpi_parse_mp_wake() because memremap() is not functional at that point in
the boot process. Also as the APs might never be brought up, keep the
memremap() call in acpi_wakeup_cpu() so that the operation only takes place
when needed.

Fixes: 24dd05da8c ("x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init")
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20240805103531.1230635-1-zhiquan1.li@intel.com
2024-08-07 20:04:38 +02:00
Thomas Gleixner
62e303e346 x86/ioapic: Cleanup remaining coding style issues
Add missing new lines and reorder variable definitions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155441.158662179@linutronix.de
2024-08-07 18:13:29 +02:00
Thomas Gleixner
966e09b186 x86/ioapic: Cleanup line breaks
80 character limit is history.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155441.095653193@linutronix.de
2024-08-07 18:13:29 +02:00
Thomas Gleixner
4bcfdf76d7 x86/ioapic: Cleanup bracket usage
Add brackets around if/for constructs as required by coding style or remove
pointless line breaks to make it true single line statements which do not
require brackets.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155441.032045616@linutronix.de
2024-08-07 18:13:29 +02:00
Thomas Gleixner
75d449402b x86/ioapic: Cleanup comments
Use proper comment styles and shrink comments to their scope where
applicable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.969619978@linutronix.de
2024-08-07 18:13:28 +02:00
Thomas Gleixner
ee64510fb9 x86/ioapic: Move replace_pin_at_irq_node() to the call site
It's only used by check_timer().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.906636514@linutronix.de
2024-08-07 18:13:28 +02:00
Thomas Gleixner
1ee0aa8285 x86/mpparse: Cleanup apic_printk()s
Use the new apic_pr_verbose() helper.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.779537108@linutronix.de
2024-08-07 18:13:28 +02:00
Thomas Gleixner
54cd3795b4 x86/ioapic: Cleanup guarded debug printk()s
Cleanup the APIC printk()s which are inside of a apic verbosity guarded
region by using apic_dbg() for the KERN_DEBUG level prints.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.714763708@linutronix.de
2024-08-07 18:13:28 +02:00
Thomas Gleixner
f47998da39 x86/ioapic: Cleanup apic_printk()s
Replace apic_printk($LEVEL) with the corresponding apic_pr_*() helpers and
use pr_info() for APIC_QUIET as that is always printed so the indirection
is pointless noise.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.652239904@linutronix.de
2024-08-07 18:13:28 +02:00
Thomas Gleixner
ac1c9fc1b5 x86/apic: Cleanup apic_printk()s
Use the new apic_pr_*() helpers and cleanup the apic_printk() maze.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.589821068@linutronix.de
2024-08-07 18:13:28 +02:00
Thomas Gleixner
ed57538b85 x86/ioapic: Use guard() for locking where applicable
KISS rules!

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.464227224@linutronix.de
2024-08-07 18:13:27 +02:00
Thomas Gleixner
d8c76d0167 x86/ioapic: Cleanup structs
Make them conforming to the TIP coding style guide.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.402005874@linutronix.de
2024-08-07 18:13:27 +02:00
Thomas Gleixner
6daceb891d x86/ioapic: Mark mp_alloc_timer_irq() __init
Only invoked from check_timer() which is __init too. Cleanup the variable
declaration while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20240802155440.339321108@linutronix.de
2024-08-07 18:13:27 +02:00
Thomas Gleixner
830802a0fe x86/ioapic: Handle allocation failures gracefully
Breno observed panics when using failslab under certain conditions during
runtime:

   can not alloc irq_pin_list (-1,0,20)
   Kernel panic - not syncing: IO-APIC: failed to add irq-pin. Can not proceed

   panic+0x4e9/0x590
   mp_irqdomain_alloc+0x9ab/0xa80
   irq_domain_alloc_irqs_locked+0x25d/0x8d0
   __irq_domain_alloc_irqs+0x80/0x110
   mp_map_pin_to_irq+0x645/0x890
   acpi_register_gsi_ioapic+0xe6/0x150
   hpet_open+0x313/0x480

That's a pointless panic which is a leftover of the historic IO/APIC code
which panic'ed during early boot when the interrupt allocation failed.

The only place which might justify panic is the PIT/HPET timer_check() code
which tries to figure out whether the timer interrupt is delivered through
the IO/APIC. But that code does not require to handle interrupt allocation
failures. If the interrupt cannot be allocated then timer delivery fails
and it either panics due to that or falls back to legacy mode.

Cure this by removing the panic wrapper around __add_pin_to_irq_node() and
making mp_irqdomain_alloc() aware of the failure condition and handle it as
any other failure in this function gracefully.

Reported-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Breno Leitao <leitao@debian.org>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/all/ZqfJmUF8sXIyuSHN@gmail.com
Link: https://lore.kernel.org/all/20240802155440.275200843@linutronix.de
2024-08-07 18:13:27 +02:00
Gatlin Newhouse
7424fc6b86 x86/traps: Enable UBSAN traps on x86
Currently ARM64 extracts which specific sanitizer has caused a trap via
encoded data in the trap instruction. Clang on x86 currently encodes the
same data in the UD1 instruction but x86 handle_bug() and
is_valid_bugaddr() currently only look at UD2.

Bring x86 to parity with ARM64, similar to commit 25b84002af ("arm64:
Support Clang UBSAN trap codes for better reporting"). See the llvm
links for information about the code generation.

Enable the reporting of UBSAN sanitizer details on x86 compiled with clang
when CONFIG_UBSAN_TRAP=y by analysing UD1 and retrieving the type immediate
which is encoded by the compiler after the UD1.

[ tglx: Simplified it by moving the printk() into handle_bug() ]

Signed-off-by: Gatlin Newhouse <gatlin.newhouse@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20240724000206.451425-1-gatlin.newhouse@gmail.com
Link: https://github.com/llvm/llvm-project/commit/c5978f42ec8e9#diff-bb68d7cd885f41cfc35843998b0f9f534adb60b415f647109e597ce448e92d9f
Link: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86InstrSystem.td#L27
2024-08-06 13:42:40 +02:00
Tao Liu
5760929f65 x86/kexec: Add EFI config table identity mapping for kexec kernel
A kexec kernel boot failure is sometimes observed on AMD CPUs due to an
unmapped EFI config table array.  This can be seen when "nogbpages" is on
the kernel command line, and has been observed as a full BIOS reboot rather
than a successful kexec.

This was also the cause of reported regressions attributed to Commit
7143c5f4cf20 ("x86/mm/ident_map: Use gbpages only where full GB page should
be mapped.") which was subsequently reverted.

To avoid this page fault, explicitly include the EFI config table array in
the kexec identity map.

Further explanation:

The following 2 commits caused the EFI config table array to be
accessed when enabling sev at kernel startup.

    commit ec1c66af3a ("x86/compressed/64: Detect/setup SEV/SME features
                          earlier during boot")
    commit c01fce9cef ("x86/compressed: Add SEV-SNP feature
                          detection/setup")

This is in the code that examines whether SEV should be enabled or not, so
it can even affect systems that are not SEV capable.

This may result in a page fault if the EFI config table array's address is
unmapped. Since the page fault occurs before the new kernel establishes its
own identity map and page fault routines, it is unrecoverable and kexec
fails.

Most often, this problem is not seen because the EFI config table array
gets included in the map by the luck of being placed at a memory address
close enough to other memory areas that *are* included in the map created
by kexec.

Both the "nogbpages" command line option and the "use gpbages only where
full GB page should be mapped" change greatly reduce the chance of being
included in the map by luck, which is why the problem appears.

Signed-off-by: Tao Liu <ltao@redhat.com>
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavin Joseph <me@pavinjoseph.com>
Tested-by: Sarah Brofeldt <srhb@dbc.dk>
Tested-by: Eric Hagberg <ehagberg@gmail.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/all/20240717213121.3064030-2-steve.wahl@hpe.com
2024-08-05 16:09:31 +02:00
Michael Kelley
8fcc514809 x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
A Linux guest on Hyper-V gets the TSC frequency from a synthetic MSR, if
available. In this case, set X86_FEATURE_TSC_KNOWN_FREQ so that Linux
doesn't unnecessarily do refined TSC calibration when setting up the TSC
clocksource.

With this change, a message such as this is no longer output during boot
when the TSC is used as the clocksource:

[    1.115141] tsc: Refined TSC clocksource calibration: 2918.408 MHz

Furthermore, the guest and host will have exactly the same view of the
TSC frequency, which is important for features such as the TSC deadline
timer that are emulated by the Hyper-V host.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Link: https://lore.kernel.org/r/20240606025559.1631-1-mhklinux@outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20240606025559.1631-1-mhklinux@outlook.com>
2024-08-02 23:47:30 +00:00
Paul E. McKenney
e7ff4ebffe x86/tsc: Check for sockets instead of CPUs to make code match comment
The unsynchronized_tsc() eventually checks num_possible_cpus(), and if the
system is non-Intel and the number of possible CPUs is greater than one,
assumes that TSCs are unsynchronized.  This despite the comment saying
"assume multi socket systems are not synchronized", that is, socket rather
than CPU.  This behavior was preserved by commit 8fbbc4b45c ("x86: merge
tsc_init and clocksource code") and by the previous relevant commit
7e69f2b1ea ("clocksource: Remove the update callback").

The clocksource drivers were added by commit 5d0cf410e9 ("Time: i386
Clocksource Drivers") back in 2006, and the comment still said "socket"
rather than "CPU".

Therefore, bravely (and perhaps foolishly) make the code match the
comment.

Note that it is possible to bypass both code and comment by booting
with tsc=reliable, but this also disables the clocksource watchdog,
which is undesirable when trust in the TSC is strictly limited.

Reported-by: Zhengxu Chen <zhxchen17@meta.com>
Reported-by: Danielle Costantino <dcostantino@meta.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802154618.4149953-5-paulmck@kernel.org
2024-08-02 18:38:07 +02:00
David Woodhouse
531b2ca0a9 clockevents/drivers/i8253: Fix stop sequence for timer 0
According to the data sheet, writing the MODE register should stop the
counter (and thus the interrupts). This appears to work on real hardware,
at least modern Intel and AMD systems. It should also work on Hyper-V.

However, on some buggy virtual machines the mode change doesn't have any
effect until the counter is subsequently loaded (or perhaps when the IRQ
next fires).

So, set MODE 0 and then load the counter, to ensure that those buggy VMs
do the right thing and the interrupts stop. And then write MODE 0 *again*
to stop the counter on compliant implementations too.

Apparently, Hyper-V keeps firing the IRQ *repeatedly* even in mode zero
when it should only happen once, but the second MODE write stops that too.

Userspace test program (mostly written by tglx):
=====
 #include <stdio.h>
 #include <unistd.h>
 #include <stdlib.h>
 #include <stdint.h>
 #include <sys/io.h>

static __always_inline void __out##bwl(type value, uint16_t port)	\
{									\
	asm volatile("out" #bwl " %" #bw "0, %w1"			\
		     : : "a"(value), "Nd"(port));			\
}									\
									\
static __always_inline type __in##bwl(uint16_t port)			\
{									\
	type value;							\
	asm volatile("in" #bwl " %w1, %" #bw "0"			\
		     : "=a"(value) : "Nd"(port));			\
	return value;							\
}

BUILDIO(b, b, uint8_t)

 #define inb __inb
 #define outb __outb

 #define PIT_MODE	0x43
 #define PIT_CH0	0x40
 #define PIT_CH2	0x42

static int is8254;

static void dump_pit(void)
{
	if (is8254) {
		// Latch and output counter and status
		outb(0xC2, PIT_MODE);
		printf("%02x %02x %02x\n", inb(PIT_CH0), inb(PIT_CH0), inb(PIT_CH0));
	} else {
		// Latch and output counter
		outb(0x0, PIT_MODE);
		printf("%02x %02x\n", inb(PIT_CH0), inb(PIT_CH0));
	}
}

int main(int argc, char* argv[])
{
	int nr_counts = 2;

	if (argc > 1)
		nr_counts = atoi(argv[1]);

	if (argc > 2)
		is8254 = 1;

	if (ioperm(0x40, 4, 1) != 0)
		return 1;

	dump_pit();

	printf("Set oneshot\n");
	outb(0x38, PIT_MODE);
	outb(0x00, PIT_CH0);
	outb(0x0F, PIT_CH0);

	dump_pit();
	usleep(1000);
	dump_pit();

	printf("Set periodic\n");
	outb(0x34, PIT_MODE);
	outb(0x00, PIT_CH0);
	outb(0x0F, PIT_CH0);

	dump_pit();
	usleep(1000);
	dump_pit();
	dump_pit();
	usleep(100000);
	dump_pit();
	usleep(100000);
	dump_pit();

	printf("Set stop (%d counter writes)\n", nr_counts);
	outb(0x30, PIT_MODE);
	while (nr_counts--)
		outb(0xFF, PIT_CH0);

	dump_pit();
	usleep(100000);
	dump_pit();
	usleep(100000);
	dump_pit();

	printf("Set MODE 0\n");
	outb(0x30, PIT_MODE);

	dump_pit();
	usleep(100000);
	dump_pit();
	usleep(100000);
	dump_pit();

	return 0;
}
=====

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhkelley@outlook.com>
Link: https://lore.kernel.org/all/20240802135555.564941-2-dwmw2@infradead.org
2024-08-02 18:27:05 +02:00
David Woodhouse
70e6b7d9ae x86/i8253: Disable PIT timer 0 when not in use
Leaving the PIT interrupt running can cause noticeable steal time for
virtual guests. The VMM generally has a timer which toggles the IRQ input
to the PIC and I/O APIC, which takes CPU time away from the guest. Even
on real hardware, running the counter may use power needlessly (albeit
not much).

Make sure it's turned off if it isn't going to be used.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhkelley@outlook.com>
Link: https://lore.kernel.org/all/20240802135555.564941-1-dwmw2@infradead.org
2024-08-02 18:27:05 +02:00
Aruna Ramakrishna
d10b554919 x86/pkeys: Restore altstack access in sigreturn()
A process can disable access to the alternate signal stack by not
enabling the altstack's PKEY in the PKRU register.

Nevertheless, the kernel updates the PKRU temporarily for signal
handling. However, in sigreturn(), restore_sigcontext() will restore the
PKRU to the user-defined PKRU value.

This will cause restore_altstack() to fail with a SIGSEGV as it needs read
access to the altstack which is prohibited by the user-defined PKRU value.

Fix this by restoring altstack before restoring PKRU.

Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802061318.2140081-5-aruna.ramakrishna@oracle.com
2024-08-02 14:12:21 +02:00
Aruna Ramakrishna
70044df250 x86/pkeys: Update PKRU to enable all pkeys before XSAVE
If the alternate signal stack is protected by a different PKEY than the
current execution stack, copying XSAVE data to the sigaltstack will fail
if its PKEY is not enabled in the PKRU register.

It's unknown which pkey was used by the application for the altstack, so
enable all PKEYS before XSAVE.

But this updated PKRU value is also pushed onto the sigframe, which
means the register value restored from sigcontext will be different from
the user-defined one, which is incorrect.

Fix that by overwriting the PKRU value on the sigframe with the original,
user-defined PKRU.

Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802061318.2140081-4-aruna.ramakrishna@oracle.com
2024-08-02 14:12:21 +02:00
Aruna Ramakrishna
84ee6e8d19 x86/pkeys: Add helper functions to update PKRU on the sigframe
In the case where a user thread sets up an alternate signal stack protected
by the default PKEY (i.e. PKEY 0), while the thread's stack is protected by
a non-zero PKEY, both these PKEYS have to be enabled in the PKRU register
for the signal to be delivered to the application correctly. However, the
PKRU value restored after handling the signal must not enable this extra
PKEY (i.e. PKEY 0) - i.e., the PKRU value in the sigframe has to be
overwritten with the user-defined value.

Add helper functions that will update PKRU value in the sigframe after
XSAVE.

Note that sig_prepare_pkru() makes no assumption about which PKEY could
be used to protect the altstack (i.e. it may not be part of init_pkru),
and so enables all PKEYS.

No functional change.

Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802061318.2140081-3-aruna.ramakrishna@oracle.com
2024-08-02 14:12:21 +02:00
Aruna Ramakrishna
24cf2bc982 x86/pkeys: Add PKRU as a parameter in signal handling functions
Assume there's a multithreaded application that runs untrusted user
code. Each thread has its stack/code protected by a non-zero PKEY, and the
PKRU register is set up such that only that particular non-zero PKEY is
enabled. Each thread also sets up an alternate signal stack to handle
signals, which is protected by PKEY zero. The PKEYs man page documents that
the PKRU will be reset to init_pkru when the signal handler is invoked,
which means that PKEY zero access will be enabled.  But this reset happens
after the kernel attempts to push fpu state to the alternate stack, which
is not (yet) accessible by the kernel, which leads to a new SIGSEGV being
sent to the application, terminating it.

Enabling both the non-zero PKEY (for the thread) and PKEY zero in
userspace will not work for this use case. It cannot have the alt stack
writeable by all - the rationale here is that the code running in that
thread (using a non-zero PKEY) is untrusted and should not have access
to the alternate signal stack (that uses PKEY zero), to prevent the
return address of a function from being changed. The expectation is that
kernel should be able to set up the alternate signal stack and deliver
the signal to the application even if PKEY zero is explicitly disabled
by the application. The signal handler accessibility should not be
dictated by whatever PKRU value the thread sets up.

The PKRU register is managed by XSAVE, which means the sigframe contents
must match the register contents - which is not the case here. It's
required that the signal frame contains the user-defined PKRU value (so
that it is restored correctly from sigcontext) but the actual register must
be reset to init_pkru so that the alt stack is accessible and the signal
can be delivered to the application. It seems that the proper fix here
would be to remove PKRU from the XSAVE framework and manage it separately,
which is quite complicated. As a workaround, do this:

        orig_pkru = rdpkru();
        wrpkru(orig_pkru & init_pkru_value);
        xsave_to_user_sigframe();
        put_user(pkru_sigframe_addr, orig_pkru)

In preparation for writing PKRU to sigframe, pass PKRU as an additional
parameter down the call chain from get_sigframe().

No functional change.

Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802061318.2140081-2-aruna.ramakrishna@oracle.com
2024-08-02 14:12:20 +02:00
Thomas Gleixner
4436e6da00 Merge branch 'linus' into x86/mm
Bring x86 and selftests up to date
2024-08-02 14:10:55 +02:00
Yazen Ghannam
793aa4bf19 x86/mce: Use mce_prep_record() helpers for apei_smca_report_x86_error()
Current AMD systems can report MCA errors using the ACPI Boot Error
Record Table (BERT). The BERT entries for MCA errors will be an x86
Common Platform Error Record (CPER) with an MSR register context that
matches the MCAX/SMCA register space.

However, the BERT will not necessarily be processed on the CPU that
reported the MCA errors. Therefore, the correct CPU number needs to be
determined and the information saved in struct mce.

Use the newly defined mce_prep_record_*() helpers to get the correct
data.

Also, add an explicit check to verify that a valid CPU number was found
from the APIC ID search.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20240730182958.4117158-4-yazen.ghannam@amd.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-08-01 18:20:25 +02:00
Yazen Ghannam
f9bbb8ad0c x86/mce: Define mce_prep_record() helpers for common and per-CPU fields
Generally, MCA information for an error is gathered on the CPU that
reported the error. In this case, CPU-specific information from the
running CPU will be correct.

However, this will be incorrect if the MCA information is gathered while
running on a CPU that didn't report the error. One example is creating
an MCA record using mce_prep_record() for errors reported from ACPI.

Split mce_prep_record() so that there is a helper function to gather
common, i.e. not CPU-specific, information and another helper for
CPU-specific information.

Leave mce_prep_record() defined as-is for the common case when running
on the reporting CPU.

Get MCG_CAP in the global helper even though the register is per-CPU.
This value is not already cached per-CPU like other values. And it does
not assist with any per-CPU decoding or handling.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20240730182958.4117158-3-yazen.ghannam@amd.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-08-01 18:20:25 +02:00
Yazen Ghannam
5ad21a2497 x86/mce: Rename mce_setup() to mce_prep_record()
There is no MCE "setup" done in mce_setup(). Rather, this function initializes
and prepares an MCE record.

Rename the function to highlight what it does.

No functional change is intended.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20240730182958.4117158-2-yazen.ghannam@amd.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-08-01 18:20:24 +02:00
Borislav Petkov (AMD)
bf514327c3 x86/setup: Parse the builtin command line before merging
Commit in Fixes was added as a catch-all for cases where the cmdline is
parsed before being merged with the builtin one.

And promptly one issue appeared, see Link below. The microcode loader
really needs to parse it that early, but the merging happens later.

Reshuffling the early boot nightmare^W code to handle that properly would
be a painful exercise for another day so do the chicken thing and parse the
builtin cmdline too before it has been merged.

Fixes: 0c40b1c7a8 ("x86/setup: Warn when option parsing is done too early")
Reported-by: Mike Lothian <mike@fireburn.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240730152108.GAZqkE5Dfi9AuKllRw@fat_crate.local
Link: https://lore.kernel.org/r/20240722152330.GCZp55ck8E_FT4kPnC@fat_crate.local
2024-07-31 21:46:35 +02:00
Feng Tang
b4bac27931 x86/tsc: Use topology_max_packages() to get package number
Commit b50db7095f ("x86/tsc: Disable clocksource watchdog for TSC on
qualified platorms") was introduced to solve problem that sometimes TSC
clocksource is wrongly judged as unstable by watchdog like 'jiffies', HPET,
etc.

In it, the hardware package number is a key factor for judging whether to
disable the watchdog for TSC, and 'nr_online_nodes' was chosen due to, at
that time (kernel v5.1x), it is available in early boot phase before
registering 'tsc-early' clocksource, where all non-boot CPUs are not
brought up yet.

Dave and Rui pointed out there are many cases in which 'nr_online_nodes'
is cheated and not accurate, like:

 * SNC (sub-numa cluster) mode enabled
 * numa emulation (numa=fake=8 etc.)
 * numa=off
 * platforms with CPU-less HBM nodes, CPU-less Optane memory nodes.
 * 'maxcpus=' cmdline setup, where chopped CPUs could be onlined later
 * 'nr_cpus=', 'possible_cpus=' cmdline setup, where chopped CPUs can
   not be onlined after boot

The SNC case is the most user-visible case, as many CSP (Cloud Service
Provider) enable this feature in their server fleets. When SNC3 enabled, a
2 socket machine will appear to have 6 NUMA nodes, and get impacted by the
issue in reality.

Thomas' recent patchset of refactoring x86 topology code improves
topology_max_packages() greatly, by making it more accurate and available
in early boot phase, which works well in most of the above cases.

The only exceptions are 'nr_cpus=' and 'possible_cpus=' setup, which may
under-estimate the package number. As during topology setup, the boot CPU
iterates through all enumerated APIC IDs and either accepts or rejects the
APIC ID. For accepted IDs, it figures out which bits of the ID map to the
package number.  It tracks which package numbers have been seen in a
bitmap.  topology_max_packages() just returns the number of bits set in
that bitmap.

'nr_cpus=' and 'possible_cpus=' can cause more APIC IDs to be rejected and
can artificially lower the number of bits in the package bitmap and thus
topology_max_packages().  This means that, for example, a system with 8
physical packages might reject all the CPUs on 6 of those packages and be
left with only 2 packages and 2 bits set in the package bitmap. It needs
the TSC watchdog, but would disable it anyway.  This isn't ideal, but it
only happens for debug-oriented options. This is fixable by tracking the
package numbers for rejected CPUs.  But it's not worth the trouble for
debugging.

So use topology_max_packages() to replace nr_online_nodes().

Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/all/20240729021202.180955-1-feng.tang@intel.com
Closes: https://lore.kernel.org/lkml/a4860054-0f16-6513-f121-501048431086@intel.com/
2024-07-31 21:12:09 +02:00
Perry Yuan
bf5641eccf x86/CPU/AMD: Add models 0x60-0x6f to the Zen5 range
Add some new Zen5 models for the 0x1A family.

  [ bp: Merge the 0x60 and 0x70 ranges. ]

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240729064626.24297-1-bp@kernel.org
2024-07-30 15:22:52 +02:00
Breno Leitao
225f2bd064 x86/bugs: Add a separate config for GDS
Currently, the CONFIG_SPECULATION_MITIGATIONS is halfway populated, where some
mitigations have entries in Kconfig, and they could be modified, while others
mitigations do not have Kconfig entries, and could not be controlled at build
time.

Create a new kernel config that allows GDS to be completely disabled,
similarly to the "gather_data_sampling=off" or "mitigations=off" kernel
command-line.

Now, there are two options for GDS mitigation:

* CONFIG_MITIGATION_GDS=n -> Mitigation disabled (New)
* CONFIG_MITIGATION_GDS=y -> Mitigation enabled (GDS_MITIGATION_FULL)

Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-12-leitao@debian.org
2024-07-30 14:54:15 +02:00
Breno Leitao
03267a534b x86/bugs: Remove GDS Force Kconfig option
Remove the MITIGATION_GDS_FORCE Kconfig option, which aggressively disables
AVX as a mitigation for Gather Data Sampling (GDS) vulnerabilities. This
option is not widely used by distros.

While removing the Kconfig option, retain the runtime configuration ability
through the `gather_data_sampling=force` kernel parameter. This allows users
to still enable this aggressive mitigation if needed, without baking it into
the kernel configuration.

Simplify the kernel configuration while maintaining flexibility for runtime
mitigation choices.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Link: https://lore.kernel.org/r/20240729164105.554296-11-leitao@debian.org
2024-07-30 14:53:15 +02:00
Breno Leitao
b908cdab06 x86/bugs: Add a separate config for SSB
Currently, the CONFIG_SPECULATION_MITIGATIONS is halfway populated,
where some mitigations have entries in Kconfig, and they could be
modified, while others mitigations do not have Kconfig entries, and
could not be controlled at build time.

Create an entry for the SSB CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allow users to enable or disable
it at compilation time.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-10-leitao@debian.org
2024-07-30 14:51:45 +02:00
Breno Leitao
72c70f480a x86/bugs: Add a separate config for Spectre V2
Currently, the CONFIG_SPECULATION_MITIGATIONS is halfway populated,
where some mitigations have entries in Kconfig, and they could be
modified, while others mitigations do not have Kconfig entries, and
could not be controlled at build time.

Create an entry for the Spectre V2 CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allow users to enable or disable
it at compilation time.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-9-leitao@debian.org
2024-07-30 14:51:11 +02:00
Breno Leitao
a0b02e3fe3 x86/bugs: Add a separate config for SRBDS
Currently, the CONFIG_SPECULATION_MITIGATIONS is halfway populated,
where some mitigations have entries in Kconfig, and they could be
modified, while others mitigations do not have Kconfig entries, and
could not be controlled at build time.

Create an entry for the SRBDS CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allow users to enable or disable
it at compilation time.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-8-leitao@debian.org
2024-07-30 14:49:53 +02:00
Breno Leitao
ca01c0d8d0 x86/bugs: Add a separate config for Spectre v1
Currently, the CONFIG_SPECULATION_MITIGATIONS is halfway populated,
where some mitigations have entries in Kconfig, and they could be
modified, while others mitigations do not have Kconfig entries, and
could not be controlled at build time.

Create an entry for the Spectre v1 CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allow users to enable or disable
it at compilation time.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-7-leitao@debian.org
2024-07-30 14:49:28 +02:00
Breno Leitao
894e28857c x86/bugs: Add a separate config for RETBLEED
Currently, the CONFIG_SPECULATION_MITIGATIONS is halfway populated,
where some mitigations have entries in Kconfig, and they could be
modified, while others mitigations do not have Kconfig entries, and
could not be controlled at build time.

Create an entry for the RETBLEED CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allow users to enable or disable
it at compilation time.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-6-leitao@debian.org
2024-07-30 14:48:54 +02:00
Breno Leitao
3a4ee4ff81 x86/bugs: Add a separate config for L1TF
Currently, the CONFIG_SPECULATION_MITIGATIONS is halfway populated,
where some mitigations have entries in Kconfig, and they could be
modified, while others mitigations do not have Kconfig entries, and
could not be controlled at build time.

Create an entry for the L1TF CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allow users to enable or disable
it at compilation time.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-5-leitao@debian.org
2024-07-30 11:23:17 +02:00
Breno Leitao
163f9fe6b6 x86/bugs: Add a separate config for MMIO Stable Data
Currently, the CONFIG_SPECULATION_MITIGATIONS is halfway populated,
where some mitigations have entries in Kconfig, and they could be
modified, while others mitigations do not have Kconfig entries, and
could not be controlled at build time.

Create an entry for the MMIO Stale data CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allow users to enable or disable
it at compilation time.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-4-leitao@debian.org
2024-07-30 10:56:20 +02:00
Breno Leitao
b8da0b33d3 x86/bugs: Add a separate config for TAA
Currently, the CONFIG_SPECULATION_MITIGATIONS is halfway populated,
where some mitigations have entries in Kconfig, and they could be
modified, while others mitigations do not have Kconfig entries, and
could not be controlled at build time.

Create an entry for the TAA CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allow users to enable or disable
it at compilation time.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-3-leitao@debian.org
2024-07-30 10:36:16 +02:00
Breno Leitao
940455681d x86/bugs: Add a separate config for MDS
Currently, the CONFIG_SPECULATION_MITIGATIONS is halfway populated,
where some mitigations have entries in Kconfig, and they could be
modified, while others mitigations do not have Kconfig entries, and
could not be controlled at build time.

Create an entry for the MDS CPU mitigation under
CONFIG_SPECULATION_MITIGATIONS. This allow users to enable or disable
it at compilation time.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240729164105.554296-2-leitao@debian.org
2024-07-30 10:17:36 +02:00
Borislav Petkov (AMD)
5343558a86 x86/microcode/AMD: Fix a -Wsometimes-uninitialized clang false positive
Initialize equiv_id in order to shut up:

  arch/x86/kernel/cpu/microcode/amd.c:714:6: warning: variable 'equiv_id' is \
  used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
        if (x86_family(bsp_cpuid_1_eax) < 0x17) {
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

because clang doesn't do interprocedural analysis for warnings to see
that this variable won't be used uninitialized.

Fixes: 94838d230a ("x86/microcode/AMD: Use the family,model,stepping encoded in the patch ID")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407291815.gJBST0P3-lkp@intel.com/
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-07-30 09:52:43 +02:00
Jonathan Cameron
0f7ced7d62 x86/aperfmperf: Fix deadlock on cpu_hotplug_lock
The broken patch results in a call to init_freq_invariance_cppc() in a CPU
hotplug handler in both the path for initially present CPUs and those
hotplugged later.  That function includes a one time call to
amd_set_max_freq_ratio() which in turn calls freq_invariance_enable() that has
a static_branch_enable() which takes the cpu_hotlug_lock which is already
held.

Avoid the deadlock by using static_branch_enable_cpuslocked() as the lock will
always be already held.  The equivalent path on Intel does not already hold
this lock, so take it around the call to freq_invariance_enable(), which
results in it being held over the call to register_syscall_ops, which looks to
be safe to do.

Fixes: c1385c1f0b ("ACPI: processor: Simplify initial onlining to use same path for cold and hotplug")
Closes: https://lore.kernel.org/all/CABXGCsPvqBfL5hQDOARwfqasLRJ_eNPBbCngZ257HOe=xbWDkA@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240729105504.2170-1-Jonathan.Cameron@huawei.com
2024-07-29 15:32:37 +02:00
Vignesh Balasubramanian
ba386777a3 x86/elf: Add a new FPU buffer layout info to x86 core files
Add a new .note section containing type, size, offset and flags of every
xfeature that is present.

This information will be used by debuggers to understand the XSAVE layout of
the machine where the core file has been dumped, and to read XSAVE registers,
especially during cross-platform debugging.

The XSAVE layouts of modern AMD and Intel CPUs differ, especially since
Memory Protection Keys and the AVX-512 features have been inculcated into
the AMD CPUs.

Since AMD never adopted (and hence never left room in the XSAVE layout for)
the Intel MPX feature, tools like GDB had assumed a fixed XSAVE layout
matching that of Intel (based on the XCR0 mask).

Hence, core dumps from AMD CPUs didn't match the known size for the XCR0 mask.
This resulted in GDB and other tools not being able to access the values of
the AVX-512 and PKRU registers on AMD CPUs.

To solve this, an interim solution has been accepted into GDB, and is already
a part of GDB 14, see

  https://sourceware.org/pipermail/gdb-patches/2023-March/198081.html.

But it depends on heuristics based on the total XSAVE register set size
and the XCR0 mask to infer the layouts of the various register blocks
for core dumps, and hence, is not a foolproof mechanism to determine the
layout of the XSAVE area.

Therefore, add a new core dump note in order to allow GDB/LLDB and other
relevant tools to determine the layout of the XSAVE area of the machine where
the corefile was dumped.

The new core dump note (which is being proposed as a per-process .note
section), NT_X86_XSAVE_LAYOUT (0x205) contains an array of structures.

Each structure describes an individual extended feature containing
offset, size and flags in this format:

  struct x86_xfeat_component {
         u32 type;
         u32 size;
         u32 offset;
         u32 flags;
  };

and in an independent manner, allowing for future extensions without depending
on hw arch specifics like CPUID etc.

  [ bp: Massage commit message, zap trailing whitespace. ]

Co-developed-by: Jini Susan George <jinisusan.george@amd.com>
Signed-off-by: Jini Susan George <jinisusan.george@amd.com>
Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Vignesh Balasubramanian <vigbalas@amd.com>
Link: https://lore.kernel.org/r/20240725161017.112111-2-vigbalas@amd.com
2024-07-29 10:45:43 +02:00
Borislav Petkov
94838d230a x86/microcode/AMD: Use the family,model,stepping encoded in the patch ID
On Zen and newer, the family, model and stepping is part of the
microcode patch ID so that the equivalence table the driver has been
using, is not needed anymore.

So switch the driver to use that from now on.

The equivalence table in the microcode blob should still remain in case
there's need to pass some additional information to the kernel loader.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240725112037.GBZqI1BbUk1KMlOJ_D@fat_crate.local
2024-07-29 09:21:26 +02:00
Shyam Sundar S K
59c34008d3 x86/amd_nb: Add new PCI IDs for AMD family 1Ah model 60h
Add new PCI device IDs into the root IDs and miscellaneous IDs lists to
provide support for the latest generation of AMD 1Ah family 60h processor
models.

Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20240722092801.3480266-1-Shyam-sundar.S-k@amd.com
2024-07-29 08:53:49 +02:00
Joel Granados
78eb4ea25c sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.

This patch has been generated by the following coccinelle script:

```
  virtual patch

  @r1@
  identifier ctl, write, buffer, lenp, ppos;
  identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

  @r2@
  identifier func, ctl, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos)
  { ... }

  @r3@
  identifier func;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int , void *, size_t *, loff_t *);

  @r4@
  identifier func, ctl;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int , void *, size_t *, loff_t *);

  @r5@
  identifier func, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

```

* Code formatting was adjusted in xfs_sysctl.c to comply with code
  conventions. The xfs_stats_clear_proc_handler,
  xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
  adjusted.

* The ctl_table argument in proc_watchdog_common was const qualified.
  This is called from a proc_handler itself and is calling back into
  another proc_handler, making it necessary to change it as part of the
  proc_handler migration.

Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-07-24 20:59:29 +02:00
Linus Torvalds
91bd008d4e Probes updates for v6.11:
Uprobes:
 - x86/shstk: Make return uprobe work with shadow stack.
 - Add uretprobe syscall which speeds up the uretprobe 10-30% faster. This
   syscall is automatically used from user-space trampolines which are
   generated by the uretprobe. If this syscall is used by normal
   user program, it will cause SIGILL. Note that this is currently only
   implemented on x86_64.
   (This also has 2 fixes for adjusting the syscall number to avoid conflict
    with new *attrat syscalls.)
 - uprobes/perf: fix user stack traces in the presence of pending uretprobe.
   This corrects the uretprobe's trampoline address in the stacktrace with
   correct return address.
 - selftests/x86: Add a return uprobe with shadow stack test.
 - selftests/bpf: Add uretprobe syscall related tests.
   . test case for register integrity check.
   . test case with register changing case.
   . test case for uretprobe syscall without uprobes (expected to be failed).
   . test case for uretprobe with shadow stack.
 - selftests/bpf: add test validating uprobe/uretprobe stack traces
 - MAINTAINERS: Add uprobes entry. This does not specify the tree but to
   clarify who maintains and reviews the uprobes.
 
 Kprobes:
 - tracing/kprobes: Test case cleanups. Replace redundant WARN_ON_ONCE() +
   pr_warn() with WARN_ONCE() and remove unnecessary code from selftest.
 - tracing/kprobes: Add symbol counting check when module loads. This
   checks the uniqueness of the probed symbol on modules. The same check
   has already done for kernel symbols.
   (This also has a fix for build error with CONFIG_MODULES=n)
 
 Cleanup:
 - Add MODULE_DESCRIPTION() macros for fprobe and kprobe examples.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmaWYxwbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bsUgH/3JcSzDZujQWCZ1f4fJn
 QecvTFSYcCl6ck8+/3wm4EsgeCXIFOyPnoPc7k2Gm+l6Dlk1DKGV6wV4tuKFUq9X
 9mplcwoVA0Ln+EX9zv9v4s99yUGxcU9xjgC9XT7J52SvqYncPIi6dR0Z9wlJBmyd
 Bx3cZk+wSzCYaoqYngI2fKlzsEcYgDIP999fQPRi0HGzNZujc4xeJyjCTC/48yWO
 9kreRQq6wFdgRQTwMcR/fKPDKIGZQCU8jkXv5crVV5K3rNaBcwBmCJJMP8PzPU0V
 UQ0+8RZK+Qk8SBwXcMNVRqm/efTderob4IYxP8OBe5wjAIE7+vu8r6sqwxRIS54M
 Cyg=
 =DRSr
 -----END PGP SIGNATURE-----

Merge tag 'probes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:
 "Uprobes:

   - x86/shstk: Make return uprobe work with shadow stack

   - Add uretprobe syscall which speeds up the uretprobe 10-30% faster.
     This syscall is automatically used from user-space trampolines
     which are generated by the uretprobe. If this syscall is used by
     normal user program, it will cause SIGILL. Note that this is
     currently only implemented on x86_64.

     (This also has two fixes for adjusting the syscall number to avoid
     conflict with new *attrat syscalls.)

   - uprobes/perf: fix user stack traces in the presence of pending
     uretprobe. This corrects the uretprobe's trampoline address in the
     stacktrace with correct return address

   - selftests/x86: Add a return uprobe with shadow stack test

   - selftests/bpf: Add uretprobe syscall related tests.
      - test case for register integrity check
      - test case with register changing case
      - test case for uretprobe syscall without uprobes (expected to fail)
      - test case for uretprobe with shadow stack

   - selftests/bpf: add test validating uprobe/uretprobe stack traces

   - MAINTAINERS: Add uprobes entry. This does not specify the tree but
     to clarify who maintains and reviews the uprobes

  Kprobes:

   - tracing/kprobes: Test case cleanups.

     Replace redundant WARN_ON_ONCE() + pr_warn() with WARN_ONCE() and
     remove unnecessary code from selftest

   - tracing/kprobes: Add symbol counting check when module loads.

     This checks the uniqueness of the probed symbol on modules. The
     same check has already done for kernel symbols

     (This also has a fix for build error with CONFIG_MODULES=n)

  Cleanup:

   - Add MODULE_DESCRIPTION() macros for fprobe and kprobe examples"

* tag 'probes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  MAINTAINERS: Add uprobes entry
  selftests/bpf: Change uretprobe syscall number in uprobe_syscall test
  uprobe: Change uretprobe syscall scope and number
  tracing/kprobes: Fix build error when find_module() is not available
  tracing/kprobes: Add symbol counting check when module loads
  selftests/bpf: add test validating uprobe/uretprobe stack traces
  perf,uprobes: fix user stack traces in the presence of pending uretprobes
  tracing/kprobe: Remove cleanup code unrelated to selftest
  tracing/kprobe: Integrate test warnings into WARN_ONCE
  selftests/bpf: Add uretprobe shadow stack test
  selftests/bpf: Add uretprobe syscall call from user space test
  selftests/bpf: Add uretprobe syscall test for regs changes
  selftests/bpf: Add uretprobe syscall test for regs integrity
  selftests/x86: Add return uprobe shadow stack test
  uprobe: Add uretprobe syscall to speed up return probe
  uprobe: Wire up uretprobe system call
  x86/shstk: Make return uprobe work with shadow stack
  samples: kprobes: add missing MODULE_DESCRIPTION() macros
  fprobe: add missing MODULE_DESCRIPTION() macro
2024-07-18 12:19:20 -07:00
Linus Torvalds
b3ce7a3084 drm next for 6.11-rc1:
core:
 - deprecate DRM data and return 0 date
 - connector: Create a set of helpers to help with HDMI support
 - Remove driver owner assignments
 - Allow more drivers to compile with COMPILE_TEST
 - Conversions to drm_edid
 - Sprinkle MODULE_DESCRIPTIONS everywhere they are missing
 - Remove drm_mm_replace_node
 - print: Add a drm prefix to warn level messages too, remove
          ___drm_dbg, consolidate prefix handling
 - New monochrome TV mode variant
 
 ttm:
 - improve number of page faults on some platforms
 - fix test builds under PREEMPT_RT
 - more test coverage
 
 ci:
 - Require a more recent version of mesa,
 - improve farm setup and test generation
 
 dma-buf:
 - warn if reserving 0 fence slots
 - internal API heap enhancements
 
 fbdev:
 - Create memory manager optimized fbdev emulation
 
 panic:
 - Allow to select fonts,
 - improve drm_fb_dma_get_scanout_buffer
 - Allow to dump kmsg to the screen
 
 bridge:
 - Remove redundant checks on bridge->encoder
 - Remove drm_bridge_chain_mode_fixup
 - bridge-connector: Plumb in the new HDMI helper
 - analogix_dp: Various improvements, handle AUX transfers timeout
 - samsung-dsim: Fix timings calculation
 - tc358767: Plenty of small fixes, fix no connector attach, fix clocks
 - sii902x: state validation improvements
 
 panels:
 - Switch panels from register table initialization to proper code
 - Now that the panel code tracks the panel state, remove every
   ad-hoc implementation in the panel drivers
 - More cleanup of prepare / enable state tracking in drivers
 - edp: Drop legacy panel compatibles
 - simple-bridge: Switch to devm_drm_bridge_add
 - New panels: Lincoln Tech Sol LCD185-101CT, Microtips Technology
   13-101HIEBCAF0-C, Microtips Technology MF-103HIEB0GA0, BOE
   nv110wum-l60, IVO t109nw41, WL-355608-A8, PrimeView PM070WL4,
   Lincoln Technologies LCD197, Ortustech COM35H3P70ULC,
   AUO G104STN01, K&d kd101ne3-40ti
 
 amdgpu:
 - DCN 4.0.x support
 - GC 12.0 support
 - GMC 12.0 support
 - SDMA 7.0 support
 - MES12 support
 - MMHUB 4.1 support
 - GFX12 modifier and DCC support
 - lots of IP fixes/updates
 
 amdkfd:
 - Contiguous VRAM allocations
 - GC 12.0 support
 - SDMA 7.0 support
 - SR-IOV fixes
 - KFD GFX ALU exceptions
 
 i915:
 - Battlemage Xe2 HPD display enablement
 - Panel Replay enabling
 - DP AUX-less ALPM/LOBF
 - Enable link training failure fallback for DP MST links
 - CMRR (Content Match Refresh Rate) enabling
 - Increase ADL-S/ADL-P/DG2+ max TMDS bitrate to 6 Gbps
 - Enable eDP AUX based HDR backlight
 - Support replaying GPU hangs with captured context image
 - Automate CCS Mode setting during engine resets
 - lots of refactoring
 - Support replaying GPU hangs with captured context image
 - Increase FLR timeout from 3s to 9s
 - Enable w/a 16021333562 for DG2, MTL and ARL [guc]
 
 xe:
 - update MAINATINERS
 - New uapi adding OA functionality to Xe
 - expose l3 bank mask
 - fix display detect on ADL-N
 - runtime PM Fixes
 - Fix silent backmerge issues
 - More prep for SR-IOV
 - HWmon additions
 - per client usage info
 - Rework GPU page fault handling
 - Drop EXEC_QUEUE_FLAG_BANNED
 - Add BMG PCI IDs
 - Scheduler fixes and improvements
 - Rename xe_exec_queue::compute to xe_exec_queue::lr
 - Use ttm_uncached for BO with NEEDS_UC flag
 - Rename xe perf layer as xe observation layer
 - lots of refactoring
 
 radeon:
 - Backlight workaround for iMac
 - Silence UBSAN flex array warnings
 
 msm:
 - Validate registers XML description against schema in CI
 - core/dpu: SM7150 support
 - mdp5: Add support for MSM8937
 - gpu: Add param for userspace to know if raytracing is supported
 - gpu: X185 support (aka gpu in X1 laptop chips)
 - gpu: a505 support
 
 ivpu:
 - hardware scheduler support
 - profiling support
 - improvements to the platform support layer
 - firmware handling improvements
 - clocks/power mgmt improvements
 - scheduler/logging improvements
 
 habanalabs:
 - Gradual sleep in polling memory macro.
 - Reduce Gaudi2 MSI-X interrupt count to 128.
 - Add Gaudi2-D revision support.
 - Add timestamp to CPLD info.
 - Gaudi2: Assume hard-reset by firmware upon MC SEI severe error.
 - Align Gaudi2 interrupt names.
 - Check for errors after preboot is ready.
 - Change habanalabs maintainer and git repo path.
 
 mgag200:
 - refactoring and improvements
 - Add BMC output
 - enable polling
 
 nouveau:
 - add registry command line
 
 v3d:
 - perf counters improvements
 
 zynqmp:
 - irq and debugfs improvements
 
 atmel-hlcdc:
 - Support XLCDC in sam9x7
 
 mipi-dbi:
 - Remove mipi_dbi_machine_little_endian
 - make SPI bits per word configurable
 - support RGB888
 - allow pixel formats to be specified in the DT
 
 sun4i:
 - Rework the blender setup for DE2
 
 panfrost:
 - Enable MT8188 support
 
 vc4:
 - Monochrome TV support
 
 exynos:
 - fix fallback mode regression
 - fix memory leak
 - Use drm_edid_duplicate() instead of kmemdup()
 
 etnaviv:
 - fix i.MX8MP NPU clock gating
 - workaround FE register cdc issues on some cores
 - fix DMA sync handling for cached buffers
 - fix job timeout handling
 - keep TS enabled on MMUv2 cores for improved performance
 
 mediatek:
 - Convert to platform remove callback returning void-
 - Drop chain_mode_fixup call in mode_valid()
 - Fixes the errors of MediaTek display driver found by IGT.
 - Add display support for the MT8365-EVK board
 - Fix bit depth overwritten for mtk_ovl_set bit_depth()
 - Fix possible_crtcs calculation
 - Fix spurious kfree()
 
 ast:
 - refactor mode setting code
 
 stm:
 - Add LVDS support
 - DSI PHY updates
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmaYqVEACgkQDHTzWXnE
 hr5p3Q/+OOxTHKJ/8WMwfV1Tuep5otkCZdBgNdcuu9zqzpEMEDUDwmV1iboIvT9x
 qJsDwSAJomwbZAnVjDKsbZuycSHUBV6HQdf+5+rtq6be1EfFRwJVzOq0u5+D3KGt
 7f2vy6sM9tw4tR6EikiuP7vCvnSz4iGrWERvEJDEtXECbALhju8sulht8ZMnr6GW
 /MfUetULLSDjq0L1x3TWAq2MPGnJ5UxIkIeOBUP6n4etAUX1BPTNA6N76eN/xMvn
 a40JhtM+pCjjkHxvloIZ+KTYN3S+hskIRksczPHh9HtNX7y/A437wyhOHJZ1NvZb
 yc5ke9GjXxGcxyZH+PY5aCS7O/XElzSSkR1jFZ2s3/MX7PVKgCahGK7+yWjPsiK2
 R5oXebdObshUa8LHDE/3WgBUmTchkvKRTXV9cvGqzxEPhC2zrxArvwP5v6B4mhCn
 Vqo3Pv0Cyr+n65Z5Dzqz/9+m999LJjFTsTrug0p5b/qBJQKu2rQONe4lpZ0NFwwY
 ExyjdxILj7mqrQpKcA6V5Bel5ZCnlVsGfTshFL6Iux54VFlJyRMzKWZ+Gdv4av5k
 dbjz+re+CojKabn3ML/7pAQujK6Rqe58vPuHV78zkvAGJnQgJOOTrmYNYtn3oBqe
 ogdCN+/PREb/9U7i6mQv5hhdHs4tT9ROXaT9jyb8XSHXW+t9lBM=
 =g+Ad
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2024-07-18' of https://gitlab.freedesktop.org/drm/kernel

Pull drm updates from Dave Airlie:
 "There's a lot of stuff in here, amd, i915 and xe have new platform
  work, lots of core rework around EDID handling, some new COMPILE_TEST
  options, maintainer changes and a lots of other stuff. Summary:

  core:
   - deprecate DRM data and return 0 date
   - connector: Create a set of helpers to help with HDMI support
   - Remove driver owner assignments
   - Allow more drivers to compile with COMPILE_TEST
   - Conversions to drm_edid
   - Sprinkle MODULE_DESCRIPTIONS everywhere they are missing
   - Remove drm_mm_replace_node
   - print: Add a drm prefix to warn level messages too, remove
            ___drm_dbg, consolidate prefix handling
   - New monochrome TV mode variant

  ttm:
   - improve number of page faults on some platforms
   - fix test builds under PREEMPT_RT
   - more test coverage

  ci:
   - Require a more recent version of mesa
   - improve farm setup and test generation

  dma-buf:
   - warn if reserving 0 fence slots
   - internal API heap enhancements

  fbdev:
   - Create memory manager optimized fbdev emulation

  panic:
   - Allow to select fonts
   - improve drm_fb_dma_get_scanout_buffer
   - Allow to dump kmsg to the screen

  bridge:
   - Remove redundant checks on bridge->encoder
   - Remove drm_bridge_chain_mode_fixup
   - bridge-connector: Plumb in the new HDMI helper
   - analogix_dp: Various improvements, handle AUX transfers timeout
   - samsung-dsim: Fix timings calculation
   - tc358767: Plenty of small fixes, fix no connector attach, fix
               clocks
   - sii902x: state validation improvements

  panels:
   - Switch panels from register table initialization to proper code
   - Now that the panel code tracks the panel state, remove every ad-hoc
     implementation in the panel drivers
   - More cleanup of prepare / enable state tracking in drivers
   - edp: Drop legacy panel compatibles
   - simple-bridge: Switch to devm_drm_bridge_add
   - New panels: Lincoln Tech Sol LCD185-101CT, Microtips Technology
                 13-101HIEBCAF0-C, Microtips Technology MF-103HIEB0GA0,
                 BOE nv110wum-l60, IVO t109nw41, WL-355608-A8, PrimeView
                 PM070WL4, Lincoln Technologies LCD197, Ortustech
                 COM35H3P70ULC, AUO G104STN01, K&d kd101ne3-40ti

  amdgpu:
   - DCN 4.0.x support
   - GC 12.0 support
   - GMC 12.0 support
   - SDMA 7.0 support
   - MES12 support
   - MMHUB 4.1 support
   - GFX12 modifier and DCC support
   - lots of IP fixes/updates

  amdkfd:
   - Contiguous VRAM allocations
   - GC 12.0 support
   - SDMA 7.0 support
   - SR-IOV fixes
   - KFD GFX ALU exceptions

  i915:
   - Battlemage Xe2 HPD display enablement
   - Panel Replay enabling
   - DP AUX-less ALPM/LOBF
   - Enable link training failure fallback for DP MST links
   - CMRR (Content Match Refresh Rate) enabling
   - Increase ADL-S/ADL-P/DG2+ max TMDS bitrate to 6 Gbps
   - Enable eDP AUX based HDR backlight
   - Support replaying GPU hangs with captured context image
   - Automate CCS Mode setting during engine resets
   - lots of refactoring
   - Support replaying GPU hangs with captured context image
   - Increase FLR timeout from 3s to 9s
   - Enable w/a 16021333562 for DG2, MTL and ARL [guc]

  xe:
   - update MAINATINERS
   - New uapi adding OA functionality to Xe
   - expose l3 bank mask
   - fix display detect on ADL-N
   - runtime PM Fixes
   - Fix silent backmerge issues
   - More prep for SR-IOV
   - HWmon additions
   - per client usage info
   - Rework GPU page fault handling
   - Drop EXEC_QUEUE_FLAG_BANNED
   - Add BMG PCI IDs
   - Scheduler fixes and improvements
   - Rename xe_exec_queue::compute to xe_exec_queue::lr
   - Use ttm_uncached for BO with NEEDS_UC flag
   - Rename xe perf layer as xe observation layer
   - lots of refactoring

  radeon:
   - Backlight workaround for iMac
   - Silence UBSAN flex array warnings

  msm:
   - Validate registers XML description against schema in CI
   - core/dpu: SM7150 support
   - mdp5: Add support for MSM8937
   - gpu: Add param for userspace to know if raytracing is supported
   - gpu: X185 support (aka gpu in X1 laptop chips)
   - gpu: a505 support

  ivpu:
   - hardware scheduler support
   - profiling support
   - improvements to the platform support layer
   - firmware handling improvements
   - clocks/power mgmt improvements
   - scheduler/logging improvements

  habanalabs:
   - Gradual sleep in polling memory macro
   - Reduce Gaudi2 MSI-X interrupt count to 128
   - Add Gaudi2-D revision support
   - Add timestamp to CPLD info
   - Gaudi2: Assume hard-reset by firmware upon MC SEI severe error
   - Align Gaudi2 interrupt names
   - Check for errors after preboot is ready
   - Change habanalabs maintainer and git repo path

  mgag200:
   - refactoring and improvements
   - Add BMC output
   - enable polling

  nouveau:
   - add registry command line

  v3d:
   - perf counters improvements

  zynqmp:
   - irq and debugfs improvements

  atmel-hlcdc:
   - Support XLCDC in sam9x7

  mipi-dbi:
   - Remove mipi_dbi_machine_little_endian
   - make SPI bits per word configurable
   - support RGB888
   - allow pixel formats to be specified in the DT

  sun4i:
   - Rework the blender setup for DE2

  panfrost:
   - Enable MT8188 support

  vc4:
   - Monochrome TV support

  exynos:
   - fix fallback mode regression
   - fix memory leak
   - Use drm_edid_duplicate() instead of kmemdup()

  etnaviv:
   - fix i.MX8MP NPU clock gating
   - workaround FE register cdc issues on some cores
   - fix DMA sync handling for cached buffers
   - fix job timeout handling
   - keep TS enabled on MMUv2 cores for improved performance

  mediatek:
   - Convert to platform remove callback returning void-
   - Drop chain_mode_fixup call in mode_valid()
   - Fixes the errors of MediaTek display driver found by IGT
   - Add display support for the MT8365-EVK board
   - Fix bit depth overwritten for mtk_ovl_set bit_depth()
   - Fix possible_crtcs calculation
   - Fix spurious kfree()

  ast:
   - refactor mode setting code

  stm:
   - Add LVDS support
   - DSI PHY updates"

* tag 'drm-next-2024-07-18' of https://gitlab.freedesktop.org/drm/kernel: (2501 commits)
  drm/amdgpu/mes12: add missing opcode string
  drm/amdgpu/mes11: update opcode strings
  Revert "drm/amd/display: Reset freesync config before update new state"
  drm/omap: Restrict compile testing to PAGE_SIZE less than 64KB
  drm/xe: Drop trace_xe_hw_fence_free
  drm/xe/uapi: Rename xe perf layer as xe observation layer
  drm/amdgpu: remove exp hw support check for gfx12
  drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed
  drm/amdgpu: flush all cached ras bad pages to eeprom
  drm/amdgpu: select compute ME engines dynamically
  drm/amd/display: Allow display DCC for DCN401
  drm/amdgpu: select compute ME engines dynamically
  drm/amdgpu/job: Replace DRM_INFO/ERROR logging
  drm/amdgpu: select compute ME engines dynamically
  drm/amd/pm: Ignore initial value in smu response register
  drm/amdgpu: Initialize VF partition mode
  drm/amd/amdgpu: fix SDMA IRQ client ID <-> req mapping
  MAINTAINERS: fix Xinhui's name
  MAINTAINERS: update powerplay and swsmu
  drm/qxl: Pin buffer objects for internal mappings
  ...
2024-07-18 09:34:02 -07:00
Linus Torvalds
41906248d0 Power management updates for 6.11-rc1
- Add Loongson-3 CPUFreq driver support (Huacai Chen).
 
  - Add support for the Arrow Lake and Lunar Lake platforms and
    the out-of-band (OOB) mode on Emerald Rapids to the intel_pstate
    cpufreq driver, make it support the highest performance change
    interrupt and clean it up (Srinivas Pandruvada).
 
  - Switch cpufreq to new Intel CPU model defines (Tony Luck).
 
  - Simplify the cpufreq driver interface by switching the .exit() driver
    callback to the void return data type (Lizhe, Viresh Kumar).
 
  - Make cpufreq_boost_enabled() return bool (Dhruva Gole).
 
  - Add fast CPPC support to the amd-pstate cpufreq driver, address
    multiple assorted issues in it and clean it up (Perry Yuan, Mario
    Limonciello, Dhananjay Ugwekar, Meng Li, Xiaojian Du).
 
  - Add Allwinner H700 speed bin to the sun50i cpufreq driver (Ryan
    Walklin).
 
  - Fix memory leaks and of_node_put() usage in the sun50i and qcom-nvmem
    cpufreq drivers (Javier Carrasco).
 
  - Clean up the sti and dt-platdev cpufreq drivers (Jeff Johnson,
    Raphael Gallais-Pou).
 
  - Fix deferred probe handling in the TI cpufreq driver and wrong return
    values of ti_opp_supply_probe(), and add OPP tables for the AM62Ax and
    AM62Px SoCs to it (Bryan Brattlof, Primoz Fiser).
 
  - Avoid overflow of target_freq in .fast_switch() in the SCMI cpufreq
    driver (Jagadeesh Kona).
 
  - Use dev_err_probe() in every error path in probe in the Mediatek
    cpufreq driver (Nícolas Prado).
 
  - Fix kernel-doc param for longhaul_setstate in the longhaul cpufreq
    driver (Yang Li).
 
  - Fix system resume handling in the CPPC cpufreq driver (Riwen Lu).
 
  - Improve the teo cpuidle governor and clean up leftover comments from
    the menu cpuidle governor (Christian Loehle).
 
  - Clean up a comment typo in the teo cpuidle governor (Atul Kumar
    Pant).
 
  - Add missing MODULE_DESCRIPTION() macro to cpuidle haltpoll (Jeff
    Johnson).
 
  - Switch the intel_idle driver to new Intel CPU model defines (Tony
    Luck).
 
  - Switch the Intel RAPL driver new Intel CPU model defines (Tony Luck).
 
  - Simplify if condition in the idle_inject driver (Thorsten Blum).
 
  - Fix missing cleanup on error in _opp_attach_genpd() (Viresh Kumar).
 
  - Introduce an OF helper function to inform if required-opps is used
    and drop a redundant in-parameter to _set_opp_level() (Ulf Hansson).
 
  - Update pm-graph to v5.12 which includes fixes and major code revamp
    for python3.12 (Todd Brandt).
 
  - Address several assorted issues in the cpupower utility (Roman
    Storozhenko).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmaVb+8SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxXIUQALFhNTO+wo8uPWUmsp0SV81Sbf17zM0f
 9IDpzJTUZLK0stTdLtxY4khcClPE4MrwS/LjSJlvkEVZChHpUw6vFezHmx0O42Ti
 Tmv3ezABSAmx6QVRSpyVhE3Hb0BmXW9V+3dtoefofV0JWenN7mqk4Hbb2Jx1Cvbh
 zyerUeWWl97yqVMM2l5owKHSvk7SYO6cfML73XcdXQ6pBfQePfekG87i1+r40l+d
 qEzdyh6JjqGbdkvZKtI4zO1Hdai9FdlLWSqYmVZGS5XRN8RVvDaHDIDlSijNXAei
 DFPFoBVAvl8CymBXXnzDyJJhCCkEb2aX3xD6WzthoCygZt5W+tqfGxyZfViBfb55
 kvpyiWZUVaDyX4Hfz1PLnJ7Xg9kPUKUcDDrsV5vKA7W0Sq2T0RbORsVkaP2nIhlY
 4Xspp9nEv+78DG0UjT7jT0Py2Oq9I6BTG+pmMTxcgA7G/U5H2uAvvIM/kwQ+30vi
 yUxO3W5o9TQmvJF1klHgp3YsCNWZG3IYacHZzUIoPbPusEbevYrCuUNriT+zlANc
 Pv/FMfBfHDmU2lHWyLzuoKhlzQosNi9NajMANBJgd55zACWKzgNzFV4P5gIMd1KR
 moJYfosbT2RWetEH8Zrh7xA5dewUphe6tibshElbKJHilnP0iFjYhhdb6aQRcuPd
 q/RECFYT7z0r
 =imBx
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These add a new cpufreq driver for Loongson-3, add support for new
  features in the intel_pstate (Lunar Lake and Arrow Lake platforms, OOB
  mode for Emerald Rapids, highest performance change interrupt),
  amd-pstate (fast CPPC) and sun50i (Allwinner H700 speed bin) cpufreq
  drivers, simplify the cpufreq driver interface, simplify the teo
  cpuidle governor, adjust the pm-graph utility for a new version of
  Python, address issues and clean up code.

  Specifics:

   - Add Loongson-3 CPUFreq driver support (Huacai Chen)

   - Add support for the Arrow Lake and Lunar Lake platforms and the
     out-of-band (OOB) mode on Emerald Rapids to the intel_pstate
     cpufreq driver, make it support the highest performance change
     interrupt and clean it up (Srinivas Pandruvada)

   - Switch cpufreq to new Intel CPU model defines (Tony Luck)

   - Simplify the cpufreq driver interface by switching the .exit()
     driver callback to the void return data type (Lizhe, Viresh Kumar)

   - Make cpufreq_boost_enabled() return bool (Dhruva Gole)

   - Add fast CPPC support to the amd-pstate cpufreq driver, address
     multiple assorted issues in it and clean it up (Perry Yuan, Mario
     Limonciello, Dhananjay Ugwekar, Meng Li, Xiaojian Du)

   - Add Allwinner H700 speed bin to the sun50i cpufreq driver (Ryan
     Walklin)

   - Fix memory leaks and of_node_put() usage in the sun50i and
     qcom-nvmem cpufreq drivers (Javier Carrasco)

   - Clean up the sti and dt-platdev cpufreq drivers (Jeff Johnson,
     Raphael Gallais-Pou)

   - Fix deferred probe handling in the TI cpufreq driver and wrong
     return values of ti_opp_supply_probe(), and add OPP tables for the
     AM62Ax and AM62Px SoCs to it (Bryan Brattlof, Primoz Fiser)

   - Avoid overflow of target_freq in .fast_switch() in the SCMI cpufreq
     driver (Jagadeesh Kona)

   - Use dev_err_probe() in every error path in probe in the Mediatek
     cpufreq driver (Nícolas Prado)

   - Fix kernel-doc param for longhaul_setstate in the longhaul cpufreq
     driver (Yang Li)

   - Fix system resume handling in the CPPC cpufreq driver (Riwen Lu)

   - Improve the teo cpuidle governor and clean up leftover comments
     from the menu cpuidle governor (Christian Loehle)

   - Clean up a comment typo in the teo cpuidle governor (Atul Kumar
     Pant)

   - Add missing MODULE_DESCRIPTION() macro to cpuidle haltpoll (Jeff
     Johnson)

   - Switch the intel_idle driver to new Intel CPU model defines (Tony
     Luck)

   - Switch the Intel RAPL driver new Intel CPU model defines (Tony
     Luck)

   - Simplify if condition in the idle_inject driver (Thorsten Blum)

   - Fix missing cleanup on error in _opp_attach_genpd() (Viresh Kumar)

   - Introduce an OF helper function to inform if required-opps is used
     and drop a redundant in-parameter to _set_opp_level() (Ulf Hansson)

   - Update pm-graph to v5.12 which includes fixes and major code revamp
     for python3.12 (Todd Brandt)

   - Address several assorted issues in the cpupower utility (Roman
     Storozhenko)"

* tag 'pm-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (77 commits)
  cpufreq: sti: fix build warning
  cpufreq: mediatek: Use dev_err_probe in every error path in probe
  cpufreq: Add Loongson-3 CPUFreq driver support
  cpufreq: Make cpufreq_driver->exit() return void
  cpufreq/amd-pstate: Fix the scaling_max_freq setting on shared memory CPPC systems
  cpufreq/amd-pstate-ut: Convert nominal_freq to khz during comparisons
  cpufreq: pcc: Remove empty exit() callback
  cpufreq: loongson2: Remove empty exit() callback
  cpufreq: nforce2: Remove empty exit() callback
  cpupower: fix lib default installation path
  cpufreq: docs: Add missing scaling_available_frequencies description
  cpuidle: teo: Don't count non-existent intercepts
  cpupower: Disable direct build of the 'bench' subproject
  cpuidle: teo: Remove recent intercepts metric
  Revert: "cpuidle: teo: Introduce util-awareness"
  cpufreq: make cpufreq_boost_enabled() return bool
  cpufreq: intel_pstate: Support highest performance change interrupt
  x86/cpufeatures: Add HWP highest perf change feature flag
  Documentation: cpufreq: amd-pstate: update doc for Per CPU boost control method
  cpufreq: amd-pstate: Cap the CPPC.max_perf to nominal_perf if CPB is off
  ...
2024-07-16 15:54:03 -07:00
Linus Torvalds
ce5a51bfac hardening updates for v6.11-rc1
- lkdtm/bugs: add test for hung smp_call_function_single() (Mark Rutland)
 
 - gcc-plugins: Remove duplicate included header file stringpool.h
   (Thorsten Blum)
 
 - ARM: Remove address checking for MMUless devices (Yanjun Yang)
 
 - randomize_kstack: Clean up per-arch entropy and codegen
 
 - KCFI: Make FineIBT mode Kconfig selectable
 
 - fortify: Do not special-case 0-sized destinations
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmaVT2IACgkQiXL039xt
 wCbq8A//RhxTdr+l/h2gyMy/Lcy/NMR9KEWklnxdftuM1V1Kzr53yeH/g6Ehw69g
 e8Ag3Sp7Fn4rNBVa+tY6RqzKwfrUHIbeewGI4LkRe19NDWFWc/Od+4tamfRSPf9c
 GL9ZnJZviRm3zByetwr4CbS69HocXFFSSgcpIv/7xOd+haSWWdvEc3KcSnavY/aq
 8wQPkZxzy8ESkOajZj2k0E2l9JP42Ex20qy0KcjweSSYVafKmbTxhKZgriwAKMCD
 Yj2m55fbD6D08vd0Y6S7H4TPilYtRbulXR9FNMtw59UpKeoUceEmyn4B43psDvau
 9XuJF/oFKrXBEJG+OUZogNu5L6uYUaNdYdtb43upu9lCsjrAjmMYfmXDHO2E40V8
 76MikxHtyFAPEzUwg/BH2CGUu9hil+FADd28s8zLuUBpRDitgYudQD+Cqrc34b6s
 QlAX19bX7KFgXqlsdwy6zJNSd3dpoMBVsP58/EhQQfiqv/ZU2TOryZenz0URlH+k
 ZCAbpXYRAzTyGz23qkutRO+6MiKXoheE7gmd9jESiaqyXe2Q6mIMPyoFU50458TH
 xXhXbZc7War8vbJLyWF7fvK/GlooTHu4xOxfNTsxKWiYShI01iiwG1hH+j4ZDVOG
 NBBK2AfX9GM8AOHJolp5EaGmon0AoVsxbRANSs1K4qZ93WTNGLk=
 =LoG2
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:

 - lkdtm/bugs: add test for hung smp_call_function_single() (Mark
   Rutland)

 - gcc-plugins: Remove duplicate included header file stringpool.h
   (Thorsten Blum)

 - ARM: Remove address checking for MMUless devices (Yanjun Yang)

 - randomize_kstack: Clean up per-arch entropy and codegen

 - KCFI: Make FineIBT mode Kconfig selectable

 - fortify: Do not special-case 0-sized destinations

* tag 'hardening-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  randomize_kstack: Improve stack alignment codegen
  ARM: Remove address checking for MMUless devices
  gcc-plugins: Remove duplicate included header file stringpool.h
  randomize_kstack: Remove non-functional per-arch entropy filtering
  fortify: Do not special-case 0-sized destinations
  x86/alternatives: Make FineIBT mode Kconfig selectable
  lkdtm/bugs: add test for hung smp_call_function_single()
2024-07-16 13:45:43 -07:00
Linus Torvalds
e55037c879 EFI updates for v6.11
- Drop support for the 'fake' EFI memory map on x86
 
 - Add an SMBIOS based tweak to the EFI stub instructing the firmware on
   x86 Macbook Pros to keep both GPUs enabled
 
 - Replace 0-sized array with flexible array in EFI memory attributes
   table handling
 
 - Drop redundant BSS clearing when booting via the native PE entrypoint
   on x86
 
 - Avoid returning EFI_SUCCESS when aborting on an out-of-memory
   condition
 
 - Cosmetic tweak for arm64 KASLR loading logic
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCZpTg5gAKCRAwbglWLn0t
 XOrOAQCpZjtjkPRPCBY+t3wUl84rOKiPr1SMHyL50Zl8udJKegD/bnwWSgX3FzLQ
 TN+xjnK7IAxEoKAEWt8lnt04cH5r3As=
 =7VWO
 -----END PGP SIGNATURE-----

Merge tag 'efi-next-for-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI updates from Ard Biesheuvel:
 "Note the removal of the EFI fake memory map support - this is believed
  to be unused and no longer worth supporting. However, we could easily
  bring it back if needed.

  With recent developments regarding confidential VMs and unaccepted
  memory, combined with kexec, creating a known inaccurate view of the
  firmware's memory map and handing it to the OS is a feature we can
  live without, hence the removal. Alternatively, I could imagine making
  this feature mutually exclusive with those confidential VM related
  features, but let's try simply removing it first.

  Summary:

   - Drop support for the 'fake' EFI memory map on x86

   - Add an SMBIOS based tweak to the EFI stub instructing the firmware
     on x86 Macbook Pros to keep both GPUs enabled

   - Replace 0-sized array with flexible array in EFI memory attributes
     table handling

   - Drop redundant BSS clearing when booting via the native PE
     entrypoint on x86

   - Avoid returning EFI_SUCCESS when aborting on an out-of-memory
     condition

   - Cosmetic tweak for arm64 KASLR loading logic"

* tag 'efi-next-for-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi: Replace efi_memory_attributes_table_t 0-sized array with flexible array
  efi: Rename efi_early_memdesc_ptr() to efi_memdesc_ptr()
  arm64/efistub: Clean up KASLR logic
  x86/efistub: Drop redundant clearing of BSS
  x86/efistub: Avoid returning EFI_SUCCESS on error
  x86/efistub: Call Apple set_os protocol on dual GPU Intel Macs
  x86/efistub: Enable SMBIOS protocol handling for x86
  efistub/smbios: Simplify SMBIOS enumeration API
  x86/efi: Drop support for fake EFI memory maps
2024-07-16 12:22:07 -07:00
Linus Torvalds
408323581b - Add support for running the kernel in a SEV-SNP guest, over a Secure
VM Service Module (SVSM).
 
    When running over a SVSM, different services can run at different
    protection levels, apart from the guest OS but still within the
    secure SNP environment.  They can provide services to the guest, like
    a vTPM, for example.
 
    This series adds the required facilities to interface with such a SVSM
    module.
 
  - The usual fixlets, refactoring and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaWQuoACgkQEsHwGGHe
 VUrmEw/+KqM5DK5cfpue3gn0RfH6OYUoFxOdYhGkG53qUMc3c3ka5zPVqLoHPkzp
 WPXha0Z5pVdrcD9mKtVUW9RIuLjInCM/mnoNc3tIUL+09xxemAjyG1+O+4kodiU7
 sZ5+HuKUM2ihoC4Rrm+ApRrZfH4+WcgQNvFky77iObWVBo4yIscS7Pet/MYFvuuz
 zNaGp2SGGExDeoX/pMQNI3S9FKYD26HR17AUI3DHpS0teUl2npVi4xDjFVYZh0dQ
 yAhTKbSX3Q6ekDDkvAQUbxvWTJw9qoIsvLO9dvZdx6SSWmzF9IbuECpQKGQwYcp+
 pVtcHb+3MwfB+nh5/fHyssRTOZp1UuI5GcmLHIQhmhQwCqPgzDH6te4Ud1ovkxOu
 3GoBre7KydnQIyv12I+56/ZxyPbjHWmn8Fg106nAwGTdGbBJhfcVYfPmPvwpI4ib
 nXpjypvM8FkLzLAzDK6GE9QiXqJJlxOn7t66JiH/FkXR4gnY3eI8JLMfnm5blAb+
 97LC7oyeqtstWth9/4tpCILgPR2tirrMQGjUXttgt+2VMzqnEamnFozsKvR95xok
 4j6ulKglZjdpn0ixHb2vAzAcOJvD7NP147jtCmXH7M6/f9H1Lih3MKdxX98MVhWB
 wSp16udXHzu5lF45J0BJG8uejSgBI2y51jc92HLX7kRULOGyaEo=
 =u15r
 -----END PGP SIGNATURE-----

Merge tag 'x86_sev_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SEV updates from Borislav Petkov:

 - Add support for running the kernel in a SEV-SNP guest, over a Secure
   VM Service Module (SVSM).

   When running over a SVSM, different services can run at different
   protection levels, apart from the guest OS but still within the
   secure SNP environment. They can provide services to the guest, like
   a vTPM, for example.

   This series adds the required facilities to interface with such a
   SVSM module.

 - The usual fixlets, refactoring and cleanups

[ And as always: "SEV" is AMD's "Secure Encrypted Virtualization".

  I can't be the only one who gets all the newer x86 TLA's confused,
  can I?
              - Linus ]

* tag 'x86_sev_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation/ABI/configfs-tsm: Fix an unexpected indentation silly
  x86/sev: Do RMP memory coverage check after max_pfn has been set
  x86/sev: Move SEV compilation units
  virt: sev-guest: Mark driver struct with __refdata to prevent section mismatch
  x86/sev: Allow non-VMPL0 execution when an SVSM is present
  x86/sev: Extend the config-fs attestation support for an SVSM
  x86/sev: Take advantage of configfs visibility support in TSM
  fs/configfs: Add a callback to determine attribute visibility
  sev-guest: configfs-tsm: Allow the privlevel_floor attribute to be updated
  virt: sev-guest: Choose the VMPCK key based on executing VMPL
  x86/sev: Provide guest VMPL level to userspace
  x86/sev: Provide SVSM discovery support
  x86/sev: Use the SVSM to create a vCPU when not in VMPL0
  x86/sev: Perform PVALIDATE using the SVSM when not at VMPL0
  x86/sev: Use kernel provided SVSM Calling Areas
  x86/sev: Check for the presence of an SVSM in the SNP secrets page
  x86/irqflags: Provide native versions of the local_irq_save()/restore()
2024-07-16 11:12:25 -07:00
Linus Torvalds
b84b338190 - Enable Sub-NUMA clustering to work with resource control on Intel by
teaching resctrl to handle scopes due to the clustering which
    partitions the L3 cache into sets. Modify and extend the subsystem to
    handle such scopes properly
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaWGPcACgkQEsHwGGHe
 VUrLtQ/9GnY6EZDXQf6gF50FuasOrjaJw3bzSN6N0Hy28BEgG0fFrZzAKYRUvJXl
 s16JkgQrQB3JaoT4bwcaSvMvBTtc+1cDuxMYI3C7jtBkjGFRwOgsCp/Hr2xujaKK
 IfOJNmDLx2YRuxFyfi1FK4b1YqZ1gtg5FcmmaelBCu/rkQcBC9S7VtqGqCjwmhxy
 l5WVDzMdXB++cxEJz1fBCyjdPgAwhEmNm0fnxGc0je1EvJUczd2o8Us3ND8Sw5x1
 +5JL4PjwSMlFa71yw+rTzUs9u01SAI3IxvU6sPhmxhr3O4is4rGusyUldiz1598r
 U+bYWivGn1ksVPifo0c6UUtbpaO9KLAnxsiRct7FKZdBfaqXi13twi1918aVyECJ
 8pW0R8c/W3kQYMPOlhwBIzJp31rPzAxu70k9DT0cShAzKk/EbIWZAuZGqMz9bhfS
 pcfCdD+36C/jN57KIhzo3GamzgHee40MQMLBKjFe1etZFit2EjyUK/jZhdYZWckj
 +mOyWLngLVzF2mIkFrmw4VDRHsSqZlBGSHwHyiC+J+lL+nO9N9xQrtxm4z8TimLY
 QquDSTYdqi2dGYVpN4vIOktn40A43UxirKC1X3fVqQRz71LcYGe28tMlQ99kUUJR
 H8PGajlxfSB1CWNZpgaHGTMzU09ojHvJYmXy2p1HJf4fcBiXOV4=
 =LITm
 -----END PGP SIGNATURE-----

Merge tag 'x86_cache_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 resource control updates from Borislav Petkov:

 - Enable Sub-NUMA clustering to work with resource control on Intel by
   teaching resctrl to handle scopes due to the clustering which
   partitions the L3 cache into sets. Modify and extend the subsystem to
   handle such scopes properly

* tag 'x86_cache_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Update documentation with Sub-NUMA cluster changes
  x86/resctrl: Detect Sub-NUMA Cluster (SNC) mode
  x86/resctrl: Enable shared RMID mode on Sub-NUMA Cluster (SNC) systems
  x86/resctrl: Make __mon_event_count() handle sum domains
  x86/resctrl: Fill out rmid_read structure for smp_call*() to read a counter
  x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC) mode
  x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor files
  x86/resctrl: Allocate a new field in union mon_data_bits
  x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function
  x86/resctrl: Initialize on-stack struct rmid_read instances
  x86/resctrl: Add a new field to struct rmid_read for summation of domains
  x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files
  x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  x86/resctrl: Prepare for different scope for control/monitor operations
  x86/resctrl: Prepare to split rdt_domain structure
  x86/resctrl: Prepare for new domain scope
2024-07-16 10:53:54 -07:00
Linus Torvalds
d679783188 - Flip the logic to add feature names to /proc/cpuinfo to having to
explicitly specify the flag if there's a valid reason to show it in
   /proc/cpuinfo
 
 - Switch a bunch of Intel x86 model checking code to the new CPU model
   defines
 
 - Fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaVZ+EACgkQEsHwGGHe
 VUqTgA//aJez6C5SmuqIofqgimr+8JGNThf4vFB3O9tN0ony3IR8IRieF+sOZFXE
 WVyN7KOhPs2XvNzVAaJpzWUcg/E2bXzVrOKfx3uFiyNiBttKLVot7Hl640wqWGoG
 eTViTpQ6IALY7lEI6vFNXz+4Ja5PWmHxWdBkvP9ehSvqNxHivTWL4HQ11pcCWQEA
 i+V37PbOHsnH7ZprJtaV0ihtjFblk9/R4qoZuT3SObhG0QDJK4Q7yYUelxXMUUgD
 Yo3nXluQl6Vc5dD2ULYkTlhzMxoZUMURty897vYSsZz49ZXsS6fsvd+BheSQVOv1
 hzaqqFYijdIpPI1zwgAPM+e6S/EAafpNVcEkjhHGZIJehwXm3teoSlX5tK2NPGoe
 PLYrwPWAzagdS3dWvrvBYT3Bu7pygieDSyPFfVP2XQsElHsWhYvBtxeH/uUwm+v4
 xjtXaJUj9eznChPaDZhCl8ioh9szUKHsh2NJ5ND7qpxPCFpz1Xj9ZmbIYTjHEgjG
 IT8dFfykKdyh5htJWw/P8LbexpEMTmu/LDrDXt+tFsDLBKIkeLiP3h8+yDR+vJ7K
 OGBjY2ciSi9Wy9ynunCOCNHNBdia1qc3AJWSg/2YP4NW+RzRLe6cIs+Ih4s1N5lx
 ADvw+TA9CAKo1KASyOVYAxq7h4xlsyH6jbCC3ZW3P/a+Bs8smqM=
 =SEED
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu model updates from Borislav Petkov:

 - Flip the logic to add feature names to /proc/cpuinfo to having to
   explicitly specify the flag if there's a valid reason to show it in
   /proc/cpuinfo

 - Switch a bunch of Intel x86 model checking code to the new CPU model
   defines

 - Fixes and cleanups

* tag 'x86_cpu_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu/intel: Drop stray FAM6 check with new Intel CPU model defines
  x86/cpufeatures: Flip the /proc/cpuinfo appearance logic
  x86/CPU/AMD: Always inline amd_clear_divider()
  x86/mce/inject: Add missing MODULE_DESCRIPTION() line
  perf/x86/rapl: Switch to new Intel CPU model defines
  x86/boot: Switch to new Intel CPU model defines
  x86/cpu: Switch to new Intel CPU model defines
  perf/x86/intel: Switch to new Intel CPU model defines
  x86/virt/tdx: Switch to new Intel CPU model defines
  x86/PCI: Switch to new Intel CPU model defines
  x86/cpu/intel: Switch to new Intel CPU model defines
  x86/platform/intel-mid: Switch to new Intel CPU model defines
  x86/pconfig: Remove unused MKTME pconfig code
  x86/cpu: Remove useless work in detect_tme_early()
2024-07-15 20:25:16 -07:00
Linus Torvalds
2439a5eaa7 - Add a spectre_bhi=vmexit mitigation option aimed at cloud
environments
 
  - Remove duplicated Spectre cmdline option documentation
 
  - Add separate macro definitions for syscall handlers which do not
    return in order to address objtool warnings
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaVXXMACgkQEsHwGGHe
 VUrd3A/9FFJZcpxdpWJikyEskb3CO1xthfM/6QvV5U3/Nldpz4aROEteqsMYc+xB
 OcA/RkCc8mBBFuydZjNxlNwyMXkoab/rQJC/Dz7q1O61sho4RWk8yCh6xM1JRofF
 WeKGCClz1KnsCc8FlVaHAEhp6gBMJiiqawjXBklfHhUqmbY7UZgcAyeM3uMIwAEG
 qCS7opOSZVijJadoyvROf5na23hggUVO++qS4HYT66G3bI3MdEEWp06dUxXBD/Er
 2zRAY6III4wuGTxe8L49ftsyW9RS7AKY2rUmhpffkeA8tLYBfXogYVSQYyR3S9Ou
 gZg9Yeu64rjqZZUYpzRR+kATUpuSKO6nQBHxd+ICRIUbzSmXUNzvPTi5SWSWh2vC
 HTLgFbGXxg8fLlpqCJ21oaU982w3eteOJ+wgf/AH3hBykFljck9EcaGsaQ5OfeDE
 MA0XaDy2V4jypyxmLpRfRIWJWtNVTgza2Jl0Dg3X+UipAXtvCvJzW1ZJ0ksA+2P0
 K1GeWy4tC51uFndeYpNC1eQ0cJjv1mfAugHcqgVdAhwMYUZdXchaPJHr/fcF7AEG
 xjV7fnoGK6WKKUni+Tnmom3FzBVDztKAtZ4iYgwIWReRj9bKLhP2k779rMXkCftt
 WtiencSCtVn+K/4acYBx0vbRKlDv769Lq64FZ8xNgGw6uRXjhhM=
 =AP9P
 -----END PGP SIGNATURE-----

Merge tag 'x86_bugs_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu mitigation updates from Borislav Petkov:

 - Add a spectre_bhi=vmexit mitigation option aimed at cloud
   environments

 - Remove duplicated Spectre cmdline option documentation

 - Add separate macro definitions for syscall handlers which do not
   return in order to address objtool warnings

* tag 'x86_bugs_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bugs: Add 'spectre_bhi=vmexit' cmdline option
  x86/bugs: Remove duplicate Spectre cmdline option descriptions
  x86/syscall: Mark exit[_group] syscall handlers __noreturn
2024-07-15 20:07:27 -07:00
Linus Torvalds
f998678baf - Add a unified VMware hypercall API layer which should be used by all
callers instead of them doing homegrown solutions. This will provide for
    adding API support for confidential computing solutions like TDX
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaVO4kACgkQEsHwGGHe
 VUp4tw//en2ywe8nqoO8a5WIxIcc6wtMTYEboqu5q7RzWJzHVRsAz72USeMlQgBB
 ywNnn2H0SgVqfcLOMkAzsEarvPUJR0ZThvYxyIStcFzqIWMbtuhazMx/tVsR+9jD
 LqIFWrSeXPE+w005srnXZb7qxvC4cDyGdRL9xHa6UoN/Io2oTEidNWs825KoLWhN
 OPqWfLrvm+Bb+JMaLYQC6UQsJk1ds91WlI3k7CdYk1sNgkTfwGHlDulwrhzM0oG0
 EcVBKW8xsOxg4ylYS5j42ykE1z+FUMpSQ+tq7fo/SUbrgTr55xhDpxi8rsS2P5xX
 fErsYBOEY228YT8V1fpaJMY1f7HLhZqy5jrODvDHCI6E3wasQuzl9Dc+OpwmN5NA
 gR9BQIoAQgpZSpTsCG6qJagx5FYmS3bY1yXmzEsTmrzmchXQ0QQqInJw61qdHO4F
 +LZYj7pOQzKlVEkrpBeWMnWMh+RmumaW0SsHVahvutzH3OA3yLjZl117S3dDiY7K
 A4cqaX4A0KeCSUkXha7NuSRDtDIevAYhIEvcoUr5Xv2FgRO2c7N1rzzCdH3ML0fZ
 Pzmjh24s91YqxY/s0YnJ57glKJfGcx0VKzPaw80/rxJ9sVb4HK2GkBOODuJhP8Iw
 rF8qIfEmRHsyJdvRkF6pSl7hIEJth/khW0qNRF8PivzCtnpDBO8=
 =4VPt
 -----END PGP SIGNATURE-----

Merge tag 'x86_vmware_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 vmware updates from Borislav Petkov:

 - Add a unified VMware hypercall API layer which should be used by all
   callers instead of them doing homegrown solutions. This will provide
   for adding API support for confidential computing solutions like TDX

* tag 'x86_vmware_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vmware: Add TDX hypercall support
  x86/vmware: Remove legacy VMWARE_HYPERCALL* macros
  x86/vmware: Correct macro names
  x86/vmware: Use VMware hypercall API
  drm/vmwgfx: Use VMware hypercall API
  input/vmmouse: Use VMware hypercall API
  ptp/vmware: Use VMware hypercall API
  x86/vmware: Introduce VMware hypercall API
2024-07-15 20:05:40 -07:00
Linus Torvalds
222dfb8326 - Make error checking of AMD SMN accesses more robust in the callers as
they're the only ones who can interpret the results properly
 
  - The usual cleanups and fixes, left and right
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaVOU0ACgkQEsHwGGHe
 VUqeFBAAl9X4bj08GwSAXfqBangXaGpKO4Nx0VZiFCYDkQ/TDnchMEBbpRWSuVzS
 SEnVSrcAXCxKqhv295UyFMmv2a+q3UUidkxTzRfznekMZMMylHYcfCFrg16w9ZNJ
 N/cBquTu96hSJHd2/usNUvNPLllTrMoIg3gofBav+NTaHQQDmzvM5htfewREY9OF
 SRS/86o3u5oIsRKKiJRyzfLzzX9lEGUvU+lvxv/yu1x2Q6SG0guhfM3HeaSxCIOs
 yeB23bwe/N/pO5KlqOtEJJL49Ypu2k/jfiS2rhH6AxSqNfXVpBlDbnahu9sA973n
 irzWwycJhVU4OQ3pqmPXdcKDqn7GmUWDsjrkEIOqJeBCSukmlM7APi8Ss8yGZ3X4
 HgDw10c900ldrxSo0H5PdpeULvowpeptpzBY8gzcdum4s0vNUvZLy/n1AKo7ydea
 oJ+ZBdXvywnR66uGQLkTxLvpGTNgyFrKDORHuyOAwJTN5CbLuco2SV/82mkcQCZt
 sAgyiWFvIcLoHZPfY8BNztYWVX01lWDIxFHJE8ca/B97mBeZCC3w1DnHJla8Kxsg
 zCMV0yn61BdMvjVS9AGaKqEuN0gYYrs/QOjtOp5ggAv7QC1ke/wqgZoFGvLbmcP9
 pIf8GzCt34u3tACGAl76toP0rtnMjGvKD8xXdHGHf7AAj1jKo28=
 =rd6Q
 -----END PGP SIGNATURE-----

Merge tag 'x86_misc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 updates from Borislav Petkov:

 - Make error checking of AMD SMN accesses more robust in the callers as
   they're the only ones who can interpret the results properly

 - The usual cleanups and fixes, left and right

* tag 'x86_misc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kmsan: Fix hook for unaligned accesses
  x86/platform/iosf_mbi: Convert PCIBIOS_* return codes to errnos
  x86/pci/xen: Fix PCIBIOS_* return code handling
  x86/pci/intel_mid_pci: Fix PCIBIOS_* return code handling
  x86/of: Return consistent error type from x86_of_pci_irq_enable()
  hwmon: (k10temp) Rename _data variable
  hwmon: (k10temp) Remove unused HAVE_TDIE() macro
  hwmon: (k10temp) Reduce k10temp_get_ccd_support() parameters
  hwmon: (k10temp) Define a helper function to read CCD temperature
  x86/amd_nb: Enhance SMN access error checking
  hwmon: (k10temp) Check return value of amd_smn_read()
  EDAC/amd64: Check return value of amd_smn_read()
  EDAC/amd64: Remove unused register accesses
  tools/x86/kcpuid: Add missing dir via Makefile
  x86, arm: Add missing license tag to syscall tables files
2024-07-15 19:53:07 -07:00
Linus Torvalds
98896d8795 - Unrelated x86/cc changes queued here to avoid ugly cross-merges and
conflicts:
 
    - Carve out CPU hotplug function declarations into a separate header
      with the goal to be able to use the lockdep assertions in a more
      flexible manner
 
    - As a result, refactor cacheinfo code after carving out a function
      to return the cache ID associated with a given cache level
 
    -  Cleanups
 
 - Add support to be able to kexec TDX guests. For that
 
    - Expand ACPI MADT CPU offlining support
 
    - Add machinery to prepare CoCo guests memory before kexec-ing into a new
      kernel
 
    - Cleanup, readjust and massage related code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaVCYoACgkQEsHwGGHe
 VUoi6g//Up/4vMzcjqzrndXfl0aP+NpK4zNud+ZPP4Qza2yPhKydniMvkWVQ8DTx
 jQaGk/tJDeFG6ofOzGkmBGyuZzuO4D7E0XFyXZZeVgSvdk2Af5vaWu1D3e4i4MiM
 Ox4H8NtWnC4MozP0hos4qB0vtYaBWVJkNvIXDVF6162zLwEmbuyrpFe3glscwIxv
 hMZR/C47RHcEeOb7yA4m/gJ+AqMe9OKradoNJkkfDpnYr6CYsbmpY09or2WYuvoI
 0gevkIe6Q9HMcq3CQl6/pR8IgbA5VmGi7iCiE1ihgTPwR3AaU8llzBqYdSgezFrk
 68A7oGeUZQeifQgjwkreZclMtsGEeGWVOB0Bh3Jgr6uaWGFXtpydi/hc73wbTz+F
 IazKQcKQYjaPW/9UG+0+cFTQlCgQ+WxwqAsN1uqzL6gMgmC9B+TM//xzk5nVxpOd
 ouf8T85tyceIPCKepGE/bWEHYYCjfbqBMyQT6RHmxUKbb1/PIsbzN26cenkZmPXT
 cpwurWVG7mRQJRqTrsS+D+opP1h/jOdkpwGlBfl1s0sX6RZuMFBk+7TlMMs61Cyo
 PWtrLV7Dr369cuXE72wIgfBAao2AS8kFshc7Atokq7/XfL9cCWHeqIcu7yvParP5
 WY43YQv8XPGI7ZnPqULByTY0Wxg8TFk8whamx97kEp8uy2HmbQU=
 =k+T+
 -----END PGP SIGNATURE-----

Merge tag 'x86_cc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 confidential computing updates from Borislav Petkov:
 "Unrelated x86/cc changes queued here to avoid ugly cross-merges and
  conflicts:

   - Carve out CPU hotplug function declarations into a separate header
     with the goal to be able to use the lockdep assertions in a more
     flexible manner

   - As a result, refactor cacheinfo code after carving out a function
     to return the cache ID associated with a given cache level

   - Cleanups

  Add support to be able to kexec TDX guests:

   - Expand ACPI MADT CPU offlining support

   - Add machinery to prepare CoCo guests memory before kexec-ing into a
     new kernel

   - Cleanup, readjust and massage related code"

* tag 'x86_cc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  ACPI: tables: Print MULTIPROC_WAKEUP when MADT is parsed
  x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
  x86/mm: Introduce kernel_ident_mapping_free()
  x86/smp: Add smp_ops.stop_this_cpu() callback
  x86/acpi: Do not attempt to bring up secondary CPUs in the kexec case
  x86/acpi: Rename fields in the acpi_madt_multiproc_wakeup structure
  x86/mm: Do not zap page table entries mapping unaccepted memory table during kdump
  x86/mm: Make e820__end_ram_pfn() cover E820_TYPE_ACPI ranges
  x86/tdx: Convert shared memory back to private on kexec
  x86/mm: Add callbacks to prepare encrypted memory for kexec
  x86/tdx: Account shared memory
  x86/mm: Return correct level from lookup_address() if pte is none
  x86/mm: Make x86_platform.guest.enc_status_change_*() return an error
  x86/kexec: Keep CR4.MCE set during kexec for TDX guest
  x86/relocate_kernel: Use named labels for less confusion
  cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup
  cpu/hotplug: Add support for declaring CPU offlining not supported
  x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init
  x86/acpi: Extract ACPI MADT wakeup code into a separate file
  x86/kexec: Remove spurious unconditional JMP from from identity_mapped()
  ...
2024-07-15 19:36:01 -07:00
Linus Torvalds
4578d072fa - Add a check to warn when cmdline parsing happens before the final cmdline
string has been built and thus arguments can get lost
 
  - Code cleanups and simplifications
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaVAgQACgkQEsHwGGHe
 VUrXsxAAsnJiihOXaU/VPfuRx5d/URufo1HxLPjR5D0YXuzCEbFUS3/9UleAsg0Z
 h/hKBPtC4o9OJWqo1EIbpmCaIqMxuYZgLEQ1n2tx60FGFVfY/9H8PmqPSgMdeoPC
 HBseXzLzNy6BWeIbRIc3FCk1MF1HR83hs1aiaCJVBm19kmz4n4aZ4zRr4CNIug+0
 6kNtLWiNYW2kw6J/2zoIStVkScIzxIFcMVz7KgA4S6RIOPLaints9Nf4jNl2mp5n
 UEZy9OQEgf8h+3KI5dB5uUhckuteQSSeL6K0YJ869pRN63hOtU7MCc8PSgMpPAbX
 4s/wKYRp2l4EfEOVCJimFs/yJKeIDjOW0ivuKJ/5DvqtyXG5PMBdt8HCBlpUb/cr
 Qi4dd4/u1pUk/vJpykZq/5H6zDWym2Q2WDjOCE8K2DOi3YBY+Ia7HrBXSyQyYAJ6
 Rq8Xu6Lq+Lqgg9/7HZizoc8y6wRyzhuYpkqJWvLN57rJ5dNNKKuJyuwCyAupw4o1
 b4gfQ5KgUyG8VAs7dSqhEBzL8zrXZlbOhkeDXUUHtKw6AxS9p4LDIzKVwc6QHdAe
 0V2soGoAYv24RoAEUeVEeaIHMkKdq600W/9yNFzogNvRvFyXp+jXCR3kCtNz6TJ2
 VvioFlJw4y99UPguKi/nzyTA1EdAVVhYYgl39wTnMDOQHxSv2o0=
 =GWkn
 -----END PGP SIGNATURE-----

Merge tag 'x86_boot_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 boot updates from Borislav Petkov:

 - Add a check to warn when cmdline parsing happens before the final
   cmdline string has been built and thus arguments can get lost

 - Code cleanups and simplifications

* tag 'x86_boot_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/setup: Warn when option parsing is done too early
  x86/boot: Clean up the arch/x86/boot/main.c code a bit
  x86/boot: Use current_stack_pointer to avoid asm() in init_heap()
2024-07-15 19:31:59 -07:00
Linus Torvalds
208c6772d3 - This is basically PeterZ's idea to nest the alternative macros to avoid the
need to "spell out" the number of alternates in an ALTERNATIVE_n() macro and
   thus have an ever-increasing complexity in those definitions.
 
   For ease of bisection, the old macros are converted to the new, nested
   variants in a step-by-step manner so that in case an issue is encountered
   during testing, one can pinpoint the place where it fails easier. Because
   debugging alternatives is a serious pain.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaU/+MACgkQEsHwGGHe
 VUpMGhAAqVbB3DZohv0Oa4BRRvaKFuQ3L7H0NTjK/pbT3EG+phol0zrHby2MnGjD
 HWXskps86n91QBB/06vAyZRimV/dvAPvlSKllsRx6ie3VCE4FJPzA4nTWQn/41dC
 HWamj78mQuSMgioLzIYdTY79KObtJcUw/X/xz+TTMemfkzkQxukKY7+Y71nZbuKi
 rUuCSrfAWNHQaIaoGs2JowGw7te7yNOtKQMCW5TdNLwvJfOAECuoLIFeiEcWHvoO
 uGl6FTABNLp26wmaeceUxdjBbTJcM3iV3joZQYED7B+mbJcU/a7tZw7I+mavPrbh
 Y6+EOn7rzR0wbcmj0iJ74TKr+uKDme/Qzm3YEKgGvJPj9tRjTDwxWRBnyTeCMbav
 NkKVwWTep8K+1qJtGVBwACY6iz89u3P8V5owD8O++KIPQa8rA0m8pN5gaU3PVYYQ
 D2UUdqXWIPIFoD4Sveb/WFU8OJKY+Nx7IK8KD03h5tiXW8MmGSa2e5b57gIfCLP7
 DbSHyCkTiqEdBrSM4/RaVVckD6NZ39M87H+iV51vYUkCYmODa/riMj0M7SVMi5Jo
 S/30jvdHEzWnmDBbOsn9d1XbvB5I+zz3BrcZQ2VSyBx+Y9m+SZ9qsyMEkOWw4uKM
 kfFp+XlYVWSnQFo7jY3UUIgVrU9dmdX1WxNX7/2HABDjg3MtWog=
 =tGap
 -----END PGP SIGNATURE-----

Merge tag 'x86_alternatives_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 alternatives updates from Borislav Petkov:
 "This is basically PeterZ's idea to nest the alternative macros to
  avoid the need to "spell out" the number of alternates in an
  ALTERNATIVE_n() macro and thus have an ever-increasing complexity in
  those definitions.

  For ease of bisection, the old macros are converted to the new, nested
  variants in a step-by-step manner so that in case an issue is
  encountered during testing, one can pinpoint the place where it fails
  easier.

  Because debugging alternatives is a serious pain"

* tag 'x86_alternatives_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/alternatives, kvm: Fix a couple of CALLs without a frame pointer
  x86/alternative: Replace the old macros
  x86/alternative: Convert the asm ALTERNATIVE_3() macro
  x86/alternative: Convert the asm ALTERNATIVE_2() macro
  x86/alternative: Convert the asm ALTERNATIVE() macro
  x86/alternative: Convert ALTERNATIVE_3()
  x86/alternative: Convert ALTERNATIVE_TERNARY()
  x86/alternative: Convert alternative_call_2()
  x86/alternative: Convert alternative_call()
  x86/alternative: Convert alternative_io()
  x86/alternative: Convert alternative_input()
  x86/alternative: Convert alternative_2()
  x86/alternative: Convert alternative()
  x86/alternatives: Add nested alternatives macros
  x86/alternative: Zap alternative_ternary()
2024-07-15 19:11:28 -07:00
Linus Torvalds
1467b49869 - A cleanup and a correction to the error injection driver to inject
a MCA_MISC value only when one has actually been supplied by the user
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaU+QYACgkQEsHwGGHe
 VUp5aBAAh3zED0mymD4Eiutfaue1Ccwa9b8OyvkFLSX+qEuvoHAbh5vZgwGyJum6
 h1cVfw4v43XQE16DVcGXacITjrQfstb67cyj5AKMINZmCvWb+XVVngdavRJs7fsC
 gPWSEtKPWzm+Vu8jk3IYKeAwqLQnlqDlwFp1rGkMyS8KPjBXrnEnh4+Q1QaxXvdF
 XbtTRyKy1PVAA3cgAbW/FFRnZMj8KJQOSyXnYQEmPHTrPAvUufNjXmjCfV27H5/n
 kNKDu+/fY9+Lj8nPrll0j4fXymVaacZvbnUrB2Um3twouyEyBRMHDQL+xRUxhRGO
 iJs+lPBo/hW/+4Vx5BZt6jk4w95ZVldf/sIwxdMk9EDsheGg95WMd2gD1NU0gG+8
 8wGKVYh/eowkuJcNj6EiSvXDeYMeEKPeB9JTButzRM39lAOJYHcJxDaLKjxldR7e
 1HPETsc+tUP0ESuFWT4f1WSZjsc3J4fg89bx3I8gBeOgl5T8ykcpYsTuTdAzdfHM
 xPiC2WTP9WYqnfCUXFSz/yUvaAa1O6TfW2hTqGf+rJMfDBlohk3QBR59JJjO4IkY
 nkYKfc6BcIQtJZ9VQSS26f78P+NKALxReJNS83gPtq4sduDytpLCvLf4ts6dy3d7
 rfr9IL+adutOOy/DjDZ6DVhITGKYjwJEKcAQ3fNBxn0PsSOZUjI=
 =yg+A
 -----END PGP SIGNATURE-----

Merge tag 'ras_core_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS updates from Borislav Petkov:

 - A cleanup and a correction to the error injection driver to inject a
   MCA_MISC value only when one has actually been supplied by the user

* tag 'ras_core_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Remove unused variable and return value in machine_check_poll()
  x86/mce/inject: Only write MCA_MISC when a value has been supplied
2024-07-15 18:22:48 -07:00
Linus Torvalds
4fd9435641 Updates for timers, timekeeping and related functionality:
- Core:
 
     - Make the takeover of a hrtimer based broadcast timer reliable during
       CPU hot-unplug. The current implementation suffers from a race which
       can lead to broadcast timer starvation in the worst case.
 
     - VDSO related cleanups and simplifications
 
     - Small cleanups and enhancements all over the place
 
   - PTP:
 
     - Replace the architecture specific base clock to clocksource, e.g. ART
       to TSC, conversion function with generic functionality to avoid
       exposing such internals to drivers and convert all existing drivers
       over. This also allows to provide functionality which converts the
       other way round in the core code based on the same parameter set.
 
     - Provide a function to convert CLOCK_REALTIME to the base clock to
       support the upcoming PPS output driver on Intel platforms.
 
   - Drivers:
 
     - A set of Device Tree bindings for new hardware
 
     - Cleanups and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmaUOM0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofolD/9kK+aYdDj1gCFuZXZ2wTgMMxFmf/91
 0UcsGRuBJiIXs3H3iizQ0Mb0cdTW6qZJoBp0jPlvUSm0BEKdEgE1uRX2RuAPZ/Gq
 4/54ZJVopKSgAqeJFmqQubRVSv2XdMRAAJT0o1oUG3jZ0c6u8vqArIh5ZCnu13l/
 tsNOeYLYzQFyA30eHSJ/KjQ2zHwAhJnl5a/b7pdAvxmlN37bGgKEpglv+9zwFiDB
 K/kWbpb/oED9WOmoQy5QYi8iSvLQHEhFGrqzXV3fegu/B/mBBf/bpsisVx7Z1m2R
 nzxNqg86RdMjNR6giwBETZjm7YxM+gKb9nCBNILjbjWZFC4tyrBkLGJ+KniTRNyZ
 M5R4X1oP/14h00qXmCgIEFWysXaJRewYI+TIm8R2rLXrR6Tf3c4oL6fHQJxy3X52
 7A+4Z/vOk/KX6PxYmLC+xQDukhFh2nirVYsP1oNM9yC9zR/wkBBXTTmUSAI+8m8l
 KphniSPS2HMSBI6TtgOT8SKY7lRUZTnafBZq7wRXCv0Zz8AXoofgQDmBkXC99BkB
 MjLvRotJVJvY9a8LtA7htjDg/jiEMa0wHRNAGNSbflKoAKrJzoE5WbFxFZKbq3vZ
 o8cEYRMAIP+X+qn+oymT45XXXQlifZiccJdAi9FqDTvplEib2jmTmH6Ae5Khkr4l
 Lbzh/nSKVN7lOg==
 =8GjP
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2024-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "Updates for timers, timekeeping and related functionality:

  Core:

   - Make the takeover of a hrtimer based broadcast timer reliable
     during CPU hot-unplug. The current implementation suffers from a
     race which can lead to broadcast timer starvation in the worst
     case.

   - VDSO related cleanups and simplifications

   - Small cleanups and enhancements all over the place

  PTP:

   - Replace the architecture specific base clock to clocksource, e.g.
     ART to TSC, conversion function with generic functionality to avoid
     exposing such internals to drivers and convert all existing drivers
     over. This also allows to provide functionality which converts the
     other way round in the core code based on the same parameter set.

   - Provide a function to convert CLOCK_REALTIME to the base clock to
     support the upcoming PPS output driver on Intel platforms.

  Drivers:

   - A set of Device Tree bindings for new hardware

   - Cleanups and enhancements all over the place"

* tag 'timers-core-2024-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  clocksource/drivers/realtek: Add timer driver for rtl-otto platforms
  dt-bindings: timer: Add schema for realtek,otto-timer
  dt-bindings: timer: Add SOPHGO SG2002 clint
  dt-bindings: timer: renesas,tmu: Add R-Car Gen2 support
  dt-bindings: timer: renesas,tmu: Add RZ/G1 support
  dt-bindings: timer: renesas,tmu: Add R-Mobile APE6 support
  clocksource/drivers/mips-gic-timer: Correct sched_clock width
  clocksource/drivers/mips-gic-timer: Refine rating computation
  clocksource/drivers/sh_cmt: Address race condition for clock events
  clocksource/driver/arm_global_timer: Remove unnecessary ‘0’ values from err
  clocksource/drivers/arm_arch_timer: Remove unnecessary ‘0’ values from irq
  tick/broadcast: Make takeover of broadcast hrtimer reliable
  tick/sched: Combine WARN_ON_ONCE and print_once
  x86/vdso: Remove unused include
  x86/vgtod: Remove unused typedef gtod_long_t
  x86/vdso: Fix function reference in comment
  vdso: Add comment about reason for vdso struct ordering
  vdso/gettimeofday: Clarify comment about open coded function
  timekeeping: Add missing kernel-doc function comments
  tick: Remove unnused tick_nohz_get_idle_calls()
  ...
2024-07-15 15:03:09 -07:00
Linus Torvalds
a5819099f6 Merge branch 'runtime-constants'
Merge runtime constants infrastructure with implementations for x86 and
arm64.

This is one of four branches that came out of me looking at profiles of
my kernel build filesystem load on my 128-core Altra arm64 system, where
pathname walking and the user copies (particularly strncpy_from_user()
for fetching the pathname from user space) is very hot.

This is a very specialized "instruction alternatives" model where the
dentry hash pointer and hash count will be constants for the lifetime of
the kernel, but the allocation are not static but done early during the
kernel boot.  In order to avoid the pointer load and dynamic shift, we
just rewrite the constants in the instructions in place.

We can't use the "generic" alternative instructions infrastructure,
because different architectures do it very differently, and it's
actually simpler to just have very specific helpers, with a fallback to
the generic ("old") model of just using variables for architectures that
do not implement the runtime constant patching infrastructure.

Link: https://lore.kernel.org/all/CAHk-=widPe38fUNjUOmX11ByDckaeEo9tN4Eiyke9u1SAtu9sA@mail.gmail.com/

* runtime-constants:
  arm64: add 'runtime constant' support
  runtime constants: add x86 architecture support
  runtime constants: add default dummy infrastructure
  vfs: dcache: move hashlen_hash() from callers into d_hash()
2024-07-15 08:36:13 -07:00
Thomas Gleixner
b7625d67eb - Remove unnecessary local variables initialization as they will be
initialized in the code path anyway right after on the ARM arch
   timer and the ARM global timer (Li kunyu)
 
 - Fix a race condition in the interrupt leading to a deadlock on the
   SH CMT driver. Note that this fix was not tested on the platform
   using this timer but the fix seems reasonable enough to be picked
   confidently (Niklas Söderlund)
 
 - Increase the rating of the gic-timer and use the configured width
   clocksource register on the MIPS architecture (Jiaxun Yang)
 
 - Add the DT bindings for the TMU on the Renesas platforms (Geert
   Uytterhoeven)
 
 - Add the DT bindings for the SOPHGO SG2002 clint on RiscV (Thomas
   Bonnefille)
 
 - Add the rtl-otto timer driver along with the DT bindings for the
   Realtek platform (Chris Packham)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGn3N4YVz0WNVyHskqDIjiipP6E8FAmaRQh0ACgkQqDIjiipP
 6E+rfQgAqkAWZ9BjswxV8Fg+Hj+a1cSohKjDczqitQF5rJm25X5VvMwlXVa3XQGm
 yemh4tKPpll02LOiYCTyqOWzNrkVS9VsoBd5rrYjRX5aSv7UD35EXklLj4P/INwX
 O9CRGD6aK4Xbw66xxheYHSSh+2iRs2x2mq61+/VdcIBlAwpQo+vx7McRoJZZI+2t
 NFIXw8RF5dDlmmAaqiB0WnPAtcOK3SDo9fu1LEAX1ZAzvbZriLo7XLnL7ibySWVe
 BW1n7Ore6PN5Dvz7jMfTsOQsgAlVv6MPfp/s4EDqMfBLVqXNirzXrdhiee/ahnYP
 vyzQyU5HPCMiIYS45mhJF0OyDd3wyw==
 =wuYA
 -----END PGP SIGNATURE-----

Merge tag 'timers-v6.11-rc1' of https://git.linaro.org/people/daniel.lezcano/linux into timers/core

Pull clocksource/event driver updates from Daniel Lezcano:

  - Remove unnecessary local variables initialization as they will be
    initialized in the code path anyway right after on the ARM arch
    timer and the ARM global timer (Li kunyu)

  - Fix a race condition in the interrupt leading to a deadlock on the
    SH CMT driver. Note that this fix was not tested on the platform
    using this timer but the fix seems reasonable enough to be picked
    confidently (Niklas Söderlund)

  - Increase the rating of the gic-timer and use the configured width
    clocksource register on the MIPS architecture (Jiaxun Yang)

  - Add the DT bindings for the TMU on the Renesas platforms (Geert
    Uytterhoeven)

  - Add the DT bindings for the SOPHGO SG2002 clint on RiscV (Thomas
    Bonnefille)

  - Add the rtl-otto timer driver along with the DT bindings for the
    Realtek platform (Chris Packham)

Link: https://lore.kernel.org/all/91cd05de-4c5d-4242-a381-3b8a4fe6a2a2@linaro.org
2024-07-13 12:07:10 +02:00
Borislav Petkov (AMD)
38918e0bb2 x86/sev: Move SEV compilation units
A long time ago it was agreed upon that the coco stuff needs to go where
it belongs:

  https://lore.kernel.org/all/Yg5nh1RknPRwIrb8@zn.tnic

and not keep it in arch/x86/kernel. TDX did that and SEV can't find time
to do so. So lemme do it. If people have trouble converting their
ongoing featuritis patches, ask me for a sed script.

No functional changes.

Move the instrumentation exclusion bits too, as helpfully caught and
reported by the 0day folks.

Closes: https://lore.kernel.org/oe-kbuild-all/202406220748.hG3qlmDx-lkp@intel.com
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202407091342.46d7dbb-oliver.sang@intel.com
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Ashish Kalra <ashish.kalra@amd.com>
Tested-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/r/20240619093014.17962-1-bp@kernel.org
2024-07-11 11:55:58 +02:00
Rafael J. Wysocki
9dabb5b48f Merge back cpufreq material for 6.11. 2024-07-10 13:03:11 +02:00
Daniel Vetter
86634fa4e6 Linux 6.10-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmaB0NweHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGkvwH/36UJRk/o6wvXnyH
 E6QjCSWo2226APyWks22NjtC3I/8Iqdvkneuh6wG0qL2sXAB078EMjUq5R81bF8H
 wWFBJwetjYTp8GEyLioMEb2wCH/J3R29dLFC4UYTplafXRGP6//xcpJaKmTxcgdR
 31IzvTPXbApZ7L3k1U6rA2bK9PNKcFCOvZlrNMUCuwMrabymHsDfOUt1DqXyg2xp
 zjqiWYBwlklozmgawSWt/mdEgkWuTcAbg+KyqDVQF59s9aj/OOwZ0j+HACq5V8CM
 quTPIAYL6CC9p7uxa69lGr/sgC0Is/BZLPX7RTZAwCgarGvnX+1HUsjDcaFCtrVg
 O6fPUV8=
 =pgUx
 -----END PGP SIGNATURE-----

Merge v6.10-rc6 into drm-next

The exynos-next pull is based on a newer -rc than drm-next. hence
backmerge first to make sure the unrelated conflicts we accumulated
don't end up randomly in the exynos merge pull, but are separated out.

Conflicts are all benign: Adjacent changes in amdgpu and fbdev-dma
code, and cherry-pick conflict in xe.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2024-07-05 10:47:28 +02:00
Yosry Ahmed
b7c35279e0 x86/mm: Cleanup prctl_enable_tagged_addr() nr_bits error checking
There are two separate checks in prctl_enable_tagged_addr() that nr_bits
is in the correct range. The checks are arranged such the correct case
is sandwiched between both error cases, which do exactly the same thing.

Simplify the if condition and pull the correct case outside with the
rest of the success code path.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20240702132139.3332013-4-yosryahmed%40google.com
2024-07-02 11:33:44 -07:00
Yosry Ahmed
ec225f8c25 x86/mm: Fix LAM inconsistency during context switch
LAM can only be enabled when a process is single-threaded.  But _kernel_
threads can temporarily use a single-threaded process's mm.  That means
that a context-switching kernel thread can race and observe the mm's LAM
metadata (mm->context.lam_cr3_mask) change.

The context switch code does two logical things with that metadata:
populate CR3 and populate 'cpu_tlbstate.lam'.  If it hits this race,
'cpu_tlbstate.lam' and CR3 can end up out of sync.

This de-synchronization is currently harmless.  But it is confusing and
might lead to warnings or real bugs.

Update set_tlbstate_lam_mode() to take in the LAM mask and untag mask
instead of an mm_struct pointer, and while we are at it, rename it to
cpu_tlbstate_update_lam(). This should also make it clearer that we are
updating cpu_tlbstate. In switch_mm_irqs_off(), read the LAM mask once
and use it for both the cpu_tlbstate update and the CR3 update.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20240702132139.3332013-3-yosryahmed%40google.com
2024-07-02 11:32:16 -07:00
Yosry Ahmed
3b299b9955 x86/mm: Use IPIs to synchronize LAM enablement
LAM can only be enabled when a process is single-threaded.  But _kernel_
threads can temporarily use a single-threaded process's mm.

If LAM is enabled by a userspace process while a kthread is using its
mm, the kthread will not observe LAM enablement (i.e.  LAM will be
disabled in CR3). This could be fine for the kthread itself, as LAM only
affects userspace addresses. However, if the kthread context switches to
a thread in the same userspace process, CR3 may or may not be updated
because the mm_struct doesn't change (based on pending TLB flushes). If
CR3 is not updated, the userspace thread will run incorrectly with LAM
disabled, which may cause page faults when using tagged addresses.
Example scenario:

CPU 1                                   CPU 2
/* kthread */
kthread_use_mm()
                                        /* user thread */
                                        prctl_enable_tagged_addr()
                                        /* LAM enabled on CPU 2 */
/* LAM disabled on CPU 1 */
                                        context_switch() /* to CPU 1 */
/* Switching to user thread */
switch_mm_irqs_off()
/* CR3 not updated */
/* LAM is still disabled on CPU 1 */

Synchronize LAM enablement by sending an IPI to all CPUs running with
the mm_struct to enable LAM. This makes sure LAM is enabled on CPU 1
in the above scenario before prctl_enable_tagged_addr() returns and
userspace starts using tagged addresses, and before it's possible to
run the userspace process on CPU 1.

In switch_mm_irqs_off(), move reading the LAM mask until after
mm_cpumask() is updated. This ensures that if an outdated LAM mask is
written to CR3, an IPI is received to update it right after IRQs are
re-enabled.

[ dhansen: Add a LAM enabling helper and comment it ]

Fixes: 82721d8b25 ("x86/mm: Handle LAM on context switch")
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20240702132139.3332013-2-yosryahmed%40google.com
2024-07-02 11:31:51 -07:00
Tony Luck
13488150f5 x86/resctrl: Detect Sub-NUMA Cluster (SNC) mode
There isn't a simple hardware bit that indicates whether a CPU is running in
Sub-NUMA Cluster (SNC) mode. Infer the state by comparing the number of CPUs
sharing the L3 cache with CPU0 to the number of CPUs in the same NUMA node as
CPU0.

Add the missing definition of pr_fmt() to monitor.c. This wasn't noticed
before as there are only "can't happen" console messages from this file.

  [ bp: Massage commit message. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-19-tony.luck@intel.com
2024-07-02 20:02:11 +02:00
Tony Luck
21b362cc76 x86/resctrl: Enable shared RMID mode on Sub-NUMA Cluster (SNC) systems
Hardware has two RMID configuration options for SNC systems. The default
mode divides RMID counters between SNC nodes. E.g. with 200 RMIDs and
two SNC nodes per L3 cache RMIDs 0..99 are used on node 0, and 100..199
on node 1. This isn't compatible with Linux resctrl usage. On this
example system a process using RMID 5 would only update monitor counters
while running on SNC node 0.

The other mode is "RMID Sharing Mode". This is enabled by clearing bit
0 of the RMID_SNC_CONFIG (0xCA0) model specific register. In this mode
the number of logical RMIDs is the number of physical RMIDs (from CPUID
leaf 0xF) divided by the number of SNC nodes per L3 cache instance. A
process can use the same RMID across different SNC nodes.

See the "Intel Resource Director Technology Architecture Specification"
for additional details.

When SNC is enabled, update the MSR when a monitor domain is marked
online. Technically this is overkill. It only needs to be done once
per L3 cache instance rather than per SNC domain. But there is no harm
in doing it more than once, and this is not in a critical path.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240702173820.90368-3-tony.luck@intel.com
2024-07-02 19:57:51 +02:00
Tony Luck
9fbb303ec9 x86/resctrl: Make __mon_event_count() handle sum domains
Legacy resctrl monitor files must provide the sum of event values across
all Sub-NUMA Cluster (SNC) domains that share an L3 cache instance.

There are now two cases:
1) A specific domain is provided in struct rmid_read
   This is either a non-SNC system, or the request is to read data
   from just one SNC node.
2) Domain pointer is NULL. In this case the cacheinfo field in struct
   rmid_read indicates that all SNC nodes that share that L3 cache
   instance should have the event read and return the sum of all
   values.

Update the CPU sanity check. The existing check that an event is read
from a CPU in the requested domain still applies when reading a single
domain. But when summing across domains a more relaxed check that the
current CPU is in the scope of the L3 cache instance is appropriate
since the MSRs to read events are scoped at L3 cache level.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-17-tony.luck@intel.com
2024-07-02 19:57:22 +02:00
Tony Luck
c8c7d3d904 x86/resctrl: Fill out rmid_read structure for smp_call*() to read a counter
mon_event_read() fills out most fields of the struct rmid_read that is passed
via an smp_call*() function to a CPU that is part of the correct domain to
read the monitor counters.

With Sub-NUMA Cluster (SNC) mode there are now two cases to handle:

1) Reading a file that returns a value for a single domain.
   + Choose the CPU to execute from the domain cpu_mask

2) Reading a file that must sum across domains sharing an L3 cache
   instance.
   + Indicate to called code that a sum is needed by passing a NULL
     rdt_mon_domain pointer.
   + Choose the CPU from the L3 shared_cpu_map.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-16-tony.luck@intel.com
2024-07-02 19:57:19 +02:00
Tony Luck
6b48b80b08 x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC) mode
In SNC mode, there are multiple subdirectories in each L3 level monitor
directory (one for each SNC node). If all the CPUs in an SNC node are taken
offline, just remove the SNC directory for that node. In non-SNC mode, or when
the last SNC node directory is removed, remove the L3 monitor directory.

Add a helper function to avoid duplicated code.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240702173820.90368-2-tony.luck@intel.com
2024-07-02 19:51:06 +02:00
Tony Luck
0158ed6a13 x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor files
When SNC mode is enabled, create subdirectories and files to monitor at the SNC
node granularity. Legacy behavior is preserved by tagging the monitor files at
the L3 granularity with the "sum" attribute.  When the user reads these files
the kernel will read monitor data from all SNC nodes that share the same L3
cache instance and return the aggregated value to the user.

Note that the "domid" field for files that must sum across SNC domains has the
L3 cache instance id, while non-summing files use the domain id.

The "sum" files do not need to make a call to mon_event_read() to initialize
the MBM counters. This will be handled by initializing the individual SNC nodes
that share the L3.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-14-tony.luck@intel.com
2024-07-02 19:49:54 +02:00
Tony Luck
92b5d0b118 x86/resctrl: Allocate a new field in union mon_data_bits
When Sub-NUMA Cluster (SNC) mode is enabled, the legacy monitor reporting files
must report the sum of the data from all of the SNC nodes that share the L3
cache that is referenced by the monitor file.

Resctrl squeezes all the attributes of these files into 32 bits so they can be
stored in the "priv" field of struct kernfs_node.

Currently, only three monitor events are defined by enum resctrl_event_id so
reducing it from 8 bits to 7 bits still provides more than enough space to
represent all the known event types.

But note that this choice was arbitrary.  The "rid" field is also far wider
than needed for the current number of resource id types.  This structure is
purely internal to resctrl, no ABI issues with modifying it. Subsequent changes
may rearrange the allocation of bits between each of the fields as needed.

Give the bit to a new "sum" field that indicates that reading this file must
sum across SNC nodes. This bit also indicates that the domid field is the id of
an L3 cache (instead of a domain id) to find which domains must be summed.

Fix up other issues in the kerneldoc description for mon_data_bits.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-13-tony.luck@intel.com
2024-07-02 19:49:54 +02:00
Tony Luck
603cf1e288 x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function
In Sub-NUMA Cluster (SNC) mode Linux must create the monitor
files in the original "mon_L3_XX" directories and also in each
of the "mon_sub_L3_YY" directories.

Refactor mkdir_mondata_subdir() to move the creation of monitoring files
into a helper function to avoid the need to duplicate code later.

No functional change.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-12-tony.luck@intel.com
2024-07-02 19:49:54 +02:00
Tony Luck
587edd7069 x86/resctrl: Initialize on-stack struct rmid_read instances
New semantics rely on some struct rmid_read members having NULL values to
distinguish between the SNC and non-SNC scenarios.  resctrl can thus no longer
rely on this struct not being initialized properly.

Initialize all on-stack declarations of struct rmid_read:

  rdtgroup_mondata_show()
  mbm_update()
  mkdir_mondata_subdir()

to ensure that garbage values from the stack are not passed down to other
functions.

  [ bp: Massage commit message. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-11-tony.luck@intel.com
2024-07-02 19:49:54 +02:00
Tony Luck
fb1f51f677 x86/resctrl: Add a new field to struct rmid_read for summation of domains
When a user reads a monitor file rdtgroup_mondata_show() calls mon_event_read()
to package up all the required details into an rmid_read structure which is
passed across the smp_call*() infrastructure to code that will read data from
hardware and return the value (or error status) in the rmid_read structure.

Sub-NUMA Cluster (SNC) mode adds files with new semantics. These require the
smp_call-ed code to sum event data from all domains that share an L3 cache.

Add a pointer to the L3 "cacheinfo" structure to struct rmid_read for the data
collection routines to use to pick the domains to be summed.

  [ Reinette: the rmid_read structure has become complex enough so document each
    of its fields and provide the kerneldoc documentation for struct rmid_read. ]

Co-developed-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-10-tony.luck@intel.com
2024-07-02 19:49:54 +02:00
Tony Luck
328ea68874 x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files
When SNC is enabled, monitoring data is collected at the SNC node granularity,
but must be reported at L3-cache granularity for backwards compatibility in
addition to reporting at the node level.

Add a "ci" field to the rdt_mon_domain structure to save the cache information
about the enclosing L3 cache for the domain.  This provides:

1) The cache id which is needed to compose the name of the legacy monitoring
   directory, and to determine which domains should be summed to provide
   L3-scoped data.

2) The shared_cpu_map which is needed to determine which CPUs can be used to
   read the RMID counters with the MSR interface.

This is the first step to an eventual goal of monitor reporting files like this
(for a system with two SNC nodes per L3):

  $ cd /sys/fs/resctrl/mon_data
  $ tree mon_L3_00
  mon_L3_00			<- 00 here is L3 cache id
  ├── llc_occupancy		\  These files provide legacy support
  ├── mbm_local_bytes		 > for non-SNC aware monitor apps
  ├── mbm_total_bytes		/  that expect data at L3 cache level
  ├── mon_sub_L3_00		<- 00 here is SNC node id
  │   ├── llc_occupancy		\  These files are finer grained
  │   ├── mbm_local_bytes		 > data from each SNC node
  │   └── mbm_total_bytes		/
  └── mon_sub_L3_01
      ├── llc_occupancy		\
      ├── mbm_local_bytes		 > As above, but for node 1.
      └── mbm_total_bytes		/

  [ bp: Massage commit message. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-9-tony.luck@intel.com
2024-07-02 19:49:54 +02:00
Tony Luck
ac20aa4230 x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems
When SNC is enabled there is a mismatch between the MBA control function
which operates at L3 cache scope and the MBM monitor functions which
measure memory bandwidth on each SNC node.

Block use of the mba_MBps when scopes for MBA/MBM do not match.

Improve user diagnostics by adding invalfc() message when mba_MBps
is not supported.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-8-tony.luck@intel.com
2024-07-02 19:49:54 +02:00
Tony Luck
e13db55b5a x86/resctrl: Introduce snc_nodes_per_l3_cache
Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This cost may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket the lower number half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

RMID sharing mode divides the physical RMIDs evenly between SNC nodes
but uses a logical RMID in the IA32_PQR_ASSOC MSR. For example a system
with 200 physical RMIDs (as enumerated by CPUID leaf 0xF) that has two
SNC nodes per L3 cache instance would have 100 logical RMIDs available
for Linux to use. A task running on SNC node 0 with RMID 5 would
accumulate LLC occupancy and MBM bandwidth data in physical RMID 5.
Another task using RMID 5, but running on SNC node 1 would accumulate
data in physical RMID 105.

Even with this renumbering SNC mode requires several changes in resctrl
behavior for correct operation.

Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate
how many SNC domains share an L3 cache instance.  Initialize this to
"1". Runtime detection of SNC mode will adjust this value.

Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
   number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
   nodes.
3) Add a function to convert from logical RMID values (assigned to
   tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
   to physical RMID values to load into IA32_QM_EVTSEL MSR when
   reading counters on each SNC node.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-7-tony.luck@intel.com
2024-07-02 19:49:54 +02:00
Tony Luck
1a171608ee x86/resctrl: Add node-scope to the options for feature scope
Currently supported resctrl features are all domain scoped the same as the
scope of the L2 or L3 caches.

Add RESCTRL_L3_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are divided between
nodes that share an L3 cache.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-6-tony.luck@intel.com
2024-07-02 19:49:54 +02:00
Tony Luck
cae2bcb6a2 x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
The same rdt_domain structure is used for both control and monitor
functions. But this results in wasted memory as some of the fields are
only used by control functions, while most are only used for monitor
functions.

Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.

Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-5-tony.luck@intel.com
2024-07-02 19:49:54 +02:00
Tony Luck
cd84f72b6a x86/resctrl: Prepare for different scope for control/monitor operations
Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scope (specifically Intel needs
to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
and NODE scope for cache occupancy and memory bandwidth monitoring).

Create separate domain lists for control and monitor operations.

Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now the domains are
allocated independently it is no longer required to disable both control
and monitor operations if either fail.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-4-tony.luck@intel.com
2024-07-02 19:49:53 +02:00
Tony Luck
c103d4d48e x86/resctrl: Prepare to split rdt_domain structure
The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.

To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-3-tony.luck@intel.com
2024-07-02 19:49:53 +02:00
Tony Luck
f436cb6913 x86/resctrl: Prepare for new domain scope
Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.

Clean up the error handling when looking up domains. Report invalid ids
before calling rdt_find_domain() in preparation for better messages when
scope can be other than cache scope. This means that rdt_find_domain()
will never return an error. So remove checks for error from the call sites.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-2-tony.luck@intel.com
2024-07-02 19:49:53 +02:00
Ard Biesheuvel
37aee82c21 x86/efi: Drop support for fake EFI memory maps
Between kexec and confidential VM support, handling the EFI memory maps
correctly on x86 is already proving to be rather difficult (as opposed
to other EFI architectures which manage to never modify the EFI memory
map to begin with)

EFI fake memory map support is essentially a development hack (for
testing new support for the 'special purpose' and 'more reliable' EFI
memory attributes) that leaked into production code. The regions marked
in this manner are not actually recognized as such by the firmware
itself or the EFI stub (and never have), and marking memory as 'more
reliable' seems rather futile if the underlying memory is just ordinary
RAM.

Marking memory as 'special purpose' in this way is also dubious, but may
be in use in production code nonetheless. However, the same should be
achievable by using the memmap= command line option with the ! operator.

EFI fake memmap support is not enabled by any of the major distros
(Debian, Fedora, SUSE, Ubuntu) and does not exist on other
architectures, so let's drop support for it.

Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2024-07-02 00:26:24 +02:00
Borislav Petkov (AMD)
0d3db1f14a x86/alternatives, kvm: Fix a couple of CALLs without a frame pointer
objtool complains:

  arch/x86/kvm/kvm.o: warning: objtool: .altinstr_replacement+0xc5: call without frame pointer save/setup
  vmlinux.o: warning: objtool: .altinstr_replacement+0x2eb: call without frame pointer save/setup

Make sure %rSP is an output operand to the respective asm() statements.

The test_cc() hunk and ALT_OUTPUT_SP() courtesy of peterz. Also from him
add some helpful debugging info to the documentation.

Now on to the explanations:

tl;dr: The alternatives macros are pretty fragile.

If I do ALT_OUTPUT_SP(output) in order to be able to package in a %rsp
reference for objtool so that a stack frame gets properly generated, the
inline asm input operand with positional argument 0 in clear_page():

	"0" (page)

gets "renumbered" due to the added

	: "+r" (current_stack_pointer), "=D" (page)

and then gcc says:

  ./arch/x86/include/asm/page_64.h:53:9: error: inconsistent operand constraints in an ‘asm’

The fix is to use an explicit "D" constraint which points to a singleton
register class (gcc terminology) which ends up doing what is expected
here: the page pointer - input and output - should be in the same %rdi
register.

Other register classes have more than one register in them - example:
"r" and "=r" or "A":

  ‘A’
	The ‘a’ and ‘d’ registers.  This class is used for
	instructions that return double word results in the ‘ax:dx’
	register pair.  Single word values will be allocated either in
	‘ax’ or ‘dx’.

so using "D" and "=D" just works in this particular case.

And yes, one would say, sure, why don't you do "+D" but then:

  : "+r" (current_stack_pointer), "+D" (page)
  : [old] "i" (clear_page_orig), [new1] "i" (clear_page_rep), [new2] "i" (clear_page_erms),
  : "cc", "memory", "rax", "rcx")

now find the Waldo^Wcomma which throws a wrench into all this.

Because that silly macro has an "input..." consume-all last macro arg
and in it, one is supposed to supply input *and* clobbers, leading to
silly syntax snafus.

Yap, they need to be cleaned up, one fine day...

Closes: https://lore.kernel.org/oe-kbuild-all/202406141648.jO9qNGLa-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240625112056.GDZnqoGDXgYuWBDUwu@fat_crate.local
2024-07-01 12:41:11 +02:00
Andrew Cooper
34b3fc558b x86/cpu/intel: Drop stray FAM6 check with new Intel CPU model defines
The outer if () should have been dropped when switching to c->x86_vfm.

Fixes: 6568fc18c2 ("x86/cpu/intel: Switch to new Intel CPU model defines")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240529183605.17520-1-andrew.cooper3@citrix.com
2024-06-29 16:10:37 +02:00
Linus Torvalds
093d9603b6 x86: stop playing stack games in profile_pc()
The 'profile_pc()' function is used for timer-based profiling, which
isn't really all that relevant any more to begin with, but it also ends
up making assumptions based on the stack layout that aren't necessarily
valid.

Basically, the code tries to account the time spent in spinlocks to the
caller rather than the spinlock, and while I support that as a concept,
it's not worth the code complexity or the KASAN warnings when no serious
profiling is done using timers anyway these days.

And the code really does depend on stack layout that is only true in the
simplest of cases.  We've lost the comment at some point (I think when
the 32-bit and 64-bit code was unified), but it used to say:

	Assume the lock function has either no stack frame or a copy
	of eflags from PUSHF.

which explains why it just blindly loads a word or two straight off the
stack pointer and then takes a minimal look at the values to just check
if they might be eflags or the return pc:

	Eflags always has bits 22 and up cleared unlike kernel addresses

but that basic stack layout assumption assumes that there isn't any lock
debugging etc going on that would complicate the code and cause a stack
frame.

It causes KASAN unhappiness reported for years by syzkaller [1] and
others [2].

With no real practical reason for this any more, just remove the code.

Just for historical interest, here's some background commits relating to
this code from 2006:

  0cb91a2293 ("i386: Account spinlocks to the caller during profiling for !FP kernels")
  31679f38d8 ("Simplify profile_pc on x86-64")

and a code unification from 2009:

  ef4512882d ("x86: time_32/64.c unify profile_pc")

but the basics of this thing actually goes back to before the git tree.

Link: https://syzkaller.appspot.com/bug?extid=84fe685c02cd112a2ac3 [1]
Link: https://lore.kernel.org/all/CAK55_s7Xyq=nh97=K=G1sxueOFrJDAvPOJAL4TPTCAYvmxO9_A@mail.gmail.com/ [2]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-06-28 14:27:22 -07:00
Josh Poimboeuf
42c141fbb6 x86/bugs: Add 'spectre_bhi=vmexit' cmdline option
In cloud environments it can be useful to *only* enable the vmexit
mitigation and leave syscalls vulnerable.  Add that as an option.

This is similar to the old spectre_bhi=auto option which was removed
with the following commit:

  36d4fe147c ("x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto")

with the main difference being that this has a more descriptive name and
is disabled by default.

Mitigation switch requested by Maksim Davydov <davydov-max@yandex-team.ru>.

  [ bp: Massage. ]

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/2cbad706a6d5e1da2829e5e123d8d5c80330148c.1719381528.git.jpoimboe@kernel.org
2024-06-28 15:35:54 +02:00
Rafael J. Wysocki
b11ec63abe Merge back cpufreq material for v6.11. 2024-06-27 21:20:10 +02:00
Alexey Makhalov
57b7b6acb4 x86/vmware: Add TDX hypercall support
VMware hypercalls use I/O port, VMCALL or VMMCALL instructions.  Add a call to
__tdx_hypercall() in order to support TDX guests.

No change in high bandwidth hypercalls, as only low bandwidth ones are supported
for TDX guests.

  [ bp: Massage, clear on-stack struct tdx_module_args variable. ]

Co-developed-by: Tim Merrifield <tim.merrifield@broadcom.com>
Signed-off-by: Tim Merrifield <tim.merrifield@broadcom.com>
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-9-alexey.makhalov@broadcom.com
2024-06-25 17:15:48 +02:00
Alexey Makhalov
86cb65448d x86/vmware: Correct macro names
VCPU_RESERVED and LEGACY_X2APIC are not VMware hypercall commands.  These are
bits in the return value of the VMWARE_CMD_GETVCPU_INFO command.  Change
VMWARE_CMD_ prefix to GETVCPU_INFO_ one. And move the bit-shift
operation into the macro body.

Fixes: 4cca6ea04d ("x86/apic: Allow x2apic without IR on VMware platform")
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-7-alexey.makhalov@broadcom.com
2024-06-25 17:15:48 +02:00
Alexey Makhalov
b2c13c23ea x86/vmware: Use VMware hypercall API
Remove VMWARE_CMD macro and move to vmware_hypercall API.
No functional changes intended.

Use u32/u64 instead of uint32_t/uint64_t across the file.

Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-6-alexey.makhalov@broadcom.com
2024-06-25 17:15:47 +02:00
Alexey Makhalov
34bf25e820 x86/vmware: Introduce VMware hypercall API
Introduce a vmware_hypercall family of functions. It is a common implementation
to be used by the VMware guest code and virtual device drivers in architecture
independent manner.

The API consists of vmware_hypercallX and vmware_hypercall_hb_{out,in}
set of functions analogous to KVM's hypercall API. Architecture-specific
implementation is hidden inside.

It will simplify future enhancements in VMware hypercalls such as SEV-ES and
TDX related changes without needs to modify a caller in device drivers code.

Current implementation extends an idea from

  bac7b4e843 ("x86/vmware: Update platform detection code for VMCALL/VMMCALL hypercalls")

to have a slow, but safe path vmware_hypercall_slow() earlier during the boot
when alternatives are not yet applied.  The code inherits VMWARE_CMD logic from
the commit mentioned above.

Move common macros from vmware.c to vmware.h.

  [ bp: Fold in a fix:
    https://lore.kernel.org/r/20240625083348.2299-1-alexey.makhalov@broadcom.com ]

Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-2-alexey.makhalov@broadcom.com
2024-06-25 17:01:33 +02:00
Ilpo Järvinen
ec0b4c4d45 x86/of: Return consistent error type from x86_of_pci_irq_enable()
x86_of_pci_irq_enable() returns PCIBIOS_* code received from
pci_read_config_byte() directly and also -EINVAL which are not
compatible error types. x86_of_pci_irq_enable() is used as
(*pcibios_enable_irq) function which should not return PCIBIOS_* codes.

Convert the PCIBIOS_* return code from pci_read_config_byte() into
normal errno using pcibios_err_to_errno().

Fixes: 96e0a0797e ("x86: dtb: Add support for PCI devices backed by dtb nodes")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240527125538.13620-1-ilpo.jarvinen@linux.intel.com
2024-06-24 19:11:31 +02:00
Dexuan Cui
7f828d5fff clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor
In a TDX VM without paravisor, currently the default timer is the Hyper-V
timer, which depends on the slow VM Reference Counter MSR: the Hyper-V TSC
page is not enabled in such a VM because the VM uses Invariant TSC as a
better clocksource and it's challenging to mark the Hyper-V TSC page shared
in very early boot.

Lower the rating of the Hyper-V timer so the local APIC timer becomes the
the default timer in such a VM, and print a warning in case Invariant TSC
is unavailable in such a VM. This change should cause no perceivable
performance difference.

Cc: stable@vger.kernel.org # 6.6+
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20240621061614.8339-1-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20240621061614.8339-1-decui@microsoft.com>
2024-06-24 07:06:29 +00:00
Borislav Petkov (AMD)
78ce84b9e0 x86/cpufeatures: Flip the /proc/cpuinfo appearance logic
I'm getting tired of telling people to put a magic "" in the

  #define X86_FEATURE		/* "" ... */

comment to hide the new feature flag from the user-visible
/proc/cpuinfo.

Flip the logic to make it explicit: an explicit "<name>" in the comment
adds the flag to /proc/cpuinfo and otherwise not, by default.

Add the "<name>" of all the existing flags to keep backwards
compatibility with userspace.

There should be no functional changes resulting from this.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240618113840.24163-1-bp@kernel.org
2024-06-20 21:04:22 +02:00
Kees Cook
d6f635bcac x86/alternatives: Make FineIBT mode Kconfig selectable
Since FineIBT performs checking at the destination, it is weaker against
attacks that can construct arbitrary executable memory contents. As such,
some system builders want to run with FineIBT disabled by default. Allow
the "cfi=kcfi" boot param mode to be selectable through Kconfig via the
newly introduced CONFIG_CFI_AUTO_DEFAULT.

Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240501000218.work.998-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2024-06-19 12:41:08 -07:00
Linus Torvalds
e3c92e8171 runtime constants: add x86 architecture support
This implements the runtime constant infrastructure for x86, allowing
the dcache d_hash() function to be generated using as a constant for
hash table address followed by shift by a constant of the hash index.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-06-19 12:34:34 -07:00
Dave Martin
739c976579 x86/resctrl: Don't try to free nonexistent RMIDs
Commit

  6791e0ea30 ("x86/resctrl: Access per-rmid structures by index")

adds logic to map individual monitoring groups into a global index space used
for tracking allocated RMIDs.

Attempts to free the default RMID are ignored in free_rmid(), and this works
fine on x86.

With arm64 MPAM, there is a latent bug here however: on platforms with no
monitors exposed through resctrl, each control group still gets a different
monitoring group ID as seen by the hardware, since the CLOSID always forms part
of the monitoring group ID.

This means that when removing a control group, the code may try to free this
group's default monitoring group RMID for real.  If there are no monitors
however, the RMID tracking table rmid_ptrs[] would be a waste of memory and is
never allocated, leading to a splat when free_rmid() tries to dereference the
table.

One option would be to treat RMID 0 as special for every CLOSID, but this would
be ugly since bookkeeping still needs to be done for these monitoring group IDs
when there are monitors present in the hardware.

Instead, add a gating check of resctrl_arch_mon_capable() in free_rmid(), and
just do nothing if the hardware doesn't have monitors.

This fix mirrors the gating checks already present in
mkdir_rdt_prepare_rmid_alloc() and elsewhere.

No functional change on x86.

  [ bp: Massage commit message. ]

Fixes: 6791e0ea30 ("x86/resctrl: Access per-rmid structures by index")
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240618140152.83154-1-Dave.Martin@arm.com
2024-06-19 11:39:09 +02:00
Jani Nikula
d754ed2821 Merge drm/drm-next into drm-intel-next
Sync to v6.10-rc3.

Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2024-06-19 11:38:31 +03:00
Tom Lendacky
99ef9f5984 x86/sev: Allow non-VMPL0 execution when an SVSM is present
To allow execution at a level other than VMPL0, an SVSM must be present.
Allow the SEV-SNP guest to continue booting if an SVSM is detected and
the hypervisor supports the SVSM feature as indicated in the GHCB
hypervisor features bitmap.

  [ bp: Massage a bit. ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/2ce7cf281cce1d0cba88f3f576687ef75dc3c953.1717600736.git.thomas.lendacky@amd.com
2024-06-17 20:42:58 +02:00
Tom Lendacky
627dc67151 x86/sev: Extend the config-fs attestation support for an SVSM
When an SVSM is present, the guest can also request attestation reports
from it. These SVSM attestation reports can be used to attest the SVSM
and any services running within the SVSM.

Extend the config-fs attestation support to provide such. This involves
creating four new config-fs attributes:

  - 'service-provider' (input)
    This attribute is used to determine whether the attestation request
    should be sent to the specified service provider or to the SEV
    firmware. The SVSM service provider is represented by the value
    'svsm'.

  - 'service_guid' (input)
    Used for requesting the attestation of a single service within the
    service provider. A null GUID implies that the SVSM_ATTEST_SERVICES
    call should be used to request the attestation report. A non-null
    GUID implies that the SVSM_ATTEST_SINGLE_SERVICE call should be used.

  - 'service_manifest_version' (input)
    Used with the SVSM_ATTEST_SINGLE_SERVICE call, the service version
    represents a specific service manifest version be used for the
    attestation report.

  - 'manifestblob' (output)
    Used to return the service manifest associated with the attestation
    report.

Only display these new attributes when running under an SVSM.

  [ bp: Massage.
   - s/svsm_attestation_call/svsm_attest_call/g ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/965015dce3c76bb8724839d50c5dea4e4b5d598f.1717600736.git.thomas.lendacky@amd.com
2024-06-17 20:42:57 +02:00
Tom Lendacky
eb65f96cb3 virt: sev-guest: Choose the VMPCK key based on executing VMPL
Currently, the sev-guest driver uses the vmpck-0 key by default. When an
SVSM is present, the kernel is running at a VMPL other than 0 and the
vmpck-0 key is no longer available. If a specific vmpck key has not be
requested by the user via the vmpck_id module parameter, choose the
vmpck key based on the active VMPL level.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/b88081c5d88263176849df8ea93e90a404619cab.1717600736.git.thomas.lendacky@amd.com
2024-06-17 20:42:57 +02:00
Tom Lendacky
61564d3468 x86/sev: Provide guest VMPL level to userspace
Requesting an attestation report from userspace involves providing the VMPL
level for the report. Currently any value from 0-3 is valid because Linux
enforces running at VMPL0.

When an SVSM is present, though, Linux will not be running at VMPL0 and only
VMPL values starting at the VMPL level Linux is running at to 3 are valid. In
order to allow userspace to determine the minimum VMPL value that can be
supplied to an attestation report, create a sysfs entry that can be used to
retrieve the current VMPL level of the kernel.

  [ bp: Add CONFIG_SYSFS ifdeffery. ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/fff846da0d8d561f9fdaf297dcf8cd907545a25b.1717600736.git.thomas.lendacky@amd.com
2024-06-17 20:42:57 +02:00
Tom Lendacky
1beb348d5c x86/sev: Provide SVSM discovery support
The SVSM specification documents an alternative method of discovery for
the SVSM using a reserved CPUID bit and a reserved MSR. This is intended
for guest components that do not have access to the secrets page in
order to be able to call the SVSM (e.g. UEFI runtime services).

For the MSR support, a new reserved MSR 0xc001f000 has been defined. A #VC
should be generated when accessing this MSR. The #VC handler is expected
to ignore writes to this MSR and return the physical calling area address
(CAA) on reads of this MSR.

While the CPUID leaf is updated, allowing the creation of a CPU feature,
the code will continue to use the VMPL level as an indication of the
presence of an SVSM. This is because the SVSM can be called well before
the CPU feature is in place and a non-zero VMPL requires that an SVSM be
present.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/4f93f10a2ff3e9f368fd64a5920d51bf38d0c19e.1717600736.git.thomas.lendacky@amd.com
2024-06-17 20:42:57 +02:00
Tom Lendacky
d2b2931f19 x86/sev: Use the SVSM to create a vCPU when not in VMPL0
Using the RMPADJUST instruction, the VMSA attribute can only be changed
at VMPL0. An SVSM will be present when running at VMPL1 or a lower
privilege level.

In that case, use the SVSM_CORE_CREATE_VCPU call or the
SVSM_CORE_DESTROY_VCPU call to perform VMSA attribute changes. Use the
VMPL level supplied by the SVSM for the VMSA when starting the AP.

  [ bp: Fix typo + touchups. ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/bcdd95ecabe9723673b9693c7f1533a2b8f17781.1717600736.git.thomas.lendacky@amd.com
2024-06-17 20:42:56 +02:00
Tom Lendacky
fcd042e864 x86/sev: Perform PVALIDATE using the SVSM when not at VMPL0
The PVALIDATE instruction can only be performed at VMPL0. If an SVSM is
present, it will be running at VMPL0 while the guest itself is then
running at VMPL1 or a lower privilege level.

In that case, use the SVSM_CORE_PVALIDATE call to perform memory
validation instead of issuing the PVALIDATE instruction directly.

The validation of a single 4K page is now explicitly identified as such
in the function name, pvalidate_4k_page(). The pvalidate_pages()
function is used for validating 1 or more pages at either 4K or 2M in
size. Each function, however, determines whether it can issue the
PVALIDATE directly or whether the SVSM needs to be invoked.

  [ bp: Touchups. ]
  [ Tom: fold in a fix for Coconut SVSM:
    https://lore.kernel.org/r/234bb23c-d295-76e5-a690-7ea68dc1118b@amd.com  ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/4c4017d8b94512d565de9ccb555b1a9f8983c69c.1717600736.git.thomas.lendacky@amd.com
2024-06-17 20:37:54 +02:00
Kirill A. Shutemov
1ceebe2e46 x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
MADT Multiprocessor Wakeup structure version 1 brings support for CPU offlining:
BIOS provides a reset vector where the CPU has to jump to for offlining itself.
The new TEST mailbox command can be used to test whether the CPU offlined itself
which means the BIOS has control over the CPU and can online it again via the
ACPI MADT wakeup method.

Add CPU offlining support for the ACPI MADT wakeup method by implementing custom
cpu_die(), play_dead() and stop_this_cpu() SMP operations.

CPU offlining makes it possible to hand over secondary CPUs over kexec, not
limiting the second kernel to a single CPU.

The change conforms to the approved ACPI spec change proposal. See the Link.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher
Link: https://lore.kernel.org/r/20240614095904.1345461-19-kirill.shutemov@linux.intel.com
2024-06-17 17:46:25 +02:00
Kirill A. Shutemov
26ba7353ca x86/smp: Add smp_ops.stop_this_cpu() callback
If the helper is defined, it is called instead of halt() to stop the CPU at the
end of stop_this_cpu() and on crash CPU shutdown.

ACPI MADT will use it to hand over the CPU to BIOS in order to be able to wake
it up again after kexec.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-17-kirill.shutemov@linux.intel.com
2024-06-17 17:46:20 +02:00
Kirill A. Shutemov
db0936830a x86/acpi: Do not attempt to bring up secondary CPUs in the kexec case
ACPI MADT doesn't allow to offline a CPU after it was onlined. This limits
kexec: the second kernel won't be able to use more than one CPU.

To prevent a kexec kernel from onlining secondary CPUs, invalidate the mailbox
address in the ACPI MADT wakeup structure which prevents a kexec kernel to use
it.

This is safe as the booting kernel has the mailbox address cached already and
acpi_wakeup_cpu() uses the cached value to bring up the secondary CPUs.

Note: This is a Linux specific convention and not covered by the ACPI
specification.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-16-kirill.shutemov@linux.intel.com
2024-06-17 17:46:17 +02:00
Kirill A. Shutemov
6630cbce7c x86/acpi: Rename fields in the acpi_madt_multiproc_wakeup structure
In order to support MADT wakeup structure version 1, provide more appropriate
names for the fields in the structure.

Rename 'mailbox_version' to 'version'. This field signifies the version of the
structure and the related protocols, rather than the version of the mailbox.
This field has not been utilized in the code thus far.

Rename 'base_address' to 'mailbox_address' to clarify the kind of address it
represents. In version 1, the structure includes the reset vector address. Clear
and distinct naming helps to prevent any confusion.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-15-kirill.shutemov@linux.intel.com
2024-06-17 17:46:15 +02:00
Kirill A. Shutemov
06fa48d85b x86/mm: Make e820__end_ram_pfn() cover E820_TYPE_ACPI ranges
e820__end_of_ram_pfn() is used to calculate max_pfn which, among other things,
guides where direct mapping ends. Any memory above max_pfn is not going to be
present in the direct mapping.

e820__end_of_ram_pfn() finds the end of the RAM based on the highest
E820_TYPE_RAM range. But it doesn't includes E820_TYPE_ACPI ranges into
calculation.

Despite the name, E820_TYPE_ACPI covers not only ACPI data, but also EFI tables
and might be required by kernel to function properly.

Usually the problem is hidden because there is some E820_TYPE_RAM memory above
E820_TYPE_ACPI. But crashkernel only presents pre-allocated crash memory as
E820_TYPE_RAM on boot. If the pre-allocated range is small, it can fit under the
last E820_TYPE_ACPI range.

Modify e820__end_of_ram_pfn() and e820__end_of_low_ram_pfn() to cover
E820_TYPE_ACPI memory.

The problem was discovered during debugging kexec for TDX guest. TDX guest uses
E820_TYPE_ACPI to store the unaccepted memory bitmap and pass it between the
kernels on kexec.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-13-kirill.shutemov@linux.intel.com
2024-06-17 17:46:08 +02:00
Kirill A. Shutemov
22daa42294 x86/mm: Add callbacks to prepare encrypted memory for kexec
AMD SEV and Intel TDX guests allocate shared buffers for performing I/O.
This is done by allocating pages normally from the buddy allocator and
then converting them to shared using set_memory_decrypted().

On kexec, the second kernel is unaware of which memory has been
converted in this manner. It only sees E820_TYPE_RAM. Accessing shared
memory as private is fatal.

Therefore, the memory state must be reset to its original state before
starting the new kernel with kexec.

The process of converting shared memory back to private occurs in two
steps:

- enc_kexec_begin() stops new conversions.

- enc_kexec_finish() unshares all existing shared memory, reverting it
  back to private.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-11-kirill.shutemov@linux.intel.com
2024-06-17 17:46:02 +02:00
Kirill A. Shutemov
99c5c4c60e x86/mm: Make x86_platform.guest.enc_status_change_*() return an error
TDX is going to have more than one reason to fail enc_status_change_prepare().

Change the callback to return errno instead of assuming -EIO. Change
enc_status_change_finish() too to keep the interface symmetric.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-8-kirill.shutemov@linux.intel.com
2024-06-17 17:45:53 +02:00
Kirill A. Shutemov
de60613173 x86/kexec: Keep CR4.MCE set during kexec for TDX guest
TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
that bit is cleared during CR4 register reprogramming during boot or kexec
flows, a #VE exception will be raised which the guest kernel cannot handle.

Therefore, make sure the CR4.MCE setting is preserved over kexec too and avoid
raising any #VEs.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240614095904.1345461-7-kirill.shutemov@linux.intel.com
2024-06-17 17:45:50 +02:00
Borislav Petkov
7b46a8997d x86/relocate_kernel: Use named labels for less confusion
That identity_mapped() function was loving that "1" label to the point of
completely confusing its readers.

Use named labels in each place for clarity.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-6-kirill.shutemov@linux.intel.com
2024-06-17 17:45:46 +02:00
Kirill A. Shutemov
66e48e491d cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup
ACPI MADT doesn't allow to offline a CPU after it has been woken up.

Currently, CPU hotplug is prevented based on the confidential computing
attribute which is set for Intel TDX. But TDX is not the only possible user of
the wake up method. Any platform that uses ACPI MADT wakeup method cannot
offline CPU.

Disable CPU offlining on ACPI MADT wakeup enumeration.

This has no visible effects for users: currently, TDX guest is the only platform
that uses the ACPI MADT wakeup method.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-5-kirill.shutemov@linux.intel.com
2024-06-17 17:45:34 +02:00
Kirill A. Shutemov
24dd05da8c x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init
acpi_mp_wake_mailbox_paddr and acpi_mp_wake_mailbox are initialized once during
ACPI MADT init and never changed.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-3-kirill.shutemov@linux.intel.com
2024-06-17 17:45:28 +02:00
Kirill A. Shutemov
2b5e22afae x86/acpi: Extract ACPI MADT wakeup code into a separate file
In order to prepare for the expansion of support for the ACPI MADT
wakeup method, move the relevant code into a separate file.

Introduce a new configuration option to clearly indicate dependencies
without the use of ifdefs.

There have been no functional changes.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-2-kirill.shutemov@linux.intel.com
2024-06-17 17:45:25 +02:00
Nikolay Borisov
54183d103d x86/kexec: Remove spurious unconditional JMP from from identity_mapped()
This seemingly straightforward JMP was introduced in the initial version
of the the 64bit kexec code without any explanation.

It turns out (check accompanying Link) it's likely a copy/paste artefact
from 32-bit code, where such a JMP could be used as a serializing
instruction for the 486's prefetch queue. On x86_64 that's not needed
because there's already a preceding write to cr4 which itself is
a serializing operation.

  [ bp: Typos. Let's try this and see what cries out. If it does,
    reverting it is trivial. ]

Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/55bc0649-c017-49ab-905d-212f140a403f@citrix.com/
2024-06-17 17:45:19 +02:00
Mateusz Guzik
501bd734f9 x86/CPU/AMD: Always inline amd_clear_divider()
The routine is used on syscall exit and on non-AMD CPUs is guaranteed to
be empty.

It probably does not need to be a function call even on CPUs which do need the
mitigation.

  [ bp: Make sure it is always inlined so that noinstr marking works. ]

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613082637.659133-1-mjguzik@gmail.com
2024-06-13 14:40:29 +02:00
Yazen Ghannam
dc5243921b x86/amd_nb: Enhance SMN access error checking
AMD Zen-based systems use a System Management Network (SMN) that
provides access to implementation-specific registers.

SMN accesses are done indirectly through an index/data pair in PCI
config space. The accesses can fail for a variety of reasons.

Include code comments to describe some possible scenarios.

Require error checking for callers of amd_smn_read() and amd_smn_write().
This is needed because many error conditions cannot be checked by these
functions.

  [ bp: Touchup comment. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20240606-fix-smn-bad-read-v4-4-ffde21931c3f@amd.com
2024-06-12 11:38:58 +02:00
Jiri Olsa
ff474a78ce uprobe: Add uretprobe syscall to speed up return probe
Adding uretprobe syscall instead of trap to speed up return probe.

At the moment the uretprobe setup/path is:

  - install entry uprobe

  - when the uprobe is hit, it overwrites probed function's return address
    on stack with address of the trampoline that contains breakpoint
    instruction

  - the breakpoint trap code handles the uretprobe consumers execution and
    jumps back to original return address

This patch replaces the above trampoline's breakpoint instruction with new
ureprobe syscall call. This syscall does exactly the same job as the trap
with some more extra work:

  - syscall trampoline must save original value for rax/r11/rcx registers
    on stack - rax is set to syscall number and r11/rcx are changed and
    used by syscall instruction

  - the syscall code reads the original values of those registers and
    restore those values in task's pt_regs area

  - only caller from trampoline exposed in '[uprobes]' is allowed,
    the process will receive SIGILL signal otherwise

Even with some extra work, using the uretprobes syscall shows speed
improvement (compared to using standard breakpoint):

  On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)

  current:
    uretprobe-nop  :    1.498 ± 0.000M/s
    uretprobe-push :    1.448 ± 0.001M/s
    uretprobe-ret  :    0.816 ± 0.001M/s

  with the fix:
    uretprobe-nop  :    1.969 ± 0.002M/s  < 31% speed up
    uretprobe-push :    1.910 ± 0.000M/s  < 31% speed up
    uretprobe-ret  :    0.934 ± 0.000M/s  < 14% speed up

  On Amd (AMD Ryzen 7 5700U)

  current:
    uretprobe-nop  :    0.778 ± 0.001M/s
    uretprobe-push :    0.744 ± 0.001M/s
    uretprobe-ret  :    0.540 ± 0.001M/s

  with the fix:
    uretprobe-nop  :    0.860 ± 0.001M/s  < 10% speed up
    uretprobe-push :    0.818 ± 0.001M/s  < 10% speed up
    uretprobe-ret  :    0.578 ± 0.000M/s  <  7% speed up

The performance test spawns a thread that runs loop which triggers
uprobe with attached bpf program that increments the counter that
gets printed in results above.

The uprobe (and uretprobe) kind is determined by which instruction
is being patched with breakpoint instruction. That's also important
for uretprobes, because uprobe is installed for each uretprobe.

The performance test is part of bpf selftests:
  tools/testing/selftests/bpf/run_bench_uprobes.sh

Note at the moment uretprobe syscall is supported only for native
64-bit process, compat process still uses standard breakpoint.

Note that when shadow stack is enabled the uretprobe syscall returns
via iret, which is slower than return via sysret, but won't cause the
shadow stack violation.

Link: https://lore.kernel.org/all/20240611112158.40795-4-jolsa@kernel.org/

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-06-12 08:44:28 +09:00
Jiri Olsa
1713b63a07 x86/shstk: Make return uprobe work with shadow stack
Currently the application with enabled shadow stack will crash
if it sets up return uprobe. The reason is the uretprobe kernel
code changes the user space task's stack, but does not update
shadow stack accordingly.

Adding new functions to update values on shadow stack and using
them in uprobe code to keep shadow stack in sync with uretprobe
changes to user stack.

Link: https://lore.kernel.org/all/20240611112158.40795-2-jolsa@kernel.org/

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Fixes: 488af8ea71 ("x86/shstk: Wire in shadow stack interface")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-06-12 08:44:27 +09:00
Perry Yuan
c7107750b2 x86/cpufeatures: Add AMD FAST CPPC feature flag
Some AMD Zen 4 processors support a new feature FAST CPPC which
allows for a faster CPPC loop due to internal architectural
enhancements. The goal of this faster loop is higher performance
at the same power consumption.

Reference:
See the page 99 of PPR for AMD Family 19h Model 61h rev.B1, docID 56713

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Xiaojian Du <Xiaojian.Du@amd.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-06-11 16:12:12 -05:00
Borislav Petkov (AMD)
93694129c6 x86/alternative: Convert ALTERNATIVE_3()
Zap the hack of using an ALTERNATIVE_3() internal label, as suggested by
bgerst:

  https://lore.kernel.org/r/CAMzpN2i4oJ-Dv0qO46Fd-DxNv5z9=x%2BvO%2B8g=47NiiAf8QEJYA@mail.gmail.com

in favor of a label local to this macro only, as it should be done.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240607111701.8366-11-bp@kernel.org
2024-06-11 18:22:41 +02:00
Peter Zijlstra
d2a793dae2 x86/alternatives: Add nested alternatives macros
Instead of making increasingly complicated ALTERNATIVE_n()
implementations, use a nested alternative expression.

The only difference between:

  ALTERNATIVE_2(oldinst, newinst1, flag1, newinst2, flag2)

and

  ALTERNATIVE(ALTERNATIVE(oldinst, newinst1, flag1),
              newinst2, flag2)

is that the outer alternative can add additional padding when the inner
alternative is the shorter one, which then results in
alt_instr::instrlen being inconsistent.

However, this is easily remedied since the alt_instr entries will be
consecutive and it is trivial to compute the max(alt_instr::instrlen) at
runtime while patching.

Specifically, after this the ALTERNATIVE_2 macro, after CPP expansion
(and manual layout), looks like this:

  .macro ALTERNATIVE_2 oldinstr, newinstr1, ft_flags1, newinstr2, ft_flags2
  740:
  740: \oldinstr ;
  741: .skip -(((744f-743f)-(741b-740b)) > 0) * ((744f-743f)-(741b-740b)),0x90 ;
  742: .pushsection .altinstructions,"a" ;
  	altinstr_entry 740b,743f,\ft_flags1,742b-740b,744f-743f ;
  .popsection ;
  .pushsection .altinstr_replacement,"ax" ;
  743: \newinstr1 ;
  744: .popsection ; ;
  741: .skip -(((744f-743f)-(741b-740b)) > 0) * ((744f-743f)-(741b-740b)),0x90 ;
  742: .pushsection .altinstructions,"a" ;
  altinstr_entry 740b,743f,\ft_flags2,742b-740b,744f-743f ;
  .popsection ;
  .pushsection .altinstr_replacement,"ax" ;
  743: \newinstr2 ;
  744: .popsection ;
  .endm

The only label that is ambiguous is 740, however they all reference the
same spot, so that doesn't matter.

NOTE: obviously only @oldinstr may be an alternative; making @newinstr
an alternative would mean patching .altinstr_replacement which very
likely isn't what is intended, also the labels will be confused in that
case.

  [ bp: Debug an issue where it would match the wrong two insns and
    and consider them nested due to the same signed offsets in the
    .alternative section and use instr_va() to compare the full virtual
    addresses instead.

    - Use new labels to denote that the new, nested
    alternatives are being used when staring at preprocessed output.

    - Use the %c constraint everywhere instead of %P and document the
      difference for future reference. ]

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230628104952.GA2439977@hirez.programming.kicks-ass.net
2024-06-11 17:13:08 +02:00
Tom Lendacky
34ff659017 x86/sev: Use kernel provided SVSM Calling Areas
The SVSM Calling Area (CA) is used to communicate between Linux and the
SVSM. Since the firmware supplied CA for the BSP is likely to be in
reserved memory, switch off that CA to a kernel provided CA so that access
and use of the CA is available during boot. The CA switch is done using
the SVSM core protocol SVSM_CORE_REMAP_CA call.

An SVSM call is executed by filling out the SVSM CA and setting the proper
register state as documented by the SVSM protocol. The SVSM is invoked by
by requesting the hypervisor to run VMPL0.

Once it is safe to allocate/reserve memory, allocate a CA for each CPU.
After allocating the new CAs, the BSP will switch from the boot CA to the
per-CPU CA. The CA for an AP is identified to the SVSM when creating the
VMSA in preparation for booting the AP.

  [ bp: Heavily simplify svsm_issue_call() asm, other touchups. ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/fa8021130bcc3bcf14d722a25548cb0cdf325456.1717600736.git.thomas.lendacky@amd.com
2024-06-11 07:22:46 +02:00
Tom Lendacky
878e70dbd2 x86/sev: Check for the presence of an SVSM in the SNP secrets page
During early boot phases, check for the presence of an SVSM when running
as an SEV-SNP guest.

An SVSM is present if not running at VMPL0 and the 64-bit value at offset
0x148 into the secrets page is non-zero. If an SVSM is present, save the
SVSM Calling Area address (CAA), located at offset 0x150 into the secrets
page, and set the VMPL level of the guest, which should be non-zero, to
indicate the presence of an SVSM.

  [ bp: Touchups. ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/9d3fe161be93d4ea60f43c2a3f2c311fe708b63b.1717600736.git.thomas.lendacky@amd.com
2024-06-11 07:22:46 +02:00
Tony Luck
f385f02463 x86/resctrl: Replace open coded cacheinfo searches
pseudo_lock_region_init() and rdtgroup_cbm_to_size() open code a search for
details of a particular cache level.

Replace with get_cpu_cacheinfo_level().

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240610003927.341707-5-tony.luck@intel.com
2024-06-10 08:50:12 +02:00
Yazen Ghannam
c625dabbf1 x86/amd_nb: Check for invalid SMN reads
AMD Zen-based systems use a System Management Network (SMN) that
provides access to implementation-specific registers.

SMN accesses are done indirectly through an index/data pair in PCI
config space. The PCI config access may fail and return an error code.
This would prevent the "read" value from being updated.

However, the PCI config access may succeed, but the return value may be
invalid. This is in similar fashion to PCI bad reads, i.e. return all
bits set.

Most systems will return 0 for SMN addresses that are not accessible.
This is in line with AMD convention that unavailable registers are
Read-as-Zero/Writes-Ignored.

However, some systems will return a "PCI Error Response" instead. This
value, along with an error code of 0 from the PCI config access, will
confuse callers of the amd_smn_read() function.

Check for this condition, clear the return value, and set a proper error
code.

Fixes: ddfe43cdc0 ("x86/amd_nb: Add SMN and Indirect Data Fabric access for AMD Fam17h")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230403164244.471141-1-yazen.ghannam@amd.com
2024-06-05 21:23:34 +02:00
David Kaplan
93c1800b37 x86/kexec: Fix bug with call depth tracking
The call to cc_platform_has() triggers a fault and system crash if call depth
tracking is active because the GS segment has been reset by load_segments() and
GS_BASE is now 0 but call depth tracking uses per-CPU variables to operate.

Call cc_platform_has() earlier in the function when GS is still valid.

  [ bp: Massage. ]

Fixes: 5d8213864a ("x86/retbleed: Add SKL return thunk")
Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20240603083036.637-1-bp@kernel.org
2024-06-03 17:19:03 +02:00
Lakshmi Sowjanya D
0f532a789f x86/tsc: Remove obsolete ART to TSC conversion functions
convert_art_to_tsc() and convert_art_ns_to_tsc() interfaces are no
longer required. The conversion is now handled by the core code.

Signed-off-by: Lakshmi Sowjanya D <lakshmi.sowjanya.d@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240513103813.5666-9-lakshmi.sowjanya.d@intel.com
2024-06-03 11:18:51 +02:00
Lakshmi Sowjanya D
3a52886c8f x86/tsc: Provide ART base clock information for TSC
The core code provides a new mechanism to allow conversion between ART and
TSC. This allows to replace the x86 specific ART/TSC conversion functions.

Prepare for removal by filling in the base clock conversion information for
ART and associating the base clock to the TSC clocksource.

The existing conversion functions will be removed once the usage sites are
converted over to the new model.

[ tglx: Massaged change log ]

Co-developed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Christopher S. Hall <christopher.s.hall@intel.com>
Signed-off-by: Christopher S. Hall <christopher.s.hall@intel.com>
Signed-off-by: Lakshmi Sowjanya D <lakshmi.sowjanya.d@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240513103813.5666-3-lakshmi.sowjanya.d@intel.com
2024-06-03 11:18:50 +02:00
Linus Torvalds
a693b9c95a Miscellaneous topology parsing fixes:
- Fix topology parsing regression on older CPUs in the
    new AMD/Hygon parser
 
  - Fix boot crash on odd Intel Quark and similar CPUs that
    do not fill out cpuinfo_x86::x86_clflush_size and zero out
    cpuinfo_x86::x86_cache_alignment as a result. Provide
    32 bytes as a general fallback value.
 
  - Fix topology enumeration on certain rare CPUs where the
    BIOS locks certain CPUID leaves and the kernel unlocked
    them late, which broke with the new topology parsing code.
    Factor out this unlocking logic and move it earlier
    in the parsing sequence.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZcHdcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i5tQ/9G1ckVgGEKvDPwGcUi9Db9+2UzsWfB0og
 kUYgBJDq/sp0ZXPj/RB3M9h3YKmmsOuL4ZUJz3hrqQt1MqEx7eVNUbFuFRoE2ojx
 MimGI/L1pvBrJb9grpULrMX8aDND6hC1OQYOrUEN/yOTPxth77fGJIhcc/plSbAZ
 po1S12uOONxX1EvKlS/B0k6zYqBUWYTzkMog/YSa/TjXez9A/yJqt5dcNAyEdSrq
 EbjSF/7warhFGmiuFDC2z8rvnrwZ/qT5cOlkHkHs8JSigDchYT/gctWv2bQPCavS
 Nw/Aoue7TfxYu9F2H0PaqcA3efSNKmfcuozX0PNLswMGrBc4HoVoVdu3ldigOPhm
 lj4M0zEPkzRFuGvrBdsbm+oewzDOK+jr+QYyy0R+HU48vz0RpoVKpWfOqI9fjfQt
 9m2nuKLLd4mOEwnRLtCdfQzggksIJoV0soHH6yR+32cqqb9t82tICF5caPsdQYzE
 /zH/onXkaiz5Rn4vL7em7vcAE1RvL97b8iU435Hnta6Lboi3FxJepxGt5ZRsGCZQ
 ukV5iEAkRQRNjrvaC2QT8jNmBQ0f73UBixn0iB7CKtGReteP3gn4svHfvkhVlZVN
 Qpw2HvCm+LlpX7+U8EvzzqETNg5CYY46pE4nUNsHr+/zQEFFOER6MNW5rJDDMWAl
 QdVvI4HhS8Y=
 =ugOt
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2024-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
 "Miscellaneous topology parsing fixes:

   - Fix topology parsing regression on older CPUs in the new AMD/Hygon
     parser

   - Fix boot crash on odd Intel Quark and similar CPUs that do not fill
     out cpuinfo_x86::x86_clflush_size and zero out
     cpuinfo_x86::x86_cache_alignment as a result.

     Provide 32 bytes as a general fallback value.

   - Fix topology enumeration on certain rare CPUs where the BIOS locks
     certain CPUID leaves and the kernel unlocked them late, which broke
     with the new topology parsing code. Factor out this unlocking logic
     and move it earlier in the parsing sequence"

* tag 'x86-urgent-2024-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/topology/intel: Unlock CPUID before evaluating anything
  x86/cpu: Provide default cache line size if not enumerated
  x86/topology/amd: Evaluate SMT in CPUID leaf 0x8000001e only on family 0x17 and greater
2024-06-02 09:32:34 -07:00
Linus Torvalds
3fca58ffad Export a symbol to make life easier for instrumentation/debugging.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZcGuARHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iDJw//YwjUCBQTmzKDgahXy8I1BX4ndcIrS/FW
 eSUN/17zYac9sDe3db6Exr+PddoLYIc2vtQ3AQFtuZrYEhGoItNVIoDedwrSvDeC
 NHOUKTgI6vO/eGCINUVotvA1Rzgcl7Bq04YPGXmIzMyNCsVlbBzo/vW4OiNNHaSw
 iP0cI6D/dHcWr94uYN9vnBO1G/A0ixDhM3KiZCJwib5rw60rDeoerdScH34IRPlE
 Wfn6jFD6b6Z5fUjPvbizzD8T+MI85AIasznB9TnkJOuKlKW0pVJNU9HVqmEvV/Yd
 JTtDUekM5SNuL5PFyn0pkVq3ZYIxeY0LU7afFVFwgZ4t4VwQVeyobvjX7a2S2r3l
 alCFaFE2j/CHcUYyAmXPON8tcN98pupSnPSsv2oYKErUrEFFLEwTKdQMzNn5Jfqz
 fWAwD4h+WH+2y9HZYs0I34a2ssbcU3l5TdDFPHpNxa4Zmt0eQxN7ihelDWKECZTk
 7oH+lZYoHySG4KxL2ppMRAcHOKDB61UJnlQvGVYl6QpnrrnxmR0kwkP+OQZPQVhH
 DEgues/lGYqqyLOIZnq+2ciTjSmRQCkhfRdSC+btiMx6hXuBVhlUOW4YZoRyPUwp
 31I/XAOchcqee1Wt4+Z1dqhDDtRAzmau04xXZtq5GkgGjavpSbzAFCRCpfhCh2xh
 plMLErWFk5E=
 =tTwc
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2024-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Export a symbol to make life easier for instrumentation/debugging"

* tag 'sched-urgent-2024-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/x86: Export 'percpu arch_freq_scale'
2024-06-02 09:23:35 -07:00
Jeff Johnson
eb9d3c0bb0 x86/mce/inject: Add missing MODULE_DESCRIPTION() line
make W=1 C=1 warns:

  WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kernel/cpu/mce/mce-inject.o

Add the missing MODULE_DESCRIPTION().

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240530-md-x86-mce-inject-v1-1-2a9dc998f709@quicinc.com
2024-06-02 09:05:02 +02:00
Thomas Gleixner
0c2f6d0461 x86/topology/intel: Unlock CPUID before evaluating anything
Intel CPUs have a MSR bit to limit CPUID enumeration to leaf two. If
this bit is set by the BIOS then CPUID evaluation including topology
enumeration does not work correctly as the evaluation code does not try
to analyze any leaf greater than two.

This went unnoticed before because the original topology code just
repeated evaluation several times and managed to overwrite the initial
limited information with the correct one later. The new evaluation code
does it once and therefore ends up with the limited and wrong
information.

Cure this by unlocking CPUID right before evaluating anything which
depends on the maximum CPUID leaf being greater than two instead of
rereading stuff after unlock.

Fixes: 22d63660c3 ("x86/cpu: Use common topology code for Intel")
Reported-by: Peter Schneider <pschneider1968@googlemail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Peter Schneider <pschneider1968@googlemail.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/fd3f73dc-a86f-4bcf-9c60-43556a21eb42@googlemail.com
2024-05-31 20:25:56 +02:00
Jani Nikula
aef8dc4398 drm: move i915_pciids.h under include/drm/intel
Clean up the top level include/drm directory by grouping all the Intel
specific files under a common subdirectory.

Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Lucas De Marchi <lucas.demarchi@intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/a19cebc0f03588b9627dcaaebe69a9fef28c27f0.1717075103.git.jani.nikula@intel.com
2024-05-31 16:11:29 +03:00
Jani Nikula
03c7918d0d drm: move i915_drm.h under include/drm/intel
Clean up the top level include/drm directory by grouping all the Intel
specific files under a common subdirectory.

v2: Also fix comment in intel_pci_config.h (Ilpo)

Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Acked-by: Lucas De Marchi <lucas.demarchi@intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/0e344a72e9be596ac2b8b55a26fd674a96f03cdc.1717075103.git.jani.nikula@intel.com
2024-05-31 16:11:09 +03:00
Phil Auld
d40605a682 sched/x86: Export 'percpu arch_freq_scale'
Commit:

  7bc263840b ("sched/topology: Consolidate and clean up access to a CPU's max compute capacity")

removed rq->cpu_capacity_orig in favor of using arch_scale_freq_capacity()
calls. Export the underlying percpu symbol on x86 so that external trace
point helper modules can be made to work again.

Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240530181548.2039216-1-pauld@redhat.com
2024-05-31 11:48:42 +02:00
Dave Hansen
2a38e4ca30 x86/cpu: Provide default cache line size if not enumerated
tl;dr: CPUs with CPUID.80000008H but without CPUID.01H:EDX[CLFSH]
will end up reporting cache_line_size()==0 and bad things happen.
Fill in a default on those to avoid the problem.

Long Story:

The kernel dies a horrible death if c->x86_cache_alignment (aka.
cache_line_size() is 0.  Normally, this value is populated from
c->x86_clflush_size.

Right now the code is set up to get c->x86_clflush_size from two
places.  First, modern CPUs get it from CPUID.  Old CPUs that don't
have leaf 0x80000008 (or CPUID at all) just get some sane defaults
from the kernel in get_cpu_address_sizes().

The vast majority of CPUs that have leaf 0x80000008 also get
->x86_clflush_size from CPUID.  But there are oddballs.

Intel Quark CPUs[1] and others[2] have leaf 0x80000008 but don't set
CPUID.01H:EDX[CLFSH], so they skip over filling in ->x86_clflush_size:

	cpuid(0x00000001, &tfms, &misc, &junk, &cap0);
	if (cap0 & (1<<19))
		c->x86_clflush_size = ((misc >> 8) & 0xff) * 8;

So they: land in get_cpu_address_sizes() and see that CPUID has level
0x80000008 and jump into the side of the if() that does not fill in
c->x86_clflush_size.  That assigns a 0 to c->x86_cache_alignment, and
hilarity ensues in code like:

        buffer = kzalloc(ALIGN(sizeof(*buffer), cache_line_size()),
                         GFP_KERNEL);

To fix this, always provide a sane value for ->x86_clflush_size.

Big thanks to Andy Shevchenko for finding and reporting this and also
providing a first pass at a fix. But his fix was only partial and only
worked on the Quark CPUs.  It would not, for instance, have worked on
the QEMU config.

1. https://raw.githubusercontent.com/InstLatx64/InstLatx64/master/GenuineIntel/GenuineIntel0000590_Clanton_03_CPUID.txt
2. You can also get this behavior if you use "-cpu 486,+clzero"
   in QEMU.

[ dhansen: remove 'vp_bits_from_cpuid' reference in changelog
	   because bpetkov brutally murdered it recently. ]

Fixes: fbf6449f84 ("x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach")
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Tested-by: Jörn Heusipp <osmanx@heusipp.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240516173928.3960193-1-andriy.shevchenko@linux.intel.com/
Link: https://lore.kernel.org/lkml/5e31cad3-ad4d-493e-ab07-724cfbfaba44@heusipp.de/
Link: https://lore.kernel.org/all/20240517200534.8EC5F33E%40davehans-spike.ostc.intel.com
2024-05-30 08:29:45 -07:00
Thomas Gleixner
34bf6bae32 x86/topology/amd: Evaluate SMT in CPUID leaf 0x8000001e only on family 0x17 and greater
The new AMD/HYGON topology parser evaluates the SMT information in CPUID leaf
0x8000001e unconditionally while the original code restricted it to CPUs with
family 0x17 and greater.

This breaks family 0x15 CPUs which advertise that leaf and have a non-zero
value in the SMT section. The machine boots, but the scheduler complains loudly
about the mismatch of the core IDs:

  WARNING: CPU: 1 PID: 0 at kernel/sched/core.c:6482 sched_cpu_starting+0x183/0x250
  WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2408 build_sched_domains+0x76b/0x12b0

Add the condition back to cure it.

  [ bp: Make it actually build because grandpa is not concerned with
    trivial stuff. :-P ]

Fixes: f7fb3b2dd9 ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Closes: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/issues/56
Reported-by: Tim Teichmann <teichmanntim@outlook.de>
Reported-by: Christian Heusel <christian@heusel.eu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tim Teichmann <teichmanntim@outlook.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/7skhx6mwe4hxiul64v6azhlxnokheorksqsdbp7qw6g2jduf6c@7b5pvomauugk
2024-05-30 15:58:55 +02:00
Tony Luck
6568fc18c2 x86/cpu/intel: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20240520224620.9480-29-tony.luck%40intel.com
2024-05-28 10:59:02 -07:00
Alison Schofield
079544ec60 x86/pconfig: Remove unused MKTME pconfig code
Code supporting Intel PCONFIG targets was an early piece of enabling
for MKTME (Multi-Key Total Memory Encryption).

Since MKTME feature enablement did not follow into the kernel, remove
the unused PCONFIG code.

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/all/4ddff30d466785b4adb1400f0518783012835141.1715054189.git.alison.schofield%40intel.com
2024-05-28 08:45:17 -07:00
Alison Schofield
98b83cf0c1 x86/cpu: Remove useless work in detect_tme_early()
TME (Total Memory Encryption) and MKTME (Multi-Key Total Memory
Encryption) BIOS detection were introduced together here [1] and
are loosely coupled in the Intel CPU init code.

TME is a hardware only feature and its BIOS status is all that needs
to be shared with the kernel user: enabled or disabled. The TME
algorithm the BIOS is using and whether or not the kernel recognizes
that algorithm is useless to the kernel user.

MKTME is a hardware feature that requires kernel support. MKTME
detection code was added in advance of broader kernel support for
MKTME that never followed. So, rather than continuing to spew
needless and confusing messages about BIOS MKTME status, remove
most of the MKTME pieces from detect_tme_early().

Keep one useful message: alert the user when BIOS enabled MKTME
reduces the available physical address bits. Recovery of the MKTME
consumed bits requires a reboot with MKTME disabled in BIOS.

There is no functional change for the user, only a change in boot
messages. Below is one example when both TME and MKTME are enabled
in BIOS with AES_XTS_256 which is unknown to the detect tme code.

Before:
[] x86/tme: enabled by BIOS
[] x86/tme: Unknown policy is active: 0x2
[] x86/mktme: No known encryption algorithm is supported: 0x4
[] x86/mktme: enabled by BIOS
[] x86/mktme: 127 KeyIDs available

After:
[] x86/tme: enabled by BIOS
[] x86/mktme: BIOS enable: x86_phys_bits reduced by 8

[1]
commit cb06d8e3d0 ("x86/tme: Detect if TME and MKTME is activated by BIOS")

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/86dfdf6ced8c9b790f9376bf6c7e22b5608f47c2.1715054189.git.alison.schofield%40intel.com
2024-05-28 08:45:17 -07:00
Borislav Petkov (AMD)
0c40b1c7a8 x86/setup: Warn when option parsing is done too early
Commit

  4faa0e5d6d ("x86/boot: Move kernel cmdline setup earlier in the boot process (again)")

fixed and issue where cmdline parsing would happen before the final
boot_command_line string has been built from the builtin and boot
cmdlines and thus cmdline arguments would get lost.

Add a check to catch any future wrong use ordering so that such issues
can be caught in time.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240409152541.GCZhVd9XIPXyTNd9vc@fat_crate.local
2024-05-27 18:54:45 +02:00
Yazen Ghannam
5b9d292ea8 x86/mce: Remove unused variable and return value in machine_check_poll()
The recent CMCI storm handling rework removed the last case that checks
the return value of machine_check_poll().

Therefore the "error_seen" variable is no longer used, so remove it.

Fixes: 3ed57b41a4 ("x86/mce: Remove old CMCI storm mitigation code")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240523155641.2805411-3-yazen.ghannam@amd.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-05-27 10:49:25 +02:00
Yazen Ghannam
ede18982f1 x86/mce/inject: Only write MCA_MISC when a value has been supplied
The MCA_MISC register is used to control the MCA thresholding feature on
AMD systems. Therefore, it is not generally part of the error state that
a user would adjust when testing non-thresholding cases.

However, MCA_MISC is unconditionally written even if a user does not
supply a value. The default value of '0' will be used and clobber the
register.

Write the MCA_MISC register only if the user has given a value for it.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240523155641.2805411-2-yazen.ghannam@amd.com
2024-05-27 10:42:35 +02:00
Linus Torvalds
a0db36ed57 Misc fixes:
- Fix x86 IRQ vector leak caused by a CPU offlining race
 
  - Fix build failure in the riscv-imsic irqchip driver
    caused by an API-change semantic conflict
 
  - Fix use-after-free in irq_find_at_or_after()
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZRwMURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1h/zQ//TTrgyXi6+1xXY4R0LDU45j+wavMTMkq3
 kM3eUeyXgy+FDtvLRVaYgEAYbtuR4LGFN9qmVuEHJPZQwpi3AFlnGFUFjFUvyE43
 xJuOtHoxFv3mj09VgRGsjZvzp8bxYSkEn3h0ryTWGUHzR+QmoQmYWrU6HExgXw3R
 +s8pvi14g6R/+PAy05cF0k1J7aeSsYaOfd38D/XnpyhuhXvPMS2eHgovV6I5Qhk4
 5lV6rzJv8XlKxVr7bOYJkRePE3z0HMtx0G7eo8eYERBQapHede18V8imv4OpUiua
 vmG8cFhF4Lq9KFdEtiVuf1X9/XH3PoEKTGA81oqQ9lLN9USx7ME/Peg6U5ezvEkp
 YmQx2LS12DWqYp5PZQTN0CHnfmMLgksmyGELM3JE/dFFCVh4HdpMrh+2wLwWGRJ3
 JLzAJh3YwcPhayLpNVgsSF9AtLKTkDoS0bHd43mHnB6VaEKkus8zbeuCxYAsUeMJ
 5wCZw3xQjTZEaMMNd1hJN5O/9TX2of+T6Z4C4cacMBmwpD7vX5oXmDYLE/wUHw6m
 9Z67fvOvTdIf3MkYSqjGXFKD1JobL/PmwCfaaGUQFVJkbX5WVNDk6C1zgs5FhmuY
 U/AcYfadbNdLVXrN3VLnX6Gmb7gFPShOAE1GgXGeszSReI4pbOUy2zopRGAEWSZS
 fRu8nyveGjw=
 =vxJh
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2024-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Ingo Molnar:

 - Fix x86 IRQ vector leak caused by a CPU offlining race

 - Fix build failure in the riscv-imsic irqchip driver
   caused by an API-change semantic conflict

 - Fix use-after-free in irq_find_at_or_after()

* tag 'irq-urgent-2024-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/irqdesc: Prevent use-after-free in irq_find_at_or_after()
  genirq/cpuhotplug, x86/vector: Prevent vector leak during CPU offline
  irqchip/riscv-imsic: Fixup riscv_ipi_set_virq_range() conflict
2024-05-25 14:48:40 -07:00
Linus Torvalds
3a390f24b7 Miscellaneous fixes:
- Fix regressions of the new x86 CPU VFM (vendor/family/model)
    enumeration/matching code
 
  - Fix crash kernel detection on buggy firmware with
    non-compliant ACPI MADT tables
 
  - Address Kconfig warning
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZRvZwRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iDhg//fWdgn2x4B4hEiCQQVYGpHLua59jduucb
 oh+NF1jGu75obRu3gdQJLR7OYjxntWf2ryxpS+kkyLP6sYeMAoL4vnIoi5XJGJ13
 VG0BcbXA8asL2KEfVz66AENAGQjAGr7Bcg9urHIuw8Nz6lJqaFyQkdQJfeClWtdL
 zkgnCdooP1eREgfrQH5+hhTCrr/GwiFUU+wNKeIIpis1enMZYMiqA5U23w3DKlP8
 Jx0cRY7ysa63O/H9oD01edRPkZpfbMqAocVwc9v42zOjlJLZYAtAW4mSC+GhG9X6
 iGFWiW1ROBte/HYLE1LdKfahO990Tw0GsIcS42E8AtYfVu/W7U525SyKG100ndYH
 nVoUSOPWF8YCT810YtOEM2ueMQKZMEjB8yAp5QQIi2NMcgkFxNdVQiC8zFATisHd
 KFdEkH2fDGW9YiUNRBYjI/da3Q2v83JwAIKnYXmoFjcru4iJOPDIFdGZcJDh7oNW
 ys/SWSK5dJkbLz+cHm8E5ceLTpZTsFHJm1Vd1W2gU/jkESBW/2i1rZ757ykfHURe
 N7JUPI4g0DOVj8Elket9gnKD/xVFg/lsTnA1/5wxdWhWhJzZcM/XyICATno1/BaY
 STWUmUr6sTsoB4+2PRuFC2zaRqIstLbkmKOAlHezd4uIwFxznQAg1K8f8kHlqfLH
 l3VA8nRbOFc=
 =nIcm
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2024-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:

 - Fix regressions of the new x86 CPU VFM (vendor/family/model)
   enumeration/matching code

 - Fix crash kernel detection on buggy firmware with
   non-compliant ACPI MADT tables

 - Address Kconfig warning

* tag 'x86-urgent-2024-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Fix x86_match_cpu() to match just X86_VENDOR_INTEL
  crypto: x86/aes-xts - switch to new Intel CPU model defines
  x86/topology: Handle bogus ACPI tables correctly
  x86/kconfig: Select ARCH_WANT_FRAME_POINTERS again when UNWINDER_FRAME_POINTER=y
2024-05-25 14:40:09 -07:00
Dongli Zhang
a6c11c0a52 genirq/cpuhotplug, x86/vector: Prevent vector leak during CPU offline
The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
interrupt affinity reconfiguration via procfs. Instead, the change is
deferred until the next instance of the interrupt being triggered on the
original CPU.

When the interrupt next triggers on the original CPU, the new affinity is
enforced within __irq_move_irq(). A vector is allocated from the new CPU,
but the old vector on the original CPU remains and is not immediately
reclaimed. Instead, apicd->move_in_progress is flagged, and the reclaiming
process is delayed until the next trigger of the interrupt on the new CPU.

Upon the subsequent triggering of the interrupt on the new CPU,
irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
remains online. Subsequently, the timer on the old CPU iterates over its
vector_cleanup list, reclaiming old vectors.

However, a rare scenario arises if the old CPU is outgoing before the
interrupt triggers again on the new CPU.

In that case irq_force_complete_move() is not invoked on the outgoing CPU
to reclaim the old apicd->prev_vector because the interrupt isn't currently
affine to the outgoing CPU, and irq_needs_fixup() returns false. Even
though __vector_schedule_cleanup() is later called on the new CPU, it
doesn't reclaim apicd->prev_vector; instead, it simply resets both
apicd->move_in_progress and apicd->prev_vector to 0.

As a result, the vector remains unreclaimed in vector_matrix, leading to a
CPU vector leak.

To address this issue, move the invocation of irq_force_complete_move()
before the irq_needs_fixup() call to reclaim apicd->prev_vector, if the
interrupt is currently or used to be affine to the outgoing CPU.

Additionally, reclaim the vector in __vector_schedule_cleanup() as well,
following a warning message, although theoretically it should never see
apicd->move_in_progress with apicd->prev_cpu pointing to an offline CPU.

Fixes: f0383c24b4 ("genirq/cpuhotplug: Add support for cleaning up move in progress")
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240522220218.162423-1-dongli.zhang@oracle.com
2024-05-23 21:51:50 +02:00
Tony Luck
93022482b2 x86/cpu: Fix x86_match_cpu() to match just X86_VENDOR_INTEL
Code in v6.9 arch/x86/kernel/smpboot.c was changed by commit

  4db64279bc ("x86/cpu: Switch to new Intel CPU model defines") from:

  static const struct x86_cpu_id intel_cod_cpu[] = {
          X86_MATCH_INTEL_FAM6_MODEL(HASWELL_X, 0),       /* COD */
          X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_X, 0),     /* COD */
          X86_MATCH_INTEL_FAM6_MODEL(ANY, 1),             /* SNC */	<--- 443
          {}
  };

  static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
  {
          const struct x86_cpu_id *id = x86_match_cpu(intel_cod_cpu);

to:

  static const struct x86_cpu_id intel_cod_cpu[] = {
           X86_MATCH_VFM(INTEL_HASWELL_X,   0),    /* COD */
           X86_MATCH_VFM(INTEL_BROADWELL_X, 0),    /* COD */
           X86_MATCH_VFM(INTEL_ANY,         1),    /* SNC */
           {}
   };

  static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
  {
          const struct x86_cpu_id *id = x86_match_cpu(intel_cod_cpu);

On an Intel CPU with SNC enabled this code previously matched the rule on line
443 to avoid printing messages about insane cache configuration.  The new code
did not match any rules.

Expanding the macros for the intel_cod_cpu[] array shows that the old is
equivalent to:

  static const struct x86_cpu_id intel_cod_cpu[] = {
  [0] = { .vendor = 0, .family = 6, .model = 0x3F, .steppings = 0, .feature = 0, .driver_data = 0 },
  [1] = { .vendor = 0, .family = 6, .model = 0x4F, .steppings = 0, .feature = 0, .driver_data = 0 },
  [2] = { .vendor = 0, .family = 6, .model = 0x00, .steppings = 0, .feature = 0, .driver_data = 1 },
  [3] = { .vendor = 0, .family = 0, .model = 0x00, .steppings = 0, .feature = 0, .driver_data = 0 }
  }

while the new code expands to:

  static const struct x86_cpu_id intel_cod_cpu[] = {
  [0] = { .vendor = 0, .family = 6, .model = 0x3F, .steppings = 0, .feature = 0, .driver_data = 0 },
  [1] = { .vendor = 0, .family = 6, .model = 0x4F, .steppings = 0, .feature = 0, .driver_data = 0 },
  [2] = { .vendor = 0, .family = 0, .model = 0x00, .steppings = 0, .feature = 0, .driver_data = 1 },
  [3] = { .vendor = 0, .family = 0, .model = 0x00, .steppings = 0, .feature = 0, .driver_data = 0 }
  }

Looking at the code for x86_match_cpu():

  const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match)
  {
           const struct x86_cpu_id *m;
           struct cpuinfo_x86 *c = &boot_cpu_data;

           for (m = match;
                m->vendor | m->family | m->model | m->steppings | m->feature;
                m++) {
       		...
           }
           return NULL;

it is clear that there was no match because the ANY entry in the table (array
index 2) is now the loop termination condition (all of vendor, family, model,
steppings, and feature are zero).

So this code was working before because the "ANY" check was looking for any
Intel CPU in family 6. But fails now because the family is a wild card. So the
root cause is that x86_match_cpu() has never been able to match on a rule with
just X86_VENDOR_INTEL and all other fields set to wildcards.

Add a new flags field to struct x86_cpu_id that has a bit set to indicate that
this entry in the array is valid. Update X86_MATCH*() macros to set that bit.
Change the end-marker check in x86_match_cpu() to just check the flags field
for this bit.

Backporter notes: The commit in Fixes is really the one that is broken:
you can't have m->vendor as part of the loop termination conditional in
x86_match_cpu() because it can happen - as it has happened above
- that that whole conditional is 0 albeit vendor == 0 is a valid case
- X86_VENDOR_INTEL is 0.

However, the only case where the above happens is the SNC check added by
4db64279bc so you only need this fix if you have backported that
other commit

  4db64279bc ("x86/cpu: Switch to new Intel CPU model defines")

Fixes: 644e9cbbe3 ("Add driver auto probing for x86 features v4")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable+noautosel@kernel.org> # see above
Link: https://lore.kernel.org/r/20240517144312.GBZkdtAOuJZCvxhFbJ@fat_crate.local
2024-05-22 11:31:10 +02:00
Jani Nikula
cfa7772880 drm/i915/pciids: switch to xe driver style PCI ID macros
The PCI ID macros in xe_pciids.h allow passing in the macro to operate
on each PCI ID, making it more flexible. Convert i915_pciids.h to the
same pattern.

INTEL_IVB_Q_IDS() for Quanta transcode remains a special case, and
unconditionally uses INTEL_QUANTA_VGA_DEVICE().

Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240515165651.1230465-1-jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2024-05-22 12:12:09 +03:00
Linus Torvalds
f0bae243b2 pci-v6.10-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmZLzNIUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vwr/Q//STe2XGKI8bAKqP2wbbkzm+ISnK4A
 Lqf3FEAIXunxDRspszfXKKV2p4vaIkmOFiwIdtp/kWvd0DQn5+ATXJ/iQtp8aFX/
 R+6BQ7EZc2G7fN5fbQuK54+CvmWEpkKEMbXYbd6ivQ14Cijdb3Nbu+w+DYFjS+6C
 k2a9lS1bTW7Xcy0fyiO1w6GQiWqtmOH8U3OlQtIrI0EVkDG9OG1LsLuc92/FgkOo
 REN+sU+hX1K5fHrvm2CtjYDn/9/B6bJ/It22H1dPgUL9nKvKC67fYzosMtUCOX1M
 6XSPjZIuXOmQGeZXHhpSlVwaidxoUjYO98I7nMquxKdCy6yct3geK7ULG/xeQCgD
 ML7MGQB4+sTiSWalXUQaziKqF1FIDEvU3HMGXFWnoBL5l56eRp8KS1EI9Eqk9pU3
 pk9fJaCkcFnkzPtMFzqPOm5q9zUZ6bGbfYb0hs72TUKplmVDhFo2T1YsW2AOyHZ7
 mjuDzUYZX0H7uM1tntA56IgZX+oNOrLvhBt5L5M/BQeCsZFBBUfIcAEaYoL9LwXO
 AYgIG3jdqzHHyAUzutJF+XHKinJLMHm0XVYbFmO6saPhFzrUJSNHqT7NzW1DGGTl
 OnO8e1WNMX1EcnKvnc6fXyGmM3SgVwy45FsbG/zRnhn4uBKqKtjrh6uX/myA22LK
 CSeqSUK9XmXxFNA=
 =xjoS
 -----END PGP SIGNATURE-----

Merge tag 'pci-v6.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci updates from Bjorn Helgaas:
 "Enumeration:

   - Skip E820 checks for MCFG ECAM regions for new (2016+) machines,
     since there's no requirement to describe them in E820 and some
     platforms require ECAM to work (Bjorn Helgaas)

   - Rename PCI_IRQ_LEGACY to PCI_IRQ_INTX to be more specific (Damien
     Le Moal)

   - Remove last user and pci_enable_device_io() (Heiner Kallweit)

   - Wait for Link Training==0 to avoid possible race (Ilpo Järvinen)

   - Skip waiting for devices that have been disconnected while
     suspended (Ilpo Järvinen)

   - Clear Secondary Status errors after enumeration since Master Aborts
     and Unsupported Request errors are an expected part of enumeration
     (Vidya Sagar)

  MSI:

   - Remove unused IMS (Interrupt Message Store) support (Bjorn Helgaas)

  Error handling:

   - Mask Genesys GL975x SD host controller Replay Timer Timeout
     correctable errors caused by a hardware defect; the errors cause
     interrupts that prevent system suspend (Kai-Heng Feng)

   - Fix EDR-related _DSM support, which previously evaluated revision 5
     but assumed revision 6 behavior (Kuppuswamy Sathyanarayanan)

  ASPM:

   - Simplify link state definitions and mask calculation (Ilpo
     Järvinen)

  Power management:

   - Avoid D3cold for HP Pavilion 17 PC/1972 PCIe Ports, where BIOS
     apparently doesn't know how to put them back in D0 (Mario
     Limonciello)

  CXL:

   - Support resetting CXL devices; special handling required because
     CXL Ports mask Secondary Bus Reset by default (Dave Jiang)

  DOE:

   - Support DOE Discovery Version 2 (Alexey Kardashevskiy)

  Endpoint framework:

   - Set endpoint BAR to be 64-bit if the driver says that's all the
     device supports, in addition to doing so if the size is >2GB
     (Niklas Cassel)

   - Simplify endpoint BAR allocation and setting interfaces (Niklas
     Cassel)

  Cadence PCIe controller driver:

   - Drop DT binding redundant msi-parent and pci-bus.yaml (Krzysztof
     Kozlowski)

  Cadence PCIe endpoint driver:

   - Configure endpoint BARs to be 64-bit based on the BAR type, not the
     BAR value (Niklas Cassel)

  Freescale Layerscape PCIe controller driver:

   - Convert DT binding to YAML (Frank Li)

  MediaTek MT7621 PCIe controller driver:

   - Add DT binding missing 'reg' property for child Root Ports
     (Krzysztof Kozlowski)

   - Fix theoretical string truncation in PHY name (Sergio Paracuellos)

  NVIDIA Tegra194 PCIe controller driver:

   - Return success for endpoint probe instead of falling through to the
     failure path (Vidya Sagar)

  Renesas R-Car PCIe controller driver:

   - Add DT binding missing IOMMU properties (Geert Uytterhoeven)

   - Add DT binding R-Car V4H compatible for host and endpoint mode
     (Yoshihiro Shimoda)

  Rockchip PCIe controller driver:

   - Configure endpoint BARs to be 64-bit based on the BAR type, not the
     BAR value (Niklas Cassel)

   - Add DT binding missing maxItems to ep-gpios (Krzysztof Kozlowski)

   - Set the Subsystem Vendor ID, which was previously zero because it
     was masked incorrectly (Rick Wertenbroek)

  Synopsys DesignWare PCIe controller driver:

   - Restructure DBI register access to accommodate devices where this
     requires Refclk to be active (Manivannan Sadhasivam)

   - Remove the deinit() callback, which was only need by the
     pcie-rcar-gen4, and do it directly in that driver (Manivannan
     Sadhasivam)

   - Add dw_pcie_ep_cleanup() so drivers that support PERST# can clean
     up things like eDMA (Manivannan Sadhasivam)

   - Rename dw_pcie_ep_exit() to dw_pcie_ep_deinit() to make it parallel
     to dw_pcie_ep_init() (Manivannan Sadhasivam)

   - Rename dw_pcie_ep_init_complete() to dw_pcie_ep_init_registers() to
     reflect the actual functionality (Manivannan Sadhasivam)

   - Call dw_pcie_ep_init_registers() directly from all the glue
     drivers, not just those that require active Refclk from the host
     (Manivannan Sadhasivam)

   - Remove the "core_init_notifier" flag, which was an obscure way for
     glue drivers to indicate that they depend on Refclk from the host
     (Manivannan Sadhasivam)

  TI J721E PCIe driver:

   - Add DT binding J784S4 SoC Device ID (Siddharth Vadapalli)

   - Add DT binding J722S SoC support (Siddharth Vadapalli)

  TI Keystone PCIe controller driver:

   - Add DT binding missing num-viewport, phys and phy-name properties
     (Jan Kiszka)

  Miscellaneous:

   - Constify and annotate with __ro_after_init (Heiner Kallweit)

   - Convert DT bindings to YAML (Krzysztof Kozlowski)

   - Check for kcalloc() failure in of_pci_prop_intr_map() (Duoming
     Zhou)"

* tag 'pci-v6.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (97 commits)
  PCI: Do not wait for disconnected devices when resuming
  x86/pci: Skip early E820 check for ECAM region
  PCI: Remove unused pci_enable_device_io()
  ata: pata_cs5520: Remove unnecessary call to pci_enable_device_io()
  PCI: Update pci_find_capability() stub return types
  PCI: Remove PCI_IRQ_LEGACY
  scsi: vmw_pvscsi: Do not use PCI_IRQ_LEGACY instead of PCI_IRQ_LEGACY
  scsi: pmcraid: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
  scsi: mpt3sas: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
  scsi: megaraid_sas: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
  scsi: ipr: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
  scsi: hpsa: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
  scsi: arcmsr: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
  wifi: rtw89: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
  dt-bindings: PCI: rockchip,rk3399-pcie: Add missing maxItems to ep-gpios
  Revert "genirq/msi: Provide constants for PCI/IMS support"
  Revert "x86/apic/msi: Enable PCI/IMS"
  Revert "iommu/vt-d: Enable PCI/IMS"
  Revert "iommu/amd: Enable PCI/IMS"
  Revert "PCI/MSI: Provide IMS (Interrupt Message Store) support"
  ...
2024-05-21 10:09:28 -07:00
Thomas Gleixner
9d22c96316 x86/topology: Handle bogus ACPI tables correctly
The ACPI specification clearly states how the processors should be
enumerated in the MADT:

 "To ensure that the boot processor is supported post initialization,
  two guidelines should be followed. The first is that OSPM should
  initialize processors in the order that they appear in the MADT. The
  second is that platform firmware should list the boot processor as the
  first processor entry in the MADT.
  ...
  Failure of OSPM implementations and platform firmware to abide by
  these guidelines can result in both unpredictable and non optimal
  platform operation."

The kernel relies on that ordering to detect the real BSP on crash kernels
which is important to avoid sending a INIT IPI to it as that would cause a
full machine reset.

On a Dell XPS 16 9640 the BIOS ignores this rule and enumerates the CPUs in
the wrong order. As a consequence the kernel falsely detects a crash kernel
and disables the corresponding CPU.

Prevent this by checking the IA32_APICBASE MSR for the BSP bit on the boot
CPU. If that bit is set, then the MADT based BSP detection can be safely
ignored. If the kernel detects a mismatch between the BSP bit and the first
enumerated MADT entry then emit a firmware bug message.

This obviously also has to be taken into account when the boot APIC ID and
the first enumerated APIC ID match. If the boot CPU does not have the BSP
bit set in the APICBASE MSR then there is no way for the boot CPU to
determine which of the CPUs is the real BSP. Sending an INIT to the real
BSP would reset the machine so the only sane way to deal with that is to
limit the number of CPUs to one and emit a corresponding warning message.

Fixes: 5c5682b9f8 ("x86/cpu: Detect real BSP on crash kernels")
Reported-by: Carsten Tolkmit <ctolkmit@ennit.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Carsten Tolkmit <ctolkmit@ennit.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87le48jycb.ffs@tglx
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218837
2024-05-21 14:52:35 +02:00
Linus Torvalds
41c14f1ac8 Miscellaneous fixes:
- Fix a NOP-patching bug that resulted in valid
    but suboptimal NOP sequences in certain cases.
 
  - Fix build warnings related to fall-through control flow
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZIb60RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i+3A//UGWEycvDubOlFSakMy8Nyh4luUPvRhoX
 SLp/BVgASz8EgXwA8gb5fQILKHIW0HofsEm+IjC+crzy/Sm7HV/GFvG80H59YyKS
 wGnJq6f0HWy5Cm/7zrEgg13nh8jCwIp6sJ/dGgyvGqK7YPpH7dfHFJ9r3ZPY5AT3
 xI1U+IhnWEY4yQLMKiIHONaomnTbyoXcKsr8lmshCw3qSgSF9177onD3DX/uQZ/L
 iO0T1wxRsD92BD2v2tZHJCjBAO/NtJiM2Up6SlZNCaBTDn0oEbUNzfNL+fGDad2X
 Y8TjWQWu7YPN7nXxVj52T0JG4C31A5gQsCQTNiGNKFN8CPuf9qulZSf65VEvGliR
 caYhnEp8wVDwHz0vxB9zVaHh5QVyET5JqmrDGBjjDV/N9s8lYMCaeKxnaeBRPDQZ
 dAFe1TjH9OfiA5PYQGut3ZrjUxqC+Gec3oD/ofhBQjjf8Hi5lWO/4+iXiXhh+UfK
 j6GVbXIQW9S81AKlGDMBQKqE541ibA3tzye+Hdj8fMeDqyXG8R2Movx6KRQVt0wD
 5ctjWDQ4YBSdc8VOEOJj4WhZT1295ff/by7OTVLkW1IN7CbVMu72nyzG8QA8c9At
 35TTEBz+bUipIxohHqhi5WrSLQBgSE/Ns0T6O+GBOXUAWAPIVicuGFatbVnKKTOI
 lJs5oHcSHHs=
 =ZH5x
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2024-05-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:

 - Fix a NOP-patching bug that resulted in valid but suboptimal
   NOP sequences in certain cases

 - Fix build warnings related to fall-through control flow

* tag 'x86-urgent-2024-05-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/alternatives: Use the correct length when optimizing NOPs
  x86/boot: Address clang -Wimplicit-fallthrough in vsprintf()
  x86/boot: Add a fallthrough annotation
2024-05-19 11:42:29 -07:00
Linus Torvalds
61307b7be4 The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs.  Notable
 series include:
 
 - Lucas Stach has provided some page-mapping
   cleanup/consolidation/maintainability work in the series "mm/treewide:
   Remove pXd_huge() API".
 
 - In the series "Allow migrate on protnone reference with
   MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
   MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one
   test.
 
 - In their series "Memory allocation profiling" Kent Overstreet and
   Suren Baghdasaryan have contributed a means of determining (via
   /proc/allocinfo) whereabouts in the kernel memory is being allocated:
   number of calls and amount of memory.
 
 - Matthew Wilcox has provided the series "Various significant MM
   patches" which does a number of rather unrelated things, but in largely
   similar code sites.
 
 - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes
   Weiner has fixed the page allocator's handling of migratetype requests,
   with resulting improvements in compaction efficiency.
 
 - In the series "make the hugetlb migration strategy consistent" Baolin
   Wang has fixed a hugetlb migration issue, which should improve hugetlb
   allocation reliability.
 
 - Liu Shixin has hit an I/O meltdown caused by readahead in a
   memory-tight memcg.  Addressed in the series "Fix I/O high when memory
   almost met memcg limit".
 
 - In the series "mm/filemap: optimize folio adding and splitting" Kairui
   Song has optimized pagecache insertion, yielding ~10% performance
   improvement in one test.
 
 - Baoquan He has cleaned up and consolidated the early zone
   initialization code in the series "mm/mm_init.c: refactor
   free_area_init_core()".
 
 - Baoquan has also redone some MM initializatio code in the series
   "mm/init: minor clean up and improvement".
 
 - MM helper cleanups from Christoph Hellwig in his series "remove
   follow_pfn".
 
 - More cleanups from Matthew Wilcox in the series "Various page->flags
   cleanups".
 
 - Vlastimil Babka has contributed maintainability improvements in the
   series "memcg_kmem hooks refactoring".
 
 - More folio conversions and cleanups in Matthew Wilcox's series
 
 	"Convert huge_zero_page to huge_zero_folio"
 	"khugepaged folio conversions"
 	"Remove page_idle and page_young wrappers"
 	"Use folio APIs in procfs"
 	"Clean up __folio_put()"
 	"Some cleanups for memory-failure"
 	"Remove page_mapping()"
 	"More folio compat code removal"
 
 - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb
   functions to work on folis".
 
 - Code consolidation and cleanup work related to GUP's handling of
   hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".
 
 - Rick Edgecombe has developed some fixes to stack guard gaps in the
   series "Cover a guard gap corner case".
 
 - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series
   "mm/ksm: fix ksm exec support for prctl".
 
 - Baolin Wang has implemented NUMA balancing for multi-size THPs.  This
   is a simple first-cut implementation for now.  The series is "support
   multi-size THP numa balancing".
 
 - Cleanups to vma handling helper functions from Matthew Wilcox in the
   series "Unify vma_address and vma_pgoff_address".
 
 - Some selftests maintenance work from Dev Jain in the series
   "selftests/mm: mremap_test: Optimizations and style fixes".
 
 - Improvements to the swapping of multi-size THPs from Ryan Roberts in
   the series "Swap-out mTHP without splitting".
 
 - Kefeng Wang has significantly optimized the handling of arm64's
   permission page faults in the series
 
 	"arch/mm/fault: accelerate pagefault when badaccess"
 	"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"
 
 - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it
   GUP-fast".
 
 - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to
   use struct vm_fault".
 
 - selftests build fixes from John Hubbard in the series "Fix
   selftests/mm build without requiring "make headers"".
 
 - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
   series "Improved Memory Tier Creation for CPUless NUMA Nodes".  Fixes
   the initialization code so that migration between different memory types
   works as intended.
 
 - David Hildenbrand has improved follow_pte() and fixed an errant driver
   in the series "mm: follow_pte() improvements and acrn follow_pte()
   fixes".
 
 - David also did some cleanup work on large folio mapcounts in his
   series "mm: mapcount for large folios + page_mapcount() cleanups".
 
 - Folio conversions in KSM in Alex Shi's series "transfer page to folio
   in KSM".
 
 - Barry Song has added some sysfs stats for monitoring multi-size THP's
   in the series "mm: add per-order mTHP alloc and swpout counters".
 
 - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled
   and limit checking cleanups".
 
 - Matthew Wilcox has been looking at buffer_head code and found the
   documentation to be lacking.  The series is "Improve buffer head
   documentation".
 
 - Multi-size THPs get more work, this time from Lance Yang.  His series
   "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes
   the freeing of these things.
 
 - Kemeng Shi has added more userspace-visible writeback instrumentation
   in the series "Improve visibility of writeback".
 
 - Kemeng Shi then sent some maintenance work on top in the series "Fix
   and cleanups to page-writeback".
 
 - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the
   series "Improve anon_vma scalability for anon VMAs".  Intel's test bot
   reported an improbable 3x improvement in one test.
 
 - SeongJae Park adds some DAMON feature work in the series
 
 	"mm/damon: add a DAMOS filter type for page granularity access recheck"
 	"selftests/damon: add DAMOS quota goal test"
 
 - Also some maintenance work in the series
 
 	"mm/damon/paddr: simplify page level access re-check for pageout"
 	"mm/damon: misc fixes and improvements"
 
 - David Hildenbrand has disabled some known-to-fail selftests ni the
   series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL".
 
 - memcg metadata storage optimizations from Shakeel Butt in "memcg:
   reduce memory consumption by memcg stats".
 
 - DAX fixes and maintenance work from Vishal Verma in the series
   "dax/bus.c: Fixups for dax-bus locking".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZkgQYwAKCRDdBJ7gKXxA
 jrdKAP9WVJdpEcXxpoub/vVE0UWGtffr8foifi9bCwrQrGh5mgEAx7Yf0+d/oBZB
 nvA4E0DcPrUAFy144FNM0NTCb7u9vAw=
 =V3R/
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:
 "The usual shower of singleton fixes and minor series all over MM,
  documented (hopefully adequately) in the respective changelogs.
  Notable series include:

   - Lucas Stach has provided some page-mapping cleanup/consolidation/
     maintainability work in the series "mm/treewide: Remove pXd_huge()
     API".

   - In the series "Allow migrate on protnone reference with
     MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
     MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
     one test.

   - In their series "Memory allocation profiling" Kent Overstreet and
     Suren Baghdasaryan have contributed a means of determining (via
     /proc/allocinfo) whereabouts in the kernel memory is being
     allocated: number of calls and amount of memory.

   - Matthew Wilcox has provided the series "Various significant MM
     patches" which does a number of rather unrelated things, but in
     largely similar code sites.

   - In his series "mm: page_alloc: freelist migratetype hygiene"
     Johannes Weiner has fixed the page allocator's handling of
     migratetype requests, with resulting improvements in compaction
     efficiency.

   - In the series "make the hugetlb migration strategy consistent"
     Baolin Wang has fixed a hugetlb migration issue, which should
     improve hugetlb allocation reliability.

   - Liu Shixin has hit an I/O meltdown caused by readahead in a
     memory-tight memcg. Addressed in the series "Fix I/O high when
     memory almost met memcg limit".

   - In the series "mm/filemap: optimize folio adding and splitting"
     Kairui Song has optimized pagecache insertion, yielding ~10%
     performance improvement in one test.

   - Baoquan He has cleaned up and consolidated the early zone
     initialization code in the series "mm/mm_init.c: refactor
     free_area_init_core()".

   - Baoquan has also redone some MM initializatio code in the series
     "mm/init: minor clean up and improvement".

   - MM helper cleanups from Christoph Hellwig in his series "remove
     follow_pfn".

   - More cleanups from Matthew Wilcox in the series "Various
     page->flags cleanups".

   - Vlastimil Babka has contributed maintainability improvements in the
     series "memcg_kmem hooks refactoring".

   - More folio conversions and cleanups in Matthew Wilcox's series:
	"Convert huge_zero_page to huge_zero_folio"
	"khugepaged folio conversions"
	"Remove page_idle and page_young wrappers"
	"Use folio APIs in procfs"
	"Clean up __folio_put()"
	"Some cleanups for memory-failure"
	"Remove page_mapping()"
	"More folio compat code removal"

   - David Hildenbrand chipped in with "fs/proc/task_mmu: convert
     hugetlb functions to work on folis".

   - Code consolidation and cleanup work related to GUP's handling of
     hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".

   - Rick Edgecombe has developed some fixes to stack guard gaps in the
     series "Cover a guard gap corner case".

   - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
     series "mm/ksm: fix ksm exec support for prctl".

   - Baolin Wang has implemented NUMA balancing for multi-size THPs.
     This is a simple first-cut implementation for now. The series is
     "support multi-size THP numa balancing".

   - Cleanups to vma handling helper functions from Matthew Wilcox in
     the series "Unify vma_address and vma_pgoff_address".

   - Some selftests maintenance work from Dev Jain in the series
     "selftests/mm: mremap_test: Optimizations and style fixes".

   - Improvements to the swapping of multi-size THPs from Ryan Roberts
     in the series "Swap-out mTHP without splitting".

   - Kefeng Wang has significantly optimized the handling of arm64's
     permission page faults in the series
	"arch/mm/fault: accelerate pagefault when badaccess"
	"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"

   - GUP cleanups from David Hildenbrand in "mm/gup: consistently call
     it GUP-fast".

   - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
     path to use struct vm_fault".

   - selftests build fixes from John Hubbard in the series "Fix
     selftests/mm build without requiring "make headers"".

   - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
     series "Improved Memory Tier Creation for CPUless NUMA Nodes".
     Fixes the initialization code so that migration between different
     memory types works as intended.

   - David Hildenbrand has improved follow_pte() and fixed an errant
     driver in the series "mm: follow_pte() improvements and acrn
     follow_pte() fixes".

   - David also did some cleanup work on large folio mapcounts in his
     series "mm: mapcount for large folios + page_mapcount() cleanups".

   - Folio conversions in KSM in Alex Shi's series "transfer page to
     folio in KSM".

   - Barry Song has added some sysfs stats for monitoring multi-size
     THP's in the series "mm: add per-order mTHP alloc and swpout
     counters".

   - Some zswap cleanups from Yosry Ahmed in the series "zswap
     same-filled and limit checking cleanups".

   - Matthew Wilcox has been looking at buffer_head code and found the
     documentation to be lacking. The series is "Improve buffer head
     documentation".

   - Multi-size THPs get more work, this time from Lance Yang. His
     series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
     optimizes the freeing of these things.

   - Kemeng Shi has added more userspace-visible writeback
     instrumentation in the series "Improve visibility of writeback".

   - Kemeng Shi then sent some maintenance work on top in the series
     "Fix and cleanups to page-writeback".

   - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
     the series "Improve anon_vma scalability for anon VMAs". Intel's
     test bot reported an improbable 3x improvement in one test.

   - SeongJae Park adds some DAMON feature work in the series
	"mm/damon: add a DAMOS filter type for page granularity access recheck"
	"selftests/damon: add DAMOS quota goal test"

   - Also some maintenance work in the series
	"mm/damon/paddr: simplify page level access re-check for pageout"
	"mm/damon: misc fixes and improvements"

   - David Hildenbrand has disabled some known-to-fail selftests ni the
     series "selftests: mm: cow: flag vmsplice() hugetlb tests as
     XFAIL".

   - memcg metadata storage optimizations from Shakeel Butt in "memcg:
     reduce memory consumption by memcg stats".

   - DAX fixes and maintenance work from Vishal Verma in the series
     "dax/bus.c: Fixups for dax-bus locking""

* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
  memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
  selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
  selftests: cgroup: add tests to verify the zswap writeback path
  mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
  mm/damon/core: fix return value from damos_wmark_metric_value
  mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
  selftests: cgroup: remove redundant enabling of memory controller
  Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
  Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
  Docs/mm/damon/design: use a list for supported filters
  Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
  Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
  selftests/damon: classify tests for functionalities and regressions
  selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
  selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
  selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
  mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
  selftests/damon: add a test for DAMOS quota goal
  ...
2024-05-19 09:21:03 -07:00
Linus Torvalds
ff9a79307f Kbuild updates for v6.10
- Avoid 'constexpr', which is a keyword in C23
 
  - Allow 'dtbs_check' and 'dt_compatible_check' run independently of
    'dt_binding_check'
 
  - Fix weak references to avoid GOT entries in position-independent
    code generation
 
  - Convert the last use of 'optional' property in arch/sh/Kconfig
 
  - Remove support for the 'optional' property in Kconfig
 
  - Remove support for Clang's ThinLTO caching, which does not work with
    the .incbin directive
 
  - Change the semantics of $(src) so it always points to the source
    directory, which fixes Makefile inconsistencies between upstream and
    downstream
 
  - Fix 'make tar-pkg' for RISC-V to produce a consistent package
 
  - Provide reasonable default coverage for objtool, sanitizers, and
    profilers
 
  - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc.
 
  - Remove the last use of tristate choice in drivers/rapidio/Kconfig
 
  - Various cleanups and fixes in Kconfig
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmZFlGcVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsG8voQALC8NtFpduWVfLRj2Qg6Ll/xf1vX
 2igcTJEOFHkeqXLGoT8dTDKLEipUBUvKyguPq66CGwVTe2g6zy/nUSXeVtFrUsIa
 msLTi8FqhqUo5lodNvGMRf8qqmuqcvnXoiQwIocF92jtsFy14bhiFY+n4HfcFNjj
 GOKwqBZYQUwY/VVb090efc7RfS9c7uwABJSBelSoxg3AGZriwjGy7Pw5aSKGgVYi
 inqL1eR6qwPP6z7CgQWM99soP+zwybFZmnQrsD9SniRBI4rtAat8Ih5jQFaSUFUQ
 lk2w0NQBRFN88/uR2IJ2GWuIlQ74WeJ+QnCqVuQ59tV5zw90wqSmLzngfPD057Dv
 JjNuhk0UyXVtpIg3lRtd4810ppNSTe33b9OM4O2H846W/crju5oDRNDHcflUXcwm
 Rmn5ho1rb5QVzDVejJbgwidnUInSgJ9PZcvXQ/RJVZPhpgsBzAY9pQexG1G3hviw
 y9UDrt6KP6bF9tHjmolmtdIes9Pj0c4dN6/Rdj4HS4hIQ/GDar0tnwvOvtfUctNL
 orJlBsA6GeMmDVXKkR0ytOCWRYqWWbyt8g70RVKQJfuHX7/hGyAQPaQ2/u4mQhC2
 aevYfbNJMj0VDfGz81HDBKFtkc5n+Ite8l157dHEl2LEabkOkRdNVcn7SNbOvZmd
 ZCSnZ31h7woGfNho
 =D5B/
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - Avoid 'constexpr', which is a keyword in C23

 - Allow 'dtbs_check' and 'dt_compatible_check' run independently of
   'dt_binding_check'

 - Fix weak references to avoid GOT entries in position-independent code
   generation

 - Convert the last use of 'optional' property in arch/sh/Kconfig

 - Remove support for the 'optional' property in Kconfig

 - Remove support for Clang's ThinLTO caching, which does not work with
   the .incbin directive

 - Change the semantics of $(src) so it always points to the source
   directory, which fixes Makefile inconsistencies between upstream and
   downstream

 - Fix 'make tar-pkg' for RISC-V to produce a consistent package

 - Provide reasonable default coverage for objtool, sanitizers, and
   profilers

 - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc.

 - Remove the last use of tristate choice in drivers/rapidio/Kconfig

 - Various cleanups and fixes in Kconfig

* tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits)
  kconfig: use sym_get_choice_menu() in sym_check_prop()
  rapidio: remove choice for enumeration
  kconfig: lxdialog: remove initialization with A_NORMAL
  kconfig: m/nconf: merge two item_add_str() calls
  kconfig: m/nconf: remove dead code to display value of bool choice
  kconfig: m/nconf: remove dead code to display children of choice members
  kconfig: gconf: show checkbox for choice correctly
  kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal
  Makefile: remove redundant tool coverage variables
  kbuild: provide reasonable defaults for tool coverage
  modules: Drop the .export_symbol section from the final modules
  kconfig: use menu_list_for_each_sym() in sym_check_choice_deps()
  kconfig: use sym_get_choice_menu() in conf_write_defconfig()
  kconfig: add sym_get_choice_menu() helper
  kconfig: turn defaults and additional prompt for choice members into error
  kconfig: turn missing prompt for choice members into error
  kconfig: turn conf_choice() into void function
  kconfig: use linked list in sym_set_changed()
  kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED
  kconfig: gconf: remove debug code
  ...
2024-05-18 12:39:20 -07:00
Linus Torvalds
70a663205d Probes updates for v6.10:
- tracing/probes: Adding new pseudo-types %pd and %pD support for dumping
   dentry name from 'struct dentry *' and file name from 'struct file *'.
 
 - uprobes: Some performance optimizations have been done.
  . Speed up the BPF uprobe event by delaying the fetching of the uprobe
    event arguments that are not used in BPF.
  . Avoid locking by speculatively checking whether uprobe event is valid.
  . Reduce lock contention by using read/write_lock instead of spinlock for
    uprobe list operation. This improved BPF uprobe benchmark result 43% on
    average.
 
 - rethook: Removes non-fatal warning messages when tracing stack from BPF
   and skip rcu_is_watching() validation in rethook if possible.
 
 - objpool: Optimizing objpool (which is used by kretprobes and fprobe as
   rethook backend storage) by inlining functions and avoid caching nr_cpu_ids
   because it is a const value.
 
 - fprobe: Add entry/exit callbacks types (code cleanup)
 - kprobes: Check ftrace was killed in kprobes if it uses ftrace.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmZFUxsbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b+fIH/A96/SeC5WRLhXmHfTCM
 IvKUea2n0b0oV/2pVfHqfkCBTICuUZ97Opd9VH9jLtjBOTh0fUOGZ2DNVGdSYfWm
 IIkS5dhuZxHXrSHEVYykwLHI3AOL7Q6Ny9EmOg1CNMidUkPMNtBvppsBYPlFU/B/
 qQJAvOdkVOnNITCaas0+MNgepoVVKdJzdNQ1I4WrGyG8isCZBaCYKo2QcGyheCNN
 y8NXvnVHgmgHQ8nTaeE5AawclFzFnhwHfPQPe1kiyGrx15b8K+VYmaZxPKv33A1a
 KT3TKJ1Ep7s7iWFh2iPVJzIwOXCmSnvNTKfNx/MDuKtO7UVfFwytoMEaekbmv3bG
 VqM=
 =n/mW
 -----END PGP SIGNATURE-----

Merge tag 'probes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:

 - tracing/probes: Add new pseudo-types %pd and %pD support for dumping
   dentry name from 'struct dentry *' and file name from 'struct file *'

 - uprobes performance optimizations:
    - Speed up the BPF uprobe event by delaying the fetching of the
      uprobe event arguments that are not used in BPF
    - Avoid locking by speculatively checking whether uprobe event is
      valid
    - Reduce lock contention by using read/write_lock instead of
      spinlock for uprobe list operation. This improved BPF uprobe
      benchmark result 43% on average

 - rethook: Remove non-fatal warning messages when tracing stack from
   BPF and skip rcu_is_watching() validation in rethook if possible

 - objpool: Optimize objpool (which is used by kretprobes and fprobe as
   rethook backend storage) by inlining functions and avoid caching
   nr_cpu_ids because it is a const value

 - fprobe: Add entry/exit callbacks types (code cleanup)

 - kprobes: Check ftrace was killed in kprobes if it uses ftrace

* tag 'probes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  kprobe/ftrace: bail out if ftrace was killed
  selftests/ftrace: Fix required features for VFS type test case
  objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids
  objpool: enable inlining objpool_push() and objpool_pop() operations
  rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
  ftrace: make extra rcu_is_watching() validation check optional
  uprobes: reduce contention on uprobes_tree access
  rethook: Remove warning messages printed for finding return address of a frame.
  fprobe: Add entry/exit callbacks types
  selftests/ftrace: add fprobe test cases for VFS type "%pd" and "%pD"
  selftests/ftrace: add kprobe test cases for VFS type "%pd" and "%pD"
  Documentation: tracing: add new type '%pd' and '%pD' for kprobe
  tracing/probes: support '%pD' type for print struct file's name
  tracing/probes: support '%pd' type for print struct dentry's name
  uprobes: add speculative lockless system-wide uprobe filter check
  uprobes: prepare uprobe args buffer lazily
  uprobes: encapsulate preparation of uprobe args buffer
2024-05-17 18:29:30 -07:00
Linus Torvalds
ff2632d7d0 powerpc updates for 6.10
- Enable BPF Kernel Functions (kfuncs) in the powerpc BPF JIT.
 
  - Allow per-process DEXCR (Dynamic Execution Control Register) settings via
    prctl, notably NPHIE which controls hashst/hashchk for ROP protection.
 
  - Install powerpc selftests in sub-directories. Note this changes the way
    run_kselftest.sh needs to be invoked for powerpc selftests.
 
  - Change fadump (Firmware Assisted Dump) to better handle memory add/remove.
 
  - Add support for passing additional parameters to the fadump kernel.
 
  - Add support for updating the kdump image on CPU/memory add/remove events.
 
  - Other small features, cleanups and fixes.
 
 Thanks to: Andrew Donnellan, Andy Shevchenko, Aneesh Kumar K.V, Arnd Bergmann,
 Benjamin Gray, Bjorn Helgaas, Christian Zigotzky, Christophe Jaillet, Christophe
 Leroy, Colin Ian King, Cédric Le Goater, Dr. David Alan Gilbert, Erhard Furtner,
 Frank Li, GUO Zihua, Ganesh Goudar, Geoff Levand, Ghanshyam Agrawal, Greg Kurz,
 Hari Bathini, Joel Stanley, Justin Stitt, Kunwu Chan, Li Yang, Lidong Zhong,
 Madhavan Srinivasan, Mahesh Salgaonkar, Masahiro Yamada, Matthias Schiffer,
 Naresh Kamboju, Nathan Chancellor, Nathan Lynch, Naveen N Rao, Nicholas
 Miehlbradt, Ran Wang, Randy Dunlap, Ritesh Harjani, Sachin Sant, Shirisha Ganta,
 Shrikanth Hegde, Sourabh Jain, Stephen Rothwell, sundar, Thorsten Blum, Vaibhav
 Jain, Xiaowei Bao, Yang Li, Zhao Chenhui.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmZHLtwTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgCGdD/0cqQkYl6+E0/K68Y7jnAWF+l0LNFlm
 /4jZ+zKXPiPhSdaQq4xo2ZjEooUPsm3c+AHidmrAtOMBULvv4pyciu61hrVu4Y2b
 aAudkBMUc+i/Lfaz7fq1KnN4LDFVm7xZZ+i/ju9tOBLMpOZ3YZ+YoOGA6nqsshJF
 XuB5h0T+H55he1wBpvyyrsUUyss53Mp3IsajxdwBOsUDDp0fSAg8SLEyhoiK3BsQ
 EjEa6iEqJSBheqFEXPvqsMuqM3k51CHe/pCOMODjo7P+u/MNrClZUscZKXGB5xq9
 Bu3SPxIYfRmU4XE53517faElEPmlxSBrjQGCD1EGEVXGsjn6r7TD6R5voow3SoUq
 CLTy90KNNrS1cIqeomu6bJ/anzYrViqTdekImA7Vb+Ol8f+uT9l+l1D75eYOKPQ3
 N0AHoa4rnWIb5kjCAjHaZ54O+B2q2tPlQqFUmt+BrvZyKS13zjE36stnArxP3MPC
 Xw6y3huX3AkZiJ4mQYRiBn//xGOLwrRCd/EoTDnoe08yq0Hoor6qIm4uEy2Nu3Kf
 0mBsEOxMsmQd6NEq43B/sFgVbbxKhAyxfZ9gHqxDQZcgoxXcMesyj/n4+jM5sRYK
 zmavLlykM2Tjlh1evs8+e0mCEwDjDn2GRlqstJQTrmnGhbMKi3jvw9I7gGtZVqbS
 kAflTXzsIXvxBA==
 =GoCV
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

 - Enable BPF Kernel Functions (kfuncs) in the powerpc BPF JIT.

 - Allow per-process DEXCR (Dynamic Execution Control Register) settings
   via prctl, notably NPHIE which controls hashst/hashchk for ROP
   protection.

 - Install powerpc selftests in sub-directories. Note this changes the
   way run_kselftest.sh needs to be invoked for powerpc selftests.

 - Change fadump (Firmware Assisted Dump) to better handle memory
   add/remove.

 - Add support for passing additional parameters to the fadump kernel.

 - Add support for updating the kdump image on CPU/memory add/remove
   events.

 - Other small features, cleanups and fixes.

Thanks to Andrew Donnellan, Andy Shevchenko, Aneesh Kumar K.V, Arnd
Bergmann, Benjamin Gray, Bjorn Helgaas, Christian Zigotzky, Christophe
Jaillet, Christophe Leroy, Colin Ian King, Cédric Le Goater, Dr. David
Alan Gilbert, Erhard Furtner, Frank Li, GUO Zihua, Ganesh Goudar, Geoff
Levand, Ghanshyam Agrawal, Greg Kurz, Hari Bathini, Joel Stanley, Justin
Stitt, Kunwu Chan, Li Yang, Lidong Zhong, Madhavan Srinivasan, Mahesh
Salgaonkar, Masahiro Yamada, Matthias Schiffer, Naresh Kamboju, Nathan
Chancellor, Nathan Lynch, Naveen N Rao, Nicholas Miehlbradt, Ran Wang,
Randy Dunlap, Ritesh Harjani, Sachin Sant, Shirisha Ganta, Shrikanth
Hegde, Sourabh Jain, Stephen Rothwell, sundar, Thorsten Blum, Vaibhav
Jain, Xiaowei Bao, Yang Li, and Zhao Chenhui.

* tag 'powerpc-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (85 commits)
  powerpc/fadump: Fix section mismatch warning
  powerpc/85xx: fix compile error without CONFIG_CRASH_DUMP
  powerpc/fadump: update documentation about bootargs_append
  powerpc/fadump: pass additional parameters when fadump is active
  powerpc/fadump: setup additional parameters for dump capture kernel
  powerpc/pseries/fadump: add support for multiple boot memory regions
  selftests/powerpc/dexcr: Fix spelling mistake "predicition" -> "prediction"
  KVM: PPC: Book3S HV nestedv2: Fix an error handling path in gs_msg_ops_kvmhv_nestedv2_config_fill_info()
  KVM: PPC: Fix documentation for ppc mmu caps
  KVM: PPC: code cleanup for kvmppc_book3s_irqprio_deliver
  KVM: PPC: Book3S HV nestedv2: Cancel pending DEC exception
  powerpc/xmon: Check cpu id in commands "c#", "dp#" and "dx#"
  powerpc/code-patching: Use dedicated memory routines for patching
  powerpc/code-patching: Test patch_instructions() during boot
  powerpc64/kasan: Pass virtual addresses to kasan_init_phys_region()
  powerpc: rename SPRN_HID2 define to SPRN_HID2_750FX
  powerpc: Fix typos
  powerpc/eeh: Fix spelling of the word "auxillary" and update comment
  macintosh/ams: Fix unused variable warning
  powerpc/Makefile: Remove bits related to the previous use of -mcmodel=large
  ...
2024-05-17 09:05:46 -07:00
Borislav Petkov (AMD)
9dba9c67e5 x86/alternatives: Use the correct length when optimizing NOPs
Commit in Fixes moved the optimize_nops() call inside apply_relocation()
and made it a second optimization pass after the relocations have been
done.

Since optimize_nops() works only on NOPs, that is fine and it'll simply
jump over instructions which are not NOPs.

However, it made that call with repl_len as the buffer length to
optimize.

However, it can happen that there are alternatives calls like this one:

  alternative("mfence; lfence", "", ALT_NOT(X86_FEATURE_APIC_MSRS_FENCE));

where the replacement length is 0. And using repl_len is wrong because
apply_alternatives() expands the buffer size to the length of the source
insn that is being patched, by padding it with one-byte NOPs:

	for (; insn_buff_sz < a->instrlen; insn_buff_sz++)
		insn_buff[insn_buff_sz] = 0x90;

Long story short: pass the length of the original instruction(s) as the
length of the temporary buffer which to optimize.

Result:

  SMP alternatives: feat: 11*32+27, old: (lapic_next_deadline+0x9/0x50 (ffffffff81061829) len: 6), repl: (ffffffff89b1cc60, len: 0) flags: 0x1
  SMP alternatives: ffffffff81061829:   old_insn: 0f ae f0 0f ae e8
  SMP alternatives: ffffffff81061829: final_insn: 90 90 90 90 90 90

=>

  SMP alternatives: feat: 11*32+27, old: (lapic_next_deadline+0x9/0x50 (ffffffff81061839) len: 6), repl: (ffffffff89b1cc60, len: 0) flags: 0x1
  SMP alternatives: ffffffff81061839: [0:6) optimized NOPs: 66 0f 1f 44 00 00
  SMP alternatives: ffffffff81061839:   old_insn: 0f ae f0 0f ae e8
  SMP alternatives: ffffffff81061839: final_insn: 66 0f 1f 44 00 00

Fixes: da8f9cf7e7 ("x86/alternatives: Get rid of __optimize_nops()")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240515104804.32004-1-bp@kernel.org
2024-05-17 09:27:06 +02:00
Stephen Brennan
1a7d0890dd kprobe/ftrace: bail out if ftrace was killed
If an error happens in ftrace, ftrace_kill() will prevent disarming
kprobes. Eventually, the ftrace_ops associated with the kprobes will be
freed, yet the kprobes will still be active, and when triggered, they
will use the freed memory, likely resulting in a page fault and panic.

This behavior can be reproduced quite easily, by creating a kprobe and
then triggering a ftrace_kill(). For simplicity, we can simulate an
ftrace error with a kernel module like [1]:

[1]: https://github.com/brenns10/kernel_stuff/tree/master/ftrace_killer

  sudo perf probe --add commit_creds
  sudo perf trace -e probe:commit_creds
  # In another terminal
  make
  sudo insmod ftrace_killer.ko  # calls ftrace_kill(), simulating bug
  # Back to perf terminal
  # ctrl-c
  sudo perf probe --del commit_creds

After a short period, a page fault and panic would occur as the kprobe
continues to execute and uses the freed ftrace_ops. While ftrace_kill()
is supposed to be used only in extreme circumstances, it is invoked in
FTRACE_WARN_ON() and so there are many places where an unexpected bug
could be triggered, yet the system may continue operating, possibly
without the administrator noticing. If ftrace_kill() does not panic the
system, then we should do everything we can to continue operating,
rather than leave a ticking time bomb.

Link: https://lore.kernel.org/all/20240501162956.229427-1-stephen.s.brennan@oracle.com/

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-16 07:23:30 +09:00
Bjorn Helgaas
850aae933c Revert "x86/apic/msi: Enable PCI/IMS"
This reverts commit 6e24c88773.

IMS (Interrupt Message Store) support appeared in v6.2, but there are no
users yet.

Remove it for now.  We can add it back when a user comes along.

Link: https://lore.kernel.org/r/20240410221307.2162676-7-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2024-05-15 17:02:04 -05:00
Linus Torvalds
f4b0c4b508 ARM:
* Move a lot of state that was previously stored on a per vcpu
   basis into a per-CPU area, because it is only pertinent to the
   host while the vcpu is loaded. This results in better state
   tracking, and a smaller vcpu structure.
 
 * Add full handling of the ERET/ERETAA/ERETAB instructions in
   nested virtualisation. The last two instructions also require
   emulating part of the pointer authentication extension.
   As a result, the trap handling of pointer authentication has
   been greatly simplified.
 
 * Turn the global (and not very scalable) LPI translation cache
   into a per-ITS, scalable cache, making non directly injected
   LPIs much cheaper to make visible to the vcpu.
 
 * A batch of pKVM patches, mostly fixes and cleanups, as the
   upstreaming process seems to be resuming. Fingers crossed!
 
 * Allocate PPIs and SGIs outside of the vcpu structure, allowing
   for smaller EL2 mapping and some flexibility in implementing
   more or less than 32 private IRQs.
 
 * Purge stale mpidr_data if a vcpu is created after the MPIDR
   map has been created.
 
 * Preserve vcpu-specific ID registers across a vcpu reset.
 
 * Various minor cleanups and improvements.
 
 LoongArch:
 
 * Add ParaVirt IPI support.
 
 * Add software breakpoint support.
 
 * Add mmio trace events support.
 
 RISC-V:
 
 * Support guest breakpoints using ebreak
 
 * Introduce per-VCPU mp_state_lock and reset_cntx_lock
 
 * Virtualize SBI PMU snapshot and counter overflow interrupts
 
 * New selftests for SBI PMU and Guest ebreak
 
 * Some preparatory work for both TDX and SNP page fault handling.
   This also cleans up the page fault path, so that the priorities
   of various kinds of fauls (private page, no memory, write
   to read-only slot, etc.) are easier to follow.
 
 x86:
 
 * Minimize amount of time that shadow PTEs remain in the special
   REMOVED_SPTE state.  This is a state where the mmu_lock is held for
   reading but concurrent accesses to the PTE have to spin; shortening
   its use allows other vCPUs to repopulate the zapped region while
   the zapper finishes tearing down the old, defunct page tables.
 
 * Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field,
   which is defined by hardware but left for software use.  This lets KVM
   communicate its inability to map GPAs that set bits 51:48 on hosts
   without 5-level nested page tables.  Guest firmware is expected to
   use the information when mapping BARs; this avoids that they end up at
   a legal, but unmappable, GPA.
 
 * Fixed a bug where KVM would not reject accesses to MSR that aren't
   supposed to exist given the vCPU model and/or KVM configuration.
 
 * As usual, a bunch of code cleanups.
 
 x86 (AMD):
 
 * Implement a new and improved API to initialize SEV and SEV-ES VMs, which
   will also be extendable to SEV-SNP.  The new API specifies the desired
   encryption in KVM_CREATE_VM and then separately initializes the VM.
   The new API also allows customizing the desired set of VMSA features;
   the features affect the measurement of the VM's initial state, and
   therefore enabling them cannot be done tout court by the hypervisor.
 
   While at it, the new API includes two bugfixes that couldn't be
   applied to the old one without a flag day in userspace or without
   affecting the initial measurement.  When a SEV-ES VM is created with
   the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are
   rejected once the VMSA has been encrypted.  Also, the FPU and AVX
   state will be synchronized and encrypted too.
 
 * Support for GHCB version 2 as applicable to SEV-ES guests.  This, once
   more, is only accessible when using the new KVM_SEV_INIT2 flow for
   initialization of SEV-ES VMs.
 
 x86 (Intel):
 
 * An initial bunch of prerequisite patches for Intel TDX were merged.
   They generally don't do anything interesting.  The only somewhat user
   visible change is a new debugging mode that checks that KVM's MMU
   never triggers a #VE virtualization exception in the guest.
 
 * Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to
   L1, as per the SDM.
 
 Generic:
 
 * Use vfree() instead of kvfree() for allocations that always use vcalloc()
   or __vcalloc().
 
 * Remove .change_pte() MMU notifier - the changes to non-KVM code are
   small and Andrew Morton asked that I also take those through the KVM
   tree.  The callback was only ever implemented by KVM (which was also the
   original user of MMU notifiers) but it had been nonfunctional ever since
   calls to set_pte_at_notify were wrapped with invalidate_range_start
   and invalidate_range_end... in 2012.
 
 Selftests:
 
 * Enhance the demand paging test to allow for better reporting and stressing
   of UFFD performance.
 
 * Convert the steal time test to generate TAP-friendly output.
 
 * Fix a flaky false positive in the xen_shinfo_test due to comparing elapsed
   time across two different clock domains.
 
 * Skip the MONITOR/MWAIT test if the host doesn't actually support MWAIT.
 
 * Avoid unnecessary use of "sudo" in the NX hugepage test wrapper shell
   script, to play nice with running in a minimal userspace environment.
 
 * Allow skipping the RSEQ test's sanity check that the vCPU was able to
   complete a reasonable number of KVM_RUNs, as the assert can fail on a
   completely valid setup.  If the test is run on a large-ish system that is
   otherwise idle, and the test isn't affined to a low-ish number of CPUs, the
   vCPU task can be repeatedly migrated to CPUs that are in deep sleep states,
   which results in the vCPU having very little net runtime before the next
   migration due to high wakeup latencies.
 
 * Define _GNU_SOURCE for all selftests to fix a warning that was introduced by
   a change to kselftest_harness.h late in the 6.9 cycle, and because forcing
   every test to #define _GNU_SOURCE is painful.
 
 * Provide a global pseudo-RNG instance for all tests, so that library code can
   generate random, but determinstic numbers.
 
 * Use the global pRNG to randomly force emulation of select writes from guest
   code on x86, e.g. to help validate KVM's emulation of locked accesses.
 
 * Allocate and initialize x86's GDT, IDT, TSS, segments, and default exception
   handlers at VM creation, instead of forcing tests to manually trigger the
   related setup.
 
 Documentation:
 
 * Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmZE878UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOukQf+LcvZsWtrC7Wd5K9SQbYXaS4Rk6P6
 JHoQW2d0hUN893J2WibEw+l1J/0vn5JumqHXyZgJ7CbaMtXkWWQTwDSDLuURUKpv
 XNB3Sb17G87NH+s1tOh0tA9h5upbtlHVHvrtIwdbb9+XHgQ6HTL4uk+HdfO/p9fW
 cWBEZAKoWcCIa99Numv3pmq5vdrvBlNggwBugBS8TH69EKMw+V1Vu1SFkIdNDTQk
 NJJ28cohoP3wnwlIHaXSmU4RujipPH3Lm/xupyA5MwmzO713eq2yUqV49jzhD5/I
 MA4Ruvgrdm4wpp89N9lQMyci91u6q7R9iZfMu0tSg2qYI3UPKIdstd8sOA==
 =2lED
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:

   - Move a lot of state that was previously stored on a per vcpu basis
     into a per-CPU area, because it is only pertinent to the host while
     the vcpu is loaded. This results in better state tracking, and a
     smaller vcpu structure.

   - Add full handling of the ERET/ERETAA/ERETAB instructions in nested
     virtualisation. The last two instructions also require emulating
     part of the pointer authentication extension. As a result, the trap
     handling of pointer authentication has been greatly simplified.

   - Turn the global (and not very scalable) LPI translation cache into
     a per-ITS, scalable cache, making non directly injected LPIs much
     cheaper to make visible to the vcpu.

   - A batch of pKVM patches, mostly fixes and cleanups, as the
     upstreaming process seems to be resuming. Fingers crossed!

   - Allocate PPIs and SGIs outside of the vcpu structure, allowing for
     smaller EL2 mapping and some flexibility in implementing more or
     less than 32 private IRQs.

   - Purge stale mpidr_data if a vcpu is created after the MPIDR map has
     been created.

   - Preserve vcpu-specific ID registers across a vcpu reset.

   - Various minor cleanups and improvements.

  LoongArch:

   - Add ParaVirt IPI support

   - Add software breakpoint support

   - Add mmio trace events support

  RISC-V:

   - Support guest breakpoints using ebreak

   - Introduce per-VCPU mp_state_lock and reset_cntx_lock

   - Virtualize SBI PMU snapshot and counter overflow interrupts

   - New selftests for SBI PMU and Guest ebreak

   - Some preparatory work for both TDX and SNP page fault handling.

     This also cleans up the page fault path, so that the priorities of
     various kinds of fauls (private page, no memory, write to read-only
     slot, etc.) are easier to follow.

  x86:

   - Minimize amount of time that shadow PTEs remain in the special
     REMOVED_SPTE state.

     This is a state where the mmu_lock is held for reading but
     concurrent accesses to the PTE have to spin; shortening its use
     allows other vCPUs to repopulate the zapped region while the zapper
     finishes tearing down the old, defunct page tables.

   - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID
     field, which is defined by hardware but left for software use.

     This lets KVM communicate its inability to map GPAs that set bits
     51:48 on hosts without 5-level nested page tables. Guest firmware
     is expected to use the information when mapping BARs; this avoids
     that they end up at a legal, but unmappable, GPA.

   - Fixed a bug where KVM would not reject accesses to MSR that aren't
     supposed to exist given the vCPU model and/or KVM configuration.

   - As usual, a bunch of code cleanups.

  x86 (AMD):

   - Implement a new and improved API to initialize SEV and SEV-ES VMs,
     which will also be extendable to SEV-SNP.

     The new API specifies the desired encryption in KVM_CREATE_VM and
     then separately initializes the VM. The new API also allows
     customizing the desired set of VMSA features; the features affect
     the measurement of the VM's initial state, and therefore enabling
     them cannot be done tout court by the hypervisor.

     While at it, the new API includes two bugfixes that couldn't be
     applied to the old one without a flag day in userspace or without
     affecting the initial measurement. When a SEV-ES VM is created with
     the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are rejected
     once the VMSA has been encrypted. Also, the FPU and AVX state will
     be synchronized and encrypted too.

   - Support for GHCB version 2 as applicable to SEV-ES guests.

     This, once more, is only accessible when using the new
     KVM_SEV_INIT2 flow for initialization of SEV-ES VMs.

  x86 (Intel):

   - An initial bunch of prerequisite patches for Intel TDX were merged.

     They generally don't do anything interesting. The only somewhat
     user visible change is a new debugging mode that checks that KVM's
     MMU never triggers a #VE virtualization exception in the guest.

   - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig
     VM-Exit to L1, as per the SDM.

  Generic:

   - Use vfree() instead of kvfree() for allocations that always use
     vcalloc() or __vcalloc().

   - Remove .change_pte() MMU notifier - the changes to non-KVM code are
     small and Andrew Morton asked that I also take those through the
     KVM tree.

     The callback was only ever implemented by KVM (which was also the
     original user of MMU notifiers) but it had been nonfunctional ever
     since calls to set_pte_at_notify were wrapped with
     invalidate_range_start and invalidate_range_end... in 2012.

  Selftests:

   - Enhance the demand paging test to allow for better reporting and
     stressing of UFFD performance.

   - Convert the steal time test to generate TAP-friendly output.

   - Fix a flaky false positive in the xen_shinfo_test due to comparing
     elapsed time across two different clock domains.

   - Skip the MONITOR/MWAIT test if the host doesn't actually support
     MWAIT.

   - Avoid unnecessary use of "sudo" in the NX hugepage test wrapper
     shell script, to play nice with running in a minimal userspace
     environment.

   - Allow skipping the RSEQ test's sanity check that the vCPU was able
     to complete a reasonable number of KVM_RUNs, as the assert can fail
     on a completely valid setup.

     If the test is run on a large-ish system that is otherwise idle,
     and the test isn't affined to a low-ish number of CPUs, the vCPU
     task can be repeatedly migrated to CPUs that are in deep sleep
     states, which results in the vCPU having very little net runtime
     before the next migration due to high wakeup latencies.

   - Define _GNU_SOURCE for all selftests to fix a warning that was
     introduced by a change to kselftest_harness.h late in the 6.9
     cycle, and because forcing every test to #define _GNU_SOURCE is
     painful.

   - Provide a global pseudo-RNG instance for all tests, so that library
     code can generate random, but determinstic numbers.

   - Use the global pRNG to randomly force emulation of select writes
     from guest code on x86, e.g. to help validate KVM's emulation of
     locked accesses.

   - Allocate and initialize x86's GDT, IDT, TSS, segments, and default
     exception handlers at VM creation, instead of forcing tests to
     manually trigger the related setup.

  Documentation:

   - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (225 commits)
  selftests/kvm: remove dead file
  KVM: selftests: arm64: Test vCPU-scoped feature ID registers
  KVM: selftests: arm64: Test that feature ID regs survive a reset
  KVM: selftests: arm64: Store expected register value in set_id_regs
  KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope
  KVM: arm64: Only reset vCPU-scoped feature ID regs once
  KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs()
  KVM: arm64: Rename is_id_reg() to imply VM scope
  KVM: arm64: Destroy mpidr_data for 'late' vCPU creation
  KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support
  KVM: arm64: Fix hvhe/nvhe early alias parsing
  KVM: SEV: Allow per-guest configuration of GHCB protocol version
  KVM: SEV: Add GHCB handling for termination requests
  KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests
  KVM: SEV: Add support to handle AP reset MSR protocol
  KVM: x86: Explicitly zero kvm_caps during vendor module load
  KVM: x86: Fully re-initialize supported_mce_cap on vendor module load
  KVM: x86: Fully re-initialize supported_vm_types on vendor module load
  KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns
  KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values
  ...
2024-05-15 14:46:43 -07:00
Linus Torvalds
a49468240e Modules changes for v6.10-rc1
Finally something fun. Mike Rapoport does some cleanup to allow us to
 take out module_alloc() out of modules into a new paint shedded execmem_alloc()
 and execmem_free() so to make emphasis these helpers are actually used outside
 of modules. It starts with a no-functional changes API rename / placeholders
 to then allow architectures to define their requirements into a new shiny
 struct execmem_info with ranges, and requirements for those ranges. Archs
 now can intitialize this execmem_info as the last part of mm_core_init() if
 they have to diverge from the norm. Each range is a known type clearly
 articulated and spelled out in enum execmem_type.
 
 Although a lot of this is major cleanup and prep work for future enhancements an
 immediate clear gain is we get to enable KPROBES without MODULES now. That is
 ultimately what motiviated to pick this work up again, now with smaller goal as
 concrete stepping stone.
 
 This has been sitting on linux-next for a little less than a month, a few issues
 were found already and fixed, in particular an odd mips boot issue. Arch folks
 reviewed the code too. This is ready for wider exposure and testing.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmZDHfMSHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinfIwP/iFsr89v9BjWdRTqzufuHwjOxvFymWxU
 BbEpOppRny3CckDU9ag9hLIlUaSL1Bg56Zb+znzp5stKOoiQYMDBvjSYdfybPxW2
 mRS6SClMF1ubWbzdysdp5Ld9u8T0MQPCLX+P2pKhZRGi0wjkBf5WEkTje+muJKI3
 4vYkXS7bNhuTwRQ+EGfze4+AeleGdQJKDWFY00TW9mZTTBADjfHyYU5o0m9ijf5l
 3V/weUznODvjVJStbIF7wEQ845Ae02LN1zXfsloIOuBMhcMju+x8IjPgPbD0KhX2
 yA48q7mVWkirYp0L5GSQchtqV1GBiP0NK1xXWEpyx6EqQZ4RJCsQhlhjijoExYBR
 ylP4bqiGVuE3IN075X0OzGCnmOStuzwssfDmug0sMAZH/MvmOQ21WzZdet2nLMas
 wwJArHqZsBI9BnBlvH9ZM4Y9f1zC7iR1wULaNGwXLPx34X9PIch8Yk+RElP1kMFQ
 +YrjOuWPjl63pmSkrkk+Pe2eesMPcPB41M6Q2iCjDlp0iBp63LIx2XISUbTf0ljM
 EsI4ZQseYpx+BmC7AuQfmXvEOjuXII9z072/artVWcB2u/87ixIprnqZVhcs/spy
 73DnXB4ufor2PCCC5Xrb/6kT6G+PzF3VwTbHQ1D+fYZ5n2qdyG+LKxgXbtxsRVTp
 oUg+Z/AJaCMt
 =Nsg4
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull modules updates from Luis Chamberlain:
 "Finally something fun. Mike Rapoport does some cleanup to allow us to
  take out module_alloc() out of modules into a new paint shedded
  execmem_alloc() and execmem_free() so to make emphasis these helpers
  are actually used outside of modules.

  It starts with a non-functional changes API rename / placeholders to
  then allow architectures to define their requirements into a new shiny
  struct execmem_info with ranges, and requirements for those ranges.

  Archs now can intitialize this execmem_info as the last part of
  mm_core_init() if they have to diverge from the norm. Each range is a
  known type clearly articulated and spelled out in enum execmem_type.

  Although a lot of this is major cleanup and prep work for future
  enhancements an immediate clear gain is we get to enable KPROBES
  without MODULES now. That is ultimately what motiviated to pick this
  work up again, now with smaller goal as concrete stepping stone"

* tag 'modules-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of
  kprobes: remove dependency on CONFIG_MODULES
  powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate
  x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
  arch: make execmem setup available regardless of CONFIG_MODULES
  powerpc: extend execmem_params for kprobes allocations
  arm64: extend execmem_info for generated code allocations
  riscv: extend execmem_params for generated code allocations
  mm/execmem, arch: convert remaining overrides of module_alloc to execmem
  mm/execmem, arch: convert simple overrides of module_alloc to execmem
  mm: introduce execmem_alloc() and execmem_free()
  module: make module_memory_{alloc,free} more self-contained
  sparc: simplify module_alloc()
  nios2: define virtual address space for modules
  mips: module: rename MODULE_START to MODULES_VADDR
  arm64: module: remove unneeded call to kasan_alloc_module_shadow()
  kallsyms: replace deprecated strncpy with strscpy
  module: allow UNUSED_KSYMS_WHITELIST to be relative against objtree.
2024-05-15 14:05:08 -07:00
Jani Nikula
d2c4b1db1c drm/i915/pciids: don't include RPL-U PCI IDs in RPL-P
It's confusing for INTEL_RPLP_IDS() to include INTEL_RPLU_IDS(). Even if
we treat them the same elsewhere, the lists of PCI IDs should not.

Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patchwork.freedesktop.org/patch/msgid/28fe0910efb93a28c400728af14beff015667f42.1715340032.git.jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2024-05-15 19:04:08 +03:00
Jani Nikula
7858cc0b55 drm/i915/pciids: remove 12 from INTEL_TGL_IDS()
Most other PCI ID macros do not encode the gen in the name. Follow suit
for TGL.

Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patchwork.freedesktop.org/patch/msgid/044a5c553dc4564431bbef197d5e2dd085624fc2.1715340032.git.jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2024-05-15 19:04:07 +03:00
Jani Nikula
bfbda47227 drm/i915/pciids: remove 11 from INTEL_ICL_IDS()
Most other PCI ID macros do not encode the gen in the name. Follow suit
for ICL.

Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patchwork.freedesktop.org/patch/msgid/36973674bf333dfdd7cd32ae656754bfa150022b.1715340032.git.jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2024-05-15 19:04:07 +03:00
Jani Nikula
aa3d586e16 drm/i915/pciids: don't include WHL/CML PCI IDs in CFL
It's confusing for INTEL_CFL_IDS() to include all WHL and CML PCI
IDs. Even if we treat them the same in a lot of places, CML is a
platform of its own, and the lists of PCI IDs should not conflate them.

Largely go by the idea that if a platform has a name, group its PCI IDs
together.

That said, AML is special, having both KBL and CFL variants. Leave that
alone.

v2: Also split out WHL not just CML (Rodrigo)

Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patchwork.freedesktop.org/patch/msgid/7cca91dc78ed2b5982f14e400f03a1704645e475.1715340032.git.jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2024-05-15 19:04:07 +03:00
Jani Nikula
5c8c22adc8 drm/i915/pciids: add INTEL_IVB_IDS()
Add INTEL_IVB_IDS() to identify all IVBs except IVB Q transcode.

Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patchwork.freedesktop.org/patch/msgid/ed89a25b2c6bce318fe59e883d18b62d9453196b.1715340032.git.jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2024-05-15 19:04:07 +03:00
Jani Nikula
7b43a37348 drm/i915/pciids: add INTEL_SNB_IDS()
Add INTEL_SNB_IDS() to identify all SNBs.

Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patchwork.freedesktop.org/patch/msgid/ffcb2d954ad9bca78ccd39836dc0a3dc7c6c0253.1715340032.git.jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2024-05-15 19:04:07 +03:00
Jani Nikula
41c0f8a36f drm/i915/pciids: add INTEL_ILK_IDS(), use acronym
Most other PCI ID macros use platform acronyms. Follow suit for ILK. Add
INTEL_ILK_IDS() to identify all ILKs.

Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patchwork.freedesktop.org/patch/msgid/27ada56363cfa6a5b093cb31908a4b89aa912621.1715340032.git.jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2024-05-15 19:04:07 +03:00
Jani Nikula
432ed92bfb drm/i915/pciids: add INTEL_PNV_IDS(), use acronym
Most other PCI ID macros use platform acronyms. Follow suit for PNV. Add
INTEL_PNV_IDS() to identify all PNVs.

Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patchwork.freedesktop.org/patch/msgid/5f9b34a2cd388244be03263a5147776bfe64d5ac.1715340032.git.jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2024-05-15 19:04:07 +03:00
Linus Torvalds
9776dd3609 X86 interrupt handling update:
Support for posted interrupts on bare metal
 
     Posted interrupts is a virtualization feature which allows to inject
     interrupts directly into a guest without host interaction. The VT-d
     interrupt remapping hardware sets the bit which corresponds to the
     interrupt vector in a vector bitmap which is either used to inject the
     interrupt directly into the guest via a virtualized APIC or in case
     that the guest is scheduled out provides a host side notification
     interrupt which informs the host that an interrupt has been marked
     pending in the bitmap.
 
     This can be utilized on bare metal for scenarios where multiple
     devices, e.g. NVME storage, raise interrupts with a high frequency.  In
     the default mode these interrupts are handles independently and
     therefore require a full roundtrip of interrupt entry/exit.
 
     Utilizing posted interrupts this roundtrip overhead can be avoided by
     coalescing these interrupt entries to a single entry for the posted
     interrupt notification. The notification interrupt then demultiplexes
     the pending bits in a memory based bitmap and invokes the corresponding
     device specific handlers.
 
     Depending on the usage scenario and device utilization throughput
     improvements between 10% and 130% have been measured.
 
     As this is only relevant for high end servers with multiple device
     queues per CPU attached and counterproductive for situations where
     interrupts are arriving at distinct times, the functionality is opt-in
     via a kernel command line parameter.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmZBGUITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYod3xD/98Xa4qZN7eceyyGUhgXnPLOKQzGQ7k
 7cmhsoAYjABeXLvuAvtKePL7ky7OPcqVW2E/g0+jdZuRDkRDbnVkM7CDMRTyL0/b
 BZLhVAXyANKjK79a5WvjL0zDasYQRQ16MQJ6TPa++mX0KhZSI7KvXWIqPWov5i02
 n8UbPUraH5bJi3qGKm6u4n2261Be1gtDag0ZjmGma45/3wsn3bWPoB7iPK6qxmq3
 Q7VARPXAcRp5wYACk6mCOM1dOXMUV9CgI5AUk92xGfXi4RAdsFeNSzeQWn9jHWOf
 CYbbJjNl4QmGP4IWmy6/Up4vIiEhUCOT2DmHsygrQTs/G+nPnMAe1qUuDuECiofj
 iToBL3hn1dHG8uINKOB81MJ33QEGWyYWY8PxxoR3LMTrhVpfChUlJO8T2XK5nu+i
 2EA6XLtJiHacpXhn8HQam0aQN9nvi4wT1LzpkhmboyCQuXTiXuJNbyLIh5TdFa1n
 DzqAGhRB67z6eGevJJ7kTI1X71W0poMwYlzCU8itnLOK8np0zFQ8bgwwqm9opZGq
 V2eSDuZAbqXVolzmaF8NSfM+b/R9URQtWsZ8cEc+/OdVV4HR4zfeqejy60TuV/4G
 39CTnn8vPBKcRSS6CAcJhKPhzIvHw4EMhoU4DJKBtwBdM58RyP9NY1wF3rIPJIGh
 sl61JBuYYuIZXg==
 =bqLN
 -----END PGP SIGNATURE-----

Merge tag 'x86-irq-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 interrupt handling updates from Thomas Gleixner:
 "Add support for posted interrupts on bare metal.

  Posted interrupts is a virtualization feature which allows to inject
  interrupts directly into a guest without host interaction. The VT-d
  interrupt remapping hardware sets the bit which corresponds to the
  interrupt vector in a vector bitmap which is either used to inject the
  interrupt directly into the guest via a virtualized APIC or in case
  that the guest is scheduled out provides a host side notification
  interrupt which informs the host that an interrupt has been marked
  pending in the bitmap.

  This can be utilized on bare metal for scenarios where multiple
  devices, e.g. NVME storage, raise interrupts with a high frequency. In
  the default mode these interrupts are handles independently and
  therefore require a full roundtrip of interrupt entry/exit.

  Utilizing posted interrupts this roundtrip overhead can be avoided by
  coalescing these interrupt entries to a single entry for the posted
  interrupt notification. The notification interrupt then demultiplexes
  the pending bits in a memory based bitmap and invokes the
  corresponding device specific handlers.

  Depending on the usage scenario and device utilization throughput
  improvements between 10% and 130% have been measured.

  As this is only relevant for high end servers with multiple device
  queues per CPU attached and counterproductive for situations where
  interrupts are arriving at distinct times, the functionality is opt-in
  via a kernel command line parameter"

* tag 'x86-irq-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/irq: Use existing helper for pending vector check
  iommu/vt-d: Enable posted mode for device MSIs
  iommu/vt-d: Make posted MSI an opt-in command line option
  x86/irq: Extend checks for pending vectors to posted interrupts
  x86/irq: Factor out common code for checking pending interrupts
  x86/irq: Install posted MSI notification handler
  x86/irq: Factor out handler invocation from common_interrupt()
  x86/irq: Set up per host CPU posted interrupt descriptors
  x86/irq: Reserve a per CPU IDT vector for posted MSIs
  x86/irq: Add a Kconfig option for posted MSI
  x86/irq: Remove bitfields in posted interrupt descriptor
  x86/irq: Unionize PID.PIR for 64bit access w/o casting
  KVM: VMX: Move posted interrupt descriptor out of VMX code
2024-05-14 10:01:29 -07:00
Linus Torvalds
a9d9ce3fbc A single update for the TSC synchronixation sanity checks:
The sad state of TSC being notoriously non-sychronized for several
   decades caused the kernel to grow quite rigorous sanity checks to detect
   whether the TSC is valid to be used for timekeeping.
 
   The TSC ADJUST MSR provides the offset between the initial TSC value
   after hardware reset and later modifications. This allows to detect cases
   where firmware tampers with the TSC and also allows to correct the
   firmware induced damage by resetting the offset in a controlled way.
 
   The universal correct rule is that the TSC ADJUST value has to be
   consistent within all CPUs of a socket.
 
   The kernel further assumes that the TSC offset should be consistent
   between sockets. That's not really correct as systems with a huge number
   of sockets are not architecurally guaranteed to reset the per socket TSC
   base synchronously.
 
   In case that the per socket offset is not consistent the kernel resets it
   to the offset of the boot CPU and then does a synchronization check which
   corrects for the inter socket delays.
 
   That works most of the time, but it is suboptimal as the firmware has
   eventually better information about the per socket offset and on sane
   systems that offset should just work in the validation checks.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmZCDA4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodLcD/4hZmCiEL97M+qb0rjmscKmJq/EOjxm
 pRgT/+vH2MakYh2xIjLSeMtRB5eFdfz+ZspJGEt017yW5l+saZ6edrq+g2qi1EfJ
 TDGbDGK9T6HR0WDplFLqLXolKS2lcvHbolATu/t5ZQmrRmuGuS+6t6eAoI9QcaWQ
 DaqMtdSQNq8B5hopaZtaJSTTkznD/CtKyMCvVKGxXE2gH6d6UmezR72f6oruzgg9
 WXLDt1sPxg1zl1rS1GdeRa4xXrsLxr8THZ53Nr5pPyZV6FCSOZQtcurwhsIYcMO7
 b3m+LU04XGURK196c0Uej8UwRCAHpD50aS91GcclXsR4wTKyFatVz9mpwZOK/F/L
 Pw+5O6xUeyIAKMr6YJl6KusPhhwDcYm+ETuTzMmWMyJEh91lLYHyCKniE3wsHpzT
 L7er6HWOwBaPlOnvuOhl4rzqr0F+9xLmWWq6s+85HlvlgfV1NjEhqi7dn/ZO1jdx
 Im3Xq8sq04tIMNjLPSTkovXmvU2us45yQk2HthWSM7FQ+vpzPDgdp1sVFsLK19cu
 +t0jI01qSUBzYvcM28CDX99hEI2L8Oo/nC1/Vq/4MB+KkEPCUMKr0ZA2nTKHL8lx
 +lOGdnzokr6DsbRdtfqWKywc/is2r1OXrOSBR23SwwK1XQ2aRMi/0F96Q0cR9lzt
 6utxZRFhc7BtpA==
 =W1qL
 -----END PGP SIGNATURE-----

Merge tag 'x86-timers-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 timers update from Thomas Gleixner:
 "A single update for the TSC synchronixation sanity checks:

  The sad state of TSC being notoriously non-sychronized for several
  decades caused the kernel to grow quite rigorous sanity checks to
  detect whether the TSC is valid to be used for timekeeping.

  The TSC ADJUST MSR provides the offset between the initial TSC value
  after hardware reset and later modifications. This allows to detect
  cases where firmware tampers with the TSC and also allows to correct
  the firmware induced damage by resetting the offset in a controlled
  way.

  The universal correct rule is that the TSC ADJUST value has to be
  consistent within all CPUs of a socket.

  The kernel further assumes that the TSC offset should be consistent
  between sockets. That's not really correct as systems with a huge
  number of sockets are not architecurally guaranteed to reset the per
  socket TSC base synchronously.

  In case that the per socket offset is not consistent the kernel resets
  it to the offset of the boot CPU and then does a synchronization check
  which corrects for the inter socket delays.

  That works most of the time, but it is suboptimal as the firmware has
  eventually better information about the per socket offset and on sane
  systems that offset should just work in the validation checks"

* tag 'x86-timers-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tsc: Trust initial offset in architectural TSC-adjust MSRs
2024-05-14 09:41:26 -07:00
Linus Torvalds
61deafa9ec Improve data types to fix Coccinelle division warnings
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmZCbbIACgkQaDWVMHDJ
 krC5Dw/9GsRdLN4lGkUhJJ9p1iNbB3qXiOgNIS9ROsh0VHI0FXf36XTsuoWE5+nT
 eXb/4s0oWApqs2YbriquILfk7Q7fj0/fOYIJJ182sB/WlXBWNIjoR42rNPqDbxA8
 yJVoY+RuJ1TClMYMOt6vN3hLbcOd7TU61mXmZexvI48p4GsKKYjqh9LHcaLuTUFU
 Ibo4tJKxJQRCvY1pX9aJ6ICkMBpewFTrNcnm5iDnGGG+7f73Pq7+9ZdhWjPPt3K5
 u/BaQtjTvPB7PlwiKV29c37Ud6FMv/CwOI9rdwRrJRHog6jzqcJd5IV2XlfaYoXL
 se0yOptSP96MljguWFJ2yIynWtFYJqsmrHOSvytT9VxM10izUqEhU+qrkWzP1sqR
 NN701cH4FUZV9odYx5kwhquAVecgpW//W47hp4Ghq3BBQvX8Y158Blqem+VDZzYU
 6fxz/IxX1SURN/yEAompReMdoBdI+vFwwM3ZKCLMY3ZfUaDaZiWGXjhSBHzzGSVD
 tg3Clq7AXHcVbRe8h1d0CTh4O8XRLMJDcUM5rq06agrDuJf4F8tXqNZoYSku4Haw
 dd+U61bWAOFwv7MkePNjg8xglBSV3gyS4uoCxolZvScHz/nxLD+wLWnb6Z5oMlLn
 ZroyXUMM2FMhoK4ODp3l1OzSRMj780QCvJCRW6SeJ4u2wuCoXdE=
 =4t5X
 -----END PGP SIGNATURE-----

Merge tag 'x86_apic_for_6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 APIC update from Dave Hansen:
 "Coccinelle complained about some 64-bit divisions, but the divisor was
  really just a 32-bit value being stored as 'unsigned long'.

  Fixing the types fixes the warning"

* tag 'x86_apic_for_6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic: Improve data types to fix Coccinelle warnings
2024-05-14 09:24:14 -07:00
Linus Torvalds
964bbdfdf0 - Small cleanups and improvements
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmZCcL8ACgkQEsHwGGHe
 VUo1NxAAuKdh6dT2I2qMEsrYcG505DNI13AZ8Zp9w+3f7ehVzi5EAETx0c9JCh6i
 brTlkAerydTGIErCE/agNvbDHPxJDUjMPvOsUHCiuvcChbJSEsZ5KpmbHX2rLQCq
 znRS+51PmoRg9EmscqW898qi7jWklgy2ZaeFyZGNx7stlcjc/C4pgfMPt6UJqIiO
 WeqSTSGeAKq/wsSpx0Fm3Ize6HZGAGTlkHSKE1XllvuDigDhPnBa8O1g0iyoyFHl
 YHOMHSUZ5G/hqtOzCPMnAvLPEta8EcJZrhGYhQguDNk02a3LHfkitVPC2FeJk/Zy
 jp2KESkHjWiEvkw3myazpONYY8Z6Fw5GZWvR5EBBhgv285viNUQBRoch4xdKjHCb
 230LVVvdzZ8iOUx0Im6f9Ec6oYB9hXxdFr7YnkPPBPf3VU22H3i1meE294pkZUbq
 2wFAWlIi8CbbAPNEqmPjVEyxGqsc+ZJt7/yge3iiJqcQdubMVCX8drfAhcI84QjO
 mmcwcQ3BT3ugsKaKSQuUFUdqBrHKgcQ2aMOeyMUkBs1UANZlOBbRaTdTubPzL5cj
 G4pJcH/dRHSktWTn01SHDpxIhbSDdG7c4jHOzIio86vn0ahbrCSAzp6Y9nP4YkZm
 jdHZAI6yZSA3FF3vtBpkTatPOYRb9lgFMNDoxTVr62F7UfgkBS8=
 =Z3p4
 -----END PGP SIGNATURE-----

Merge tag 'x86_sev_for_v6.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SEV updates from Borislav Petkov:

 - Small cleanups and improvements

* tag 'x86_sev_for_v6.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev: Make the VMPL0 checking more straight forward
  x86/sev: Rename snp_init() in boot/compressed/sev.c
  x86/sev: Shorten struct name snp_secrets_page_layout to snp_secrets_page
2024-05-14 09:18:52 -07:00
Linus Torvalds
a1907ccdfe - Fix a clang-15 build warning and other cleanups
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmZCZ8cACgkQEsHwGGHe
 VUqnKxAAp4wkGy6989NUjBPScXneCbT5kQv7bngHI7JMgVTwhd+PQ75gCc3dAJsX
 BhOYfDdzw3ZONe4isGwnMHurtFh3Eh7xktgKr3DhGxGZthAnnp6Kyw3ODo+hbXyl
 B9aiVd6DaxLdIMvNBM5Rft3pPDrW62XnYVrjpF9Ta36jYN88kzUa263sQkenSY88
 moI0oXlC1YDHr9mG6VHCqDSj4rLZa76brOGyL60dhZ+L59rQ6rOCbiZJOKlDxzgM
 gdis7QT+ZdbjPFdb8Yv2JsGMtLS0aquWJkVwa1GdC5iDe28xfusyfi/4Dvq4ZRLF
 DIBAjdczeClwhCm05gxVI3DA0hExgGrq27foGa2MIwks9mtIOYNYC1d5sdVVHYr3
 WV/5bsZNbSqNUQgpqYXM8VVhRPkTzx1JWBdG/UXOouZ7Ej3wJ0ls7cSzP74UehIo
 gN8AxtmTXjREpvRGUsiOqPEanbqudHqjxcvugdqZ8jd0jrqNG/bRtPwfMnUgNzJg
 srFq38jSjy7vIFNDFWgls4tvFSVuvPR8OFqYjrSWRyroDDgqFTqVhieCGpAdt0+6
 d+fN9yqSk7ai1GT0iPa+9yQsZXwY/e3sVYQsvO7bUnt6ulHCWuFvNXilryGHfcYO
 R4Ud++d/ckj1TBrHG6Ic9aPb5IWrMWfq+JA0gaxpChusd6Zf+e8=
 =K7vs
 -----END PGP SIGNATURE-----

Merge tag 'x86_microcode_for_v6.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 microcode loader updates from Borislav Petkov:

 - Fix a clang-15 build warning and other cleanups

* tag 'x86_microcode_for_v6.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode: Remove unused struct cpu_info_ctx
  x86/microcode/AMD: Remove unused PATCH_MAX_SIZE macro
  x86/microcode/AMD: Avoid -Wformat warning with clang-15
2024-05-14 09:09:32 -07:00
Linus Torvalds
5186ba3323 - Add a tracepoint to read out LLC occupancy of resource monitor IDs with the
goal of freeing them sooner rather than later
 
 - Other code improvements and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmZCW64ACgkQEsHwGGHe
 VUq7Dw//ZM+4OX3l0P6NTv4WJ9UDn3IltRm+D61J6hYw19iETlGGAel5T6DI1LPT
 GYAoOazd9ouNjwU0YhOn6Se3SVWKxLLOGH+/RIJtqwiCwTy2nGfSPHw3pnTxwtK4
 pRttm6fPQWIUuQyDrzmbJGP+va4YDtVtDyBkxNlk8pQTvF7X0QCcu6GjNW9r6+Md
 92J2AwzeoDAeIc16vKHru4S3wBCqdP7xZ9GqBb8wrNxBy8taSN4wE9cuwDjev5Yw
 ANGeREv3odWvYQ7p0fQVY2j25ddjGNE4qEEJ1iAIJDh9bIHURAF3s1aSPqcMyHyF
 eB8NNf7ZjQhycmBX9ci6CHYOKc3i25nWiMoaC1iWZKQEviTt3OCEeKr20mjAfKOz
 wlUs55iGrHkbS10kB91Z6lOMDNiIu+x4kuiF5y1W73SDfkY+pYv8zLQL9rhNpYnd
 BEcOF+YaJuhi4Y7GUDb0fWdIUZcfGItSJyNbR8jaznJKcP2pjznSUKqM/AphZyuU
 bVsVsYkYQiE2vl4xYdmyHnxsfnpuMTVNuPpIonyp1mIa77iDVeiwYabkau+pz8L9
 Rv1jhUmYVfawxKiRc6tOQAsxOtAiqrm2GBpZlisw8KtfzZaPC9h7U7bXC4up1TtH
 nZVt+qV/8M9nc3Trocb+d8djbrv+Uqh4EHPTBbFEfW6qsMFsXhk=
 =8EKr
 -----END PGP SIGNATURE-----

Merge tag 'x86_cache_for_v6.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 resource control updates from Borislav Petkov:

 - Add a tracepoint to read out LLC occupancy of resource monitor IDs
   with the goal of freeing them sooner rather than later

 - Other code improvements and cleanups

* tag 'x86_cache_for_v6.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Add tracepoint for llc_occupancy tracking
  x86/resctrl: Rename pseudo_lock_event.h to trace.h
  x86/resctrl: Simplify call convention for MSR update functions
  x86/resctrl: Pass domain to target CPU
2024-05-14 09:04:37 -07:00
Linus Torvalds
25c7cb05fa - Switch the in-place instruction patching which lead to at least one weird bug
with 32-bit guests, seeing stale instruction bytes, to one working on
   a buffer, like the rest of the alternatives code does
 
 - Add a long overdue check to the X86_FEATURE flag modifying functions to warn
   when former get changed in a non-compatible way after alternatives have been
   patched because those changes will be already wrong
 
 - Other cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmZCT+kACgkQEsHwGGHe
 VUraaxAAl6pwAVD19vK6VtTRxgKGW8GBaGjdtSBDSFP3dhyvqd+xC1Vez5HKShMz
 Lmg81ZsoeAruGWDo+Av0twgGEd5OagTMHdrJsfWVQlaVXNE1IPm4tWuic4Llh+0X
 LSZYrBXpQH7/bsOHFTdvun8NdHVb5Ew8pvYCB06lPrlU7sjBujGsFzyQ1R6xNWmr
 IErYqUVtEqexNS9lo45N+1Q5Uzdb9eNnPqMDA0ZbvJEytXWHlqW3ukOjRyNls1BS
 HbgIqOk59xuHII/nw+GgsXant2TvJQYFJPC7CculJWp7oLZITn03rj0AMKOS7cm+
 zOKDbnvQogw4mf/eVc1X6RbIq+9O5eZcBskIiRVGpFP294Axt8gEwmFcfBI2UsUF
 t73Z2ELHuo/iHc02Gd2y+uV98NEmluX+g4efb5ILpdMJiP9J2rl6TA0PIYUx8U3T
 794We38nk1YCSZnXZOpso7y+m/lRPocALWHQdtw9Frn8UNzgjidpef8vT2O+Trp5
 AYv5ucnChjcUQycMIBGFbqppwjs9vb2y1L6mh4mCB6WrxeAitUw0hjvuYQvKL+wB
 0gYqOrL4Z+swYKMC+GAE5HCcQayzsURbjnyzcM4nhKGSiwpaeYKHPqAPPq+oyH18
 xMc8KI3n791oeZBUhA5o1ECw5vX3FcgUfAmlYfhMTnqvo+UQALM=
 =gaPh
 -----END PGP SIGNATURE-----

Merge tag 'x86_alternatives_for_v6.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 asm alternatives updates from Borislav Petkov:

 - Switch the in-place instruction patching which lead to at least one
   weird bug with 32-bit guests, seeing stale instruction bytes, to one
   working on a buffer, like the rest of the alternatives code does

 - Add a long overdue check to the X86_FEATURE flag modifying functions
   to warn when former get changed in a non-compatible way after
   alternatives have been patched because those changes will be already
   wrong

 - Other cleanups

* tag 'x86_alternatives_for_v6.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/alternatives: Remove alternative_input_2()
  x86/alternatives: Sort local vars in apply_alternatives()
  x86/alternatives: Optimize optimize_nops()
  x86/alternatives: Get rid of __optimize_nops()
  x86/alternatives: Use a temporary buffer when optimizing NOPs
  x86/alternatives: Catch late X86_FEATURE modifiers
2024-05-14 08:51:37 -07:00
Linus Torvalds
b4864f6565 - Change the fixed-size buffer for MCE records to a dynamically sized
one based on the number of CPUs present in the system
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmZB9z8ACgkQEsHwGGHe
 VUrOxA/+Mh3eCUMzgqzXRf5PVdDUQO2BBrzQEriWU0PwPjOdqmBtx6l5hlfwAl/Q
 4I200RAjCu36V4BN65xhtkdQ20mKtAFXfYlqCSp4C3Q3dxSLt8P/7nwWgDqZ/ry+
 IPuB4fs4GPGoolrV7wKn2IYJLPtn44Ef9kEUH2j+Za2f6GYEdD4j0IWsD1+VZwGL
 jatFbmPkZQXfYwPOvN2cfF5EMq84XNo3rM82++JcwvdbrbkqO2mT4OWZ6pWylD0x
 tiewi3HbVKDDUItv/bTj9QtPqbYfbENHroz3gdwo066F2OZiEA5cn7lPhL05DBYH
 FmmicH2yNKAvZlhP/m6YAz+b6H/nLihPen1wcbe+BzJYKJJgDz87QDWrsqbOiBIr
 1tamd5hVZZ+XHXLQv140BsetwwZhnrO4N4PtwZNXUw8sehreErIKyEsRy6DIXKYf
 nY+Z6NMopyatOnAKd2vhW2wjiAFhQvkKmM4Dlw/VEzTbg7xoXruwKCiulxNrmgnX
 eAOHErsv9GF+1ZlnXLoTBo+ctLS1xgDu1GvlXlxGo2Ei2WkHmyzrKVcWZoNXCSgB
 Mnpt3Nuzv1dAmGEnZZjotdbm4kSKn3By7pDeDbhynaSepx0G2T/4tvXiyXkoepnq
 wJ21MATXUOE8Qq5d+D3V4brC7avcqI8vl+tb7Qi7JK0K3Dv2Wd0=
 =enB0
 -----END PGP SIGNATURE-----

Merge tag 'ras_core_for_v6.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS update from Borislav Petkov:

 - Change the fixed-size buffer for MCE records to a dynamically sized
   one based on the number of CPUs present in the system

* tag 'ras_core_for_v6.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Dynamically size space for machine check records
2024-05-14 08:39:42 -07:00
Mike Rapoport (IBM)
14e56fb2ed x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
Dynamic ftrace must allocate memory for code and this was impossible
without CONFIG_MODULES.

With execmem separated from the modules code, execmem_text_alloc() is
available regardless of CONFIG_MODULES.

Remove dependency of dynamic ftrace on CONFIG_MODULES and make
CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:44 -07:00
Mike Rapoport (IBM)
0cc2dc4902 arch: make execmem setup available regardless of CONFIG_MODULES
execmem does not depend on modules, on the contrary modules use
execmem.

To make execmem available when CONFIG_MODULES=n, for instance for
kprobes, split execmem_params initialization out from
arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:44 -07:00
Mike Rapoport (IBM)
223b5e57d0 mm/execmem, arch: convert remaining overrides of module_alloc to execmem
Extend execmem parameters to accommodate more complex overrides of
module_alloc() by architectures.

This includes specification of a fallback range required by arm, arm64
and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for
allocation of KASAN shadow required by s390 and x86 and support for
late initialization of execmem required by arm64.

The core implementation of execmem_alloc() takes care of suppressing
warnings when the initial allocation fails but there is a fallback range
defined.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Tested-by: Liviu Dudau <liviu@dudau.co.uk>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
Mike Rapoport (IBM)
12af2b83d0 mm: introduce execmem_alloc() and execmem_free()
module_alloc() is used everywhere as a mean to allocate memory for code.

Beside being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF to modules and
puts the burden of code allocation to the modules code.

Several architectures override module_alloc() because of various
constraints where the executable memory can be located and this causes
additional obstacles for improvements of code allocation.

Start splitting code allocation from modules by introducing execmem_alloc()
and execmem_free() APIs.

Initially, execmem_alloc() is a wrapper for module_alloc() and
execmem_free() is a replacement of module_memfree() to allow updating all
call sites to use the new APIs.

Since architectures define different restrictions on placement,
permissions, alignment and other parameters for memory that can be used by
different subsystems that allocate executable memory, execmem_alloc() takes
a type argument, that will be used to identify the calling subsystem and to
allow architectures define parameters for ranges suitable for that
subsystem.

No functional changes.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
Linus Torvalds
a5131c3fdf Enable shadow stacks for x32.
While we normally don't do such feature-enabling on 32-bit
 kernels anymore, this change is small, straightforward & tested on
 upstream glibc.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZByx4RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1htKA/9EXRmI4498QY07EaqkrvGdzDnPnvQgHLv
 JoPvMM73pCY8FyOt0D/fRLeS/JvP0zGnn6HT55LeQkJVcWUIAdDhuniDBSjxS7xd
 BdwHzkfJn2qa6kA1ekXWS7zHY+D1hsJEq9/15gFj/q2JVfo+HyN768fHS6kohdkW
 aNneAbsVOJZNxmKNVXXiC69xhDNVyjFxEJ0xP7rUctjj4GvJRg14pt95//z+YnNB
 qKmmd1/+ul652rZzsFbDjB9PZkkixm8qALFDR7I94UWX3MYknpTcV+n/tFSykQrv
 z3nabF+pTHKSJDrtGVOC4ks+SofK2wwEg4vYC2mfCWtVcZfPoEfEIVum6VbmfW8J
 2sr1hfydTRycA6i90U2IjbnyYCtQsXyzyHGuJI4JplDinHu+GxiQQ9xMU7nmdlA+
 xXazqk8dciMpzPJY8pUv0JXurNFfq/n6BfYTYrBsBeRCm8gcyYFB7fTkJWamowWc
 DhXHOz/MC6BkZhgkoB1/L9i9GgMu9boCJ1vdcnUMBZfqWVlcePlspOtUtabhvF2r
 8NKKLwTtdcgGswrBmVcWZhbwRuc9imK3uAoNlSIEe5jC8rlcp7F5lnpYF2DPFnYn
 VCeGfoQGdJyt8D+9Ag7wm9zseMRekdI8dABJW2ZVAmq810+6PSW4ToONwlqzfL63
 uTcapyAC0qQ=
 =AqG9
 -----END PGP SIGNATURE-----

Merge tag 'x86-shstk-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 shadow stacks from Ingo Molnar:
 "Enable shadow stacks for x32.

  While we normally don't do such feature-enabling for 32-bit anymore,
  this change is small, straightforward & tested on upstream glibc"

* tag 'x86-shstk-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/shstk: Enable shadow stacks for x32
2024-05-13 19:33:23 -07:00
Linus Torvalds
5f487cd829 x86/platform changes for v6.10:
- Improve the DeviceTree (OF) NUMA enumeration code to
    address kernel warnings & mis-mappings on DeviceTree platforms.
 
  - Migrate x86 platform drivers to the .remove_new callback API
 
  - Misc cleanups & fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZByL0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gHaRAAkub9fLBX9s1zxkouv16YJpycqHUmKMSf
 YeLoRGBAXDcdy4b5exgBTFxTKQE17TAsifsjhIKF1ObAslXHDfNNPUhZ41X8M3aw
 m5TyZxjXQiiafWSBLki+XBhYuL/lo+9tqWmmJZqp1wd2mjyViCqUDMRtm1WRRKYt
 ITlIQdAETGY38lU1h/NAFzVYrSq5/XVh6dW/TQ+Sw7PMkheWVInzIh31Z5/EBtBq
 YvmGETGAK7gUK69TbQuqCY3Aq3Q0mC2QAAGFbeHubl7Q13q4lkSdI5D7P0Ffg+Kg
 95Mg94Fo3qShRNEeOyKR5L5WSGRfDJ7Tu3LfhglDqfYkLGJU++ER9BwvSxXkdbXS
 CCcr4yVHz8DIqZXHzpIyOvb3J4QCCUPkIiG46hpTZX+B9g+tSDOq2M9ETJcZNgux
 NA4wSwJLReAo2+eH+lEZv3LwxfVPwr+LHv6KRPlY0cMu/Gfs5qdCf4DMFsJuBhTD
 eLMwGON3ke/FDqTZkTRWGpcR6MIV7g9LWO7SgK1q8TvucmPXFIMq6v4MLrajSuKy
 V5iXl3Bef5IUZ6LqnHUI1HiXpLuYKb+ruBB2h01AZrrqOWVVyNWhT9Cf2EVsw+gB
 CKmzcBuLpFWiyXM/mRIvPHgPzW32uBuhIq2JpbPTqaEUQ6Wj2N8kY7pw1e5w8u5C
 OIB+LqGBiII=
 =IQi+
 -----END PGP SIGNATURE-----

Merge tag 'x86-platform-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 platform updates from Ingo Molnar:

 - Improve the DeviceTree (OF) NUMA enumeration code to address
   kernel warnings & mis-mappings on DeviceTree platforms

 - Migrate x86 platform drivers to the .remove_new callback API

 - Misc cleanups & fixes

* tag 'x86-platform-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/platform/olpc-xo1-sci: Convert to platform remove callback returning void
  x86/platform/olpc-x01-pm: Convert to platform remove callback returning void
  x86/platform/iris: Convert to platform remove callback returning void
  x86/of: Change x86_dtb_parse_smp_config() to static
  x86/of: Map NUMA node to CPUs as per DeviceTree
  x86/of: Set the parse_smp_cfg for all the DeviceTree platforms by default
  x86/hyperv/vtl: Correct x86_init.mpparse.parse_smp_cfg assignment
2024-05-13 19:29:08 -07:00
Linus Torvalds
963795f758 x86/fpu changes for v6.10:
- Fix asm() constraints & modifiers in restore_fpregs_from_fpstate()
 
  - Update comments
 
  - Robustify the free_vm86() definition
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZBw6ARHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1h66w/+MBZiDW9BdUaMArOO3F2epa+E/SFmcaSO
 KxLvs9A606nV+qW2RhHZjYcdl5oOAF0yRyofxbVLluYPt7z8GPUIrVKPHq7BD3Es
 amzaD0Rq00qPa+jwrt8qsOddz2KglAkgYZJcukf5hBZ6/VfiKfDeRG3D7nTyabVp
 sYwM7POmB5dkBrOTdmp6ikliNvmp8tfh6AiSM3NgQ8uq0YN7tm7f5iFSulBfrRN3
 Y2x6LEXOuSSEzEIO/7ju4maE6JunqWMkRWWb5yyUZKZKG69dunp4LZr5kAfi/7jV
 SZRO16YOZOsl5XBp4QlDv2p5xM/XD3uM8UhUSlMYL0+6i/wpEMnJpcSaffLv5wNG
 I6RxG8d/G1hpsUoW8ClLTWfppL450z31lmwatLa1ctnuGppcx3oxEA+vBTo3I89c
 fVMHvDvTs7iau2K9mmpZzhLLglnf7ZDTclyVsPrECQtB+grFHL8DNKea4nn4VInH
 LO9XBbckuM1ZjJt1KzGNWZbpxRBRpnNVjyyYPodD4el9IyglXzcvVNR0SGCtXB+3
 Td7/RBkBmNadefckOJaT1VEGXlKOlOAKtWB+A17jpzCSoaKSzhXw/khbxIG3oYOv
 BRulU9r16rDzuMiDLjfFHpC4BhjDSluDuFS1Xtg+P3PZQR+LDn9msJaWFWL7pexK
 xhs5daRBrqs=
 =PO9v
 -----END PGP SIGNATURE-----

Merge tag 'x86-fpu-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fpu updates from Ingo Molnar:

 - Fix asm() constraints & modifiers in restore_fpregs_from_fpstate()

 - Update comments

 - Robustify the free_vm86() definition

* tag 'x86-fpu-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Update fpu_swap_kvm_fpu() uses in comments as well
  x86/vm86: Make sure the free_vm86(task) definition uses its parameter even in the !CONFIG_VM86 case
  x86/fpu: Fix AMD X86_BUG_FXSAVE_LEAK fixup
2024-05-13 19:00:26 -07:00
Linus Torvalds
ecd83bcbed x86/cpu changes for v6.10:
- Rework the x86 CPU vendor/family/model code: introduce the 'VFM'
    value that is an 8+8+8 bit concatenation of the vendor/family/model
    value, and add macros that work on VFM values. This simplifies the
    addition of new Intel models & families, and simplifies existing
    enumeration & quirk code.
 
  - Add support for the AMD 0x80000026 leaf, to better parse topology
    information.
 
  - Optimize the NUMA allocation layout of more per-CPU data structures
 
  - Improve the workaround for AMD erratum 1386
 
  - Clear TME from /proc/cpuinfo as well, when disabled by the firmware
 
  - Improve x86 self-tests
 
  - Extend the mce_record tracepoint with the ::ppin and ::microcode fields
 
  - Implement recovery for MCE errors in TDX/SEAM non-root mode
 
  - Misc cleanups and fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZBwL0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gfuBAAkfVxMAfXvI4Vn3Em9Pix5zgvOoEshPoI
 Pti8+fqgKAaR/Nn+ZCEUk6nou8E6R0Lyo7yDk4aZ0zGmUwQS0IoRTvj721YojCTS
 Chr7butXH2xkYYQVBiJvKdHVhPBgs6jvExLyRL4WJ6s6zunS86Xka3nVRKD9QqW6
 RpEc83wW9b/oSzxn/Cwzxk9RvXatLL82EMOYPL2B40Lde8EM+zoYsfOwGndGlCB2
 gHpnSL1Jzry5kTeG7rromWWVp6YrDW63R2KO+DB0r7rrrtEyXtoCr7OdxruUijPB
 sSpzN6etRbUuH0ijMbh7EW8KlUkGBx46Y+1eRMeN/qYy0vuwP9v0vP9n/7fXLjvu
 FEI82W07lHjY3OvHh2FzvcHMTWaHVYqwDRLki7ortjtg53F/0l07Cbqxf2zJg+r3
 jIaVCifk4qo6Rq+TvHtGcuDYi36u93UKVcfjQN1K/a2WdzJvpDL63PklzBeTno5s
 7QBSG1FxEbfIXeQaf/AwfjnfzlQhI9ws1F+GuFAP7mGH8vEnDlGhLv5vsnloxcMB
 HnHJE1wOzq6A3ixCFreXccikfsTUgsfmrLExhVs9Er/MsKRsGfSySyFUHA4L/Ygm
 6zqfgYwSJzbn5EnfPmiO1R+tNhlcAi0YENeAOle4HQTeBwqebKl+Zh3zbzpgM2I3
 cppkgnY/HTQ=
 =Zrlk
 -----END PGP SIGNATURE-----

Merge tag 'x86-cpu-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu updates from Ingo Molnar:

 - Rework the x86 CPU vendor/family/model code: introduce the 'VFM'
   value that is an 8+8+8 bit concatenation of the vendor/family/model
   value, and add macros that work on VFM values. This simplifies the
   addition of new Intel models & families, and simplifies existing
   enumeration & quirk code.

 - Add support for the AMD 0x80000026 leaf, to better parse topology
   information

 - Optimize the NUMA allocation layout of more per-CPU data structures

 - Improve the workaround for AMD erratum 1386

 - Clear TME from /proc/cpuinfo as well, when disabled by the firmware

 - Improve x86 self-tests

 - Extend the mce_record tracepoint with the ::ppin and ::microcode fields

 - Implement recovery for MCE errors in TDX/SEAM non-root mode

 - Misc cleanups and fixes

* tag 'x86-cpu-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  x86/mm: Switch to new Intel CPU model defines
  x86/tsc_msr: Switch to new Intel CPU model defines
  x86/tsc: Switch to new Intel CPU model defines
  x86/cpu: Switch to new Intel CPU model defines
  x86/resctrl: Switch to new Intel CPU model defines
  x86/microcode/intel: Switch to new Intel CPU model defines
  x86/mce: Switch to new Intel CPU model defines
  x86/cpu: Switch to new Intel CPU model defines
  x86/cpu/intel_epb: Switch to new Intel CPU model defines
  x86/aperfmperf: Switch to new Intel CPU model defines
  x86/apic: Switch to new Intel CPU model defines
  perf/x86/msr: Switch to new Intel CPU model defines
  perf/x86/intel/uncore: Switch to new Intel CPU model defines
  perf/x86/intel/pt: Switch to new Intel CPU model defines
  perf/x86/lbr: Switch to new Intel CPU model defines
  perf/x86/intel/cstate: Switch to new Intel CPU model defines
  x86/bugs: Switch to new Intel CPU model defines
  x86/bugs: Switch to new Intel CPU model defines
  x86/cpu/vfm: Update arch/x86/include/asm/intel-family.h
  x86/cpu/vfm: Add new macros to work with (vendor/family/model) values
  ...
2024-05-13 18:44:44 -07:00
Linus Torvalds
c4273a6692 x86/cleanups changes for v6.10:
- Fix function prototypes to address clang function type cast
    warnings in the math-emu code
 
  - Reorder definitions in <asm/msr-index.h>
 
  - Remove unused code
 
  - Fix typos
 
  - Simplify #include sections
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZBvHQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jeSBAAqPMBFEYc5nge52ONZ8bzADEPQ6pBohgO
 xfONNuUpjtQ/Xtnhc8FGoFf+C9pnOlf2eX2VfusqvA6M9XJDgZxu1M6QZSOHuILo
 4T4opzTj7VYLbo1DQGLcPMymW/rhJNwKdRwhHr4SNIk9YcIJS7uyxtnLNvqjcCsB
 /iMw2/mhlXRXN1MP1Eg4YM6BXJ4qYkjx79gzKEGbq6tJgUahR37LGvw1aq+GAiap
 Wbo0o2jLgu8ByZXKEfUmUnW5jMR02LeUBg1OqDjaziO48df6eUi4ngaCoSA5qIew
 SDKZ1uq3qTOlDtGlxIGlBznM/HjvPejr+XQXKukCn+B9N62PMtR4fOS5q/4ODTD+
 wQttK0rg/fLpp1zgv33ey2N0qpbUxbtxC4JkA4DPfqstO/uiQXTNJM6H68Pqr9p/
 6TuW+HYrsgUdi54X4KTEHIAGOSUP0bjJrtSP6Tzxt9+epOQl+ymHaR07a4rRn2cw
 SnK7CQcWsjv90PUkCsb3F7gZtYVOkb4C0ZCPn2AlSPo+y0YnBadG+S6uQ6suFwxA
 kX5QNf+OPmqJZz/muqGQ+c7Swc9ONPdv6RSt35nqp2vz0ugp4Q1FNUciQGfOLj2V
 O0KaFVcdFvlkLGgxgYlGZJKxWKeuhh+L5IHyaL5fy7nOUhJtI+djoF5ZaCfR0Ofp
 Piqz80R6w9I=
 =6pkd
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Ingo Molnar:

 - Fix function prototypes to address clang function type cast
   warnings in the math-emu code

 - Reorder definitions in <asm/msr-index.h>

 - Remove unused code

 - Fix typos

 - Simplify #include sections

* tag 'x86-cleanups-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/pci/ce4100: Remove unused 'struct sim_reg_op'
  x86/msr: Move ARCH_CAP_XAPIC_DISABLE bit definition to its rightful place
  x86/math-emu: Fix function cast warnings
  x86/extable: Remove unused fixup type EX_TYPE_COPY
  x86/rtc: Remove unused intel-mid.h
  x86/32: Remove unused IA32_STACK_TOP and two externs
  x86/head: Simplify relative include path to xen-head.S
  x86/fred: Fix typo in Kconfig description
  x86/syscall/compat: Remove ia32_unistd.h
  x86/syscall/compat: Remove unused macro __SYSCALL_ia32_NR
  x86/virt/tdx: Remove duplicate include
  x86/xen: Remove duplicate #include
2024-05-13 18:21:24 -07:00
Linus Torvalds
d71ec0ed03 x86/build changes for v6.10:
- Use -fpic to build the kexec 'purgatory' (self-contained code that runs between two kernels)
 
  - Clean up vmlinux.lds.S generation
 
  - Simplify the X86_EXTENDED_PLATFORM section of the x86 Kconfig
 
  - Misc cleanups & fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZBuqIRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hCkhAAoUz4ZPgaN9mN4TvCXzhEMgb2SO8Wm/Jr
 IdHunF9W8q5NMQHWDK5lPsLco95HPeX/Mqq1eWbe6/oAfSpUt38+OL2rq8pjCZnm
 G7wC7paPIK7Onl6l2gM9D+BlWpnHq8wsdGeMyV7VhqdhGAgbv8he+IlZKSUgLyiT
 l8CTzppHy0U6R6UYvz+ZnOWgYevWpVvty2lxrvhTR1VmITLrBNk3AJb8+GYSuqj3
 gUF4oOjiG8WvtjtLYhXw1Kf8vt577ix6iaiow00SP/A4rmWfWIN0WSBQhHcXJwVQ
 RDVHlNAoVJ4GY4oZU88ykuWqe5UEfMcJzI0l3nSqeiLgLpvtA3UNNdVvl+el8wU+
 181+4viNGS2owB9D+Na70BJEiJmGHHE7MfmEQEO1d9az/6Q4tXCJwKS+TymPFWYe
 wYMIz2bf03g+FksxljP9dgwe7enVFCnBhmmms8nfAmpACaLQVtMjElqGzIeTGckh
 52scmA6hXLlTwNVpeARQ36DL6tLkcyTPO2ujrEJzsRvWOB7EbAbpDJfHOhMIFQNt
 M+st803WZ4tRbwrwTYTyU4F4Wt4RkNlYo820M3TDSYfdA+M6h01y3ZR53z1VDvXy
 NuNVlhnl3dQr//VMwHFgCv5hD+VhAs8iKqxzj0W31cChv5WgFG7IGnu70/J9V/1n
 6MZasYJlKbg=
 =xZOX
 -----END PGP SIGNATURE-----

Merge tag 'x86-build-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 build updates from Ingo Molnar:

 - Use -fpic to build the kexec 'purgatory' (the self-contained
   code that runs between two kernels)

 - Clean up vmlinux.lds.S generation

 - Simplify the X86_EXTENDED_PLATFORM section of the x86 Kconfig

 - Misc cleanups & fixes

* tag 'x86-build-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/Kconfig: Merge the two CONFIG_X86_EXTENDED_PLATFORM entries
  x86/purgatory: Switch to the position-independent small code model
  x86/boot: Replace __PHYSICAL_START with LOAD_PHYSICAL_ADDR
  x86/vmlinux.lds.S: Take __START_KERNEL out conditional definition
  x86/vmlinux.lds.S: Remove conditional definition of LOAD_OFFSET
  vmlinux.lds.h: Fix a typo in comment
2024-05-13 18:05:08 -07:00
Linus Torvalds
7e3591453d Use uniform "Oops: " prefix for die() messages.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZBuL4RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jajQ/6AkJVvivRgSeRkXZVs2Mmhq4WzsFD9guS
 v1gTH3r05h8taKmABOMbBuCQi42JhyltmwnHjmiP0BTveLfGab9qAEN0feOGxtp+
 9TAs/9D0/GpAaK4R6W/VEIAx1poyaBw90jP/RcrRlDkLGAMyr9rIH/lzId9losEb
 iy7qK5lYBynecwEE7YPuIWp7x4hpSbdVs7Uttadq1dSqYsdJK1kTp+t+zr3TVZDr
 DjdQ2XgqAthbg4Bkwvc6K8Vve3vSanWcpHlkaKMyMoNfpfueubYdFmbgX/mZ51g8
 nGFEEAWSXQDFmvEXmKI2xrAXLNNshDBd0ts3AM0MJlimcGYeIF9kqaMkGqLl9DVp
 TNWn2Gb2RbjC09R+n4B0XlwFTH5lgvmYQ00x9jDB1z1rztPILXohqbs8IDJEIT7I
 dsUWjv1MI+zbAV5TzY2LBFyg19TecdPJv8FI9MNo45JPHmv1M7NhkmSK60bQ0Ekb
 2qV47zE+AlkXHZQG8a0Ss0YdC0aALqmky/zNKtVGuGLJrBCkhSavxCjTrYmsendS
 gPVxYRWIAfjqv+mGmw70qdfqTtEeuTclhsZ0SMwkHwcOBmdrU7BIWQslm8YbHd2D
 Gltk048zwl0V26k5DI2Wnrur8YNa3/OTtdNVc6D788NouBlq4gUZTi3JP4D//bp4
 V9kuQ3y0IrQ=
 =j8eM
 -----END PGP SIGNATURE-----

Merge tag 'x86-bugs-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 oops message cleanup from Ingo Molnar:

 - Use uniform "Oops: " prefix for die() messages

* tag 'x86-bugs-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/dumpstack: Use uniform "Oops: " prefix for die() messages
2024-05-13 18:03:15 -07:00
Linus Torvalds
9d8e0d52a2 x86/boot changes for v6.10:
- Move the kernel cmdline setup earlier in the boot process (again),
    to address a split_lock_detect= boot parameter bug.
 
  - Ignore relocations in .notes sections
 
  - Simplify boot stack setup
 
  - Re-introduce a bootloader quirk wrt. CR4 handling
 
  - Miscellaneous cleanups & fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZBt20RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jqAg//RwFDdZsxgE+2zc8x04xJuIFLLyXmEFD9
 /x0QhXzLWuxJU1E8XReHnDJhPr8yDWWQZrYzU8B9wkPGPoqh42s9Gb6YHKQw++/f
 F2c3EjVdIBcebMufWvSTnrmQc5Env6Ka5te96arK6F76KjH7snRPV3Vl0p5aO2pO
 GzVWuxfhmQtw6GxX+mzFCSlv1cLQBLM72P++6b7QiT3C5kWhcieaeYdzHcekrNPL
 i5BdHoE8ldqRu0Un9KCLbvyA20XsVGsjSLi3mOqguoCpIVI47J+bMnJWF7xpKhHI
 Zyv4pL0ftOC0K9mqF+f3JS6vGlevBIsdqzjfog/oRpO/iLSMEbMj/3jv2BYFAE1l
 HmhWDUaUtdvb/mU1PAUzhSZl8Qsjl25vlV7mAT2w6KAr/l1Y9fZGXZU2huFnw/3H
 AaMoiyIUDV0OO2h6TIvuH78YKl/aq3awLbZcZ4m4XD16Eg3rzq8vHKTVGt/kIaxW
 /z/C0HemSD9qKDoqwevUTGNbJJfWEUrx1wNK8B4Bw/EBN9Md6IgtINKgdG68/8HW
 xr9iJ9L34lTAKWtjIznqsJg8nq6q8ccGMngDCoN1KbVbn2z7jQqzWvCLml/PLwsO
 bdTxYBearZKMsmhCwj/qEBM58X3G2lQCl4KIUGQjyO6lWGTGGLCaQiw8lDQNu54E
 LyFJh2rwltE=
 =p7K3
 -----END PGP SIGNATURE-----

Merge tag 'x86-boot-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 boot updates from Ingo Molnar:

 - Move the kernel cmdline setup earlier in the boot process (again),
   to address a split_lock_detect= boot parameter bug

 - Ignore relocations in .notes sections

 - Simplify boot stack setup

 - Re-introduce a bootloader quirk wrt CR4 handling

 - Miscellaneous cleanups & fixes

* tag 'x86-boot-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot/64: Clear most of CR4 in startup_64(), except PAE, MCE and LA57
  x86/boot: Move kernel cmdline setup earlier in the boot process (again)
  x86/build: Clean up arch/x86/tools/relocs.c a bit
  x86/boot: Ignore relocations in .notes sections in walk_relocs() too
  x86: Rename __{start,end}_init_task to __{start,end}_init_stack
  x86/boot: Simplify boot stack setup
2024-05-13 17:50:36 -07:00
Linus Torvalds
48fc82c40b Locking changes for v6.10:
- Over a dozen code generation micro-optimizations for the atomic
    and spinlock code.
 
  - Add more __ro_after_init attributes
 
  - Robustify the lockdevent_*() macros
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZBrMMRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gSuA//YyLRTCGtH6d/fCudlzzoa14MHO/QiCv7
 lgmq3Vqif/m+MW7LwQJbLrxDPJPT1mE9Ol9woOc133Cj1QZhF/HQvDAKT9ZpMoXU
 d8U3kuZ7tN41TJuQx6vNSCv3w5ToKeXaQJGxiT6od2Y/0QlhUKhVBSBQVtyc/ma6
 o1Uhq1Qp5KPj928jiqwI0JCZJFqqLvzq/rIT38V05phHEPet4GbLMbz9ZTsw70pm
 xmLzGLXJQ9maziuVcmRUrctsAkbk+VhChQ9p4HrH6AcYPwyQoF+zJr7iocyzIMG2
 xQqhEYShI72lcRft8hZwlrLTKZJWSAkDIxIxaQ2egzsNBwBPbRpP0mUIz3qbwJxQ
 fqzKGxwDmxjiX1Ib4gIVje66hp2QpPX5G1ARoeKvbrHkXxzqVuFlaQBn1+OAQ/GV
 mNzKADxrjalhyiMksHXbEbUNEvXCGqC2N9AOWT6XNvpLDqTJBz/wB+f9cbx3gYEO
 9rXwVicWXLzUnEfbRaEjCrDeMEHMLqhaZIndgCx07JpFkkTtKLD1N9tBxFPNH+SP
 XK7SAsXrxwhBjGbWItfF4eOaPCey+/+kGhOPadfTg3g9zDjEBvX/YNBBw9q2CUWc
 JWd/gct+/Jnnkh1jdIj9yRF2xciVY+iOshHRzG+clo/PhRTwv+DwfMJ/uzn+oaSF
 vOT+exKA8bg=
 =rT48
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - Over a dozen code generation micro-optimizations for the atomic
   and spinlock code

 - Add more __ro_after_init attributes

 - Robustify the lockdevent_*() macros

* tag 'locking-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/pvqspinlock/x86: Use _Q_LOCKED_VAL in PV_UNLOCK_ASM macro
  locking/qspinlock/x86: Micro-optimize virt_spin_lock()
  locking/atomic/x86: Merge __arch{,_try}_cmpxchg64_emu_local() with __arch{,_try}_cmpxchg64_emu()
  locking/atomic/x86: Introduce arch_try_cmpxchg64_local()
  locking/pvqspinlock/x86: Remove redundant CMP after CMPXCHG in __raw_callee_save___pv_queued_spin_unlock()
  locking/pvqspinlock: Use try_cmpxchg() in qspinlock_paravirt.h
  locking/pvqspinlock: Use try_cmpxchg_acquire() in trylock_clear_pending()
  locking/qspinlock: Use atomic_try_cmpxchg_relaxed() in xchg_tail()
  locking/atomic/x86: Define arch_atomic_sub() family using arch_atomic_add() functions
  locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions
  locking/atomic/x86: Introduce arch_atomic64_read_nonatomic() to x86_32
  locking/atomic/x86: Introduce arch_atomic64_try_cmpxchg() to x86_32
  locking/atomic/x86: Introduce arch_try_cmpxchg64() for !CONFIG_X86_CMPXCHG64
  locking/atomic/x86: Modernize x86_32 arch_{,try_}_cmpxchg64{,_local}()
  locking/atomic/x86: Correct the definition of __arch_try_cmpxchg128()
  x86/tsc: Make __use_tsc __ro_after_init
  x86/kvm: Make kvm_async_pf_enabled __ro_after_init
  context_tracking: Make context_tracking_key __ro_after_init
  jump_label,module: Don't alloc static_key_mod for __ro_after_init keys
  locking/qspinlock: Always evaluate lockevent* non-event parameter once
2024-05-13 17:01:28 -07:00
Paolo Bonzini
4232da23d7 Merge tag 'loongarch-kvm-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.10

1. Add ParaVirt IPI support.
2. Add software breakpoint support.
3. Add mmio trace events support.
2024-05-10 13:20:18 -04:00
Thomas Gleixner
5754ace3c3 x86/topology/amd: Ensure that LLC ID is initialized
The original topology evaluation code initialized cpu_data::topo::llc_id
with the die ID initialy and then eventually overwrite it with information
gathered from a CPUID leaf.

The conversion analysis failed to spot that particular detail and omitted
this initial assignment under the assumption that each topology evaluation
path will set it up. That assumption is mostly correct, but turns out to be
wrong in case that the CPUID leaf 0x80000006 does not provide a LLC ID.

In that case, LLC ID is invalid and as a consequence the setup of the
scheduling domain CPU masks is incorrect which subsequently causes the
scheduler core to complain about it during CPU hotplug:

  BUG: arch topology borken
       the CLS domain not a subset of the MC domain

Cure it by reusing legacy_set_llc() and assigning the die ID if the LLC ID
is invalid after all possible parsers have been tried.

Fixes: f7fb3b2dd9 ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Reported-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Link: https://lore.kernel.org/r/PUZPR04MB63168AC442C12627E827368581292@PUZPR04MB6316.apcprd04.prod.outlook.com
2024-05-10 17:42:50 +02:00
Shyam Sundar S K
0e640f0a47 x86/amd_nb: Add new PCI IDs for AMD family 0x1a
Add the new PCI Device IDs to the MISC IDs list to support new
generation of AMD 1Ah family 70h Models of processors.

  [ bp: Massage commit message. ]

Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240510111829.969501-1-Shyam-sundar.S-k@amd.com
2024-05-10 14:52:46 +02:00
Masahiro Yamada
b1992c3772 kbuild: use $(src) instead of $(srctree)/$(src) for source directory
Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for
checked-in source files. It is merely a convention without any functional
difference. In fact, $(obj) and $(src) are exactly the same, as defined
in scripts/Makefile.build:

    src := $(obj)

When the kernel is built in a separate output directory, $(src) does
not accurately reflect the source directory location. While Kbuild
resolves this discrepancy by specifying VPATH=$(srctree) to search for
source files, it does not cover all cases. For example, when adding a
header search path for local headers, -I$(srctree)/$(src) is typically
passed to the compiler.

This introduces inconsistency between upstream and downstream Makefiles
because $(src) is used instead of $(srctree)/$(src) for the latter.

To address this inconsistency, this commit changes the semantics of
$(src) so that it always points to the directory in the source tree.

Going forward, the variables used in Makefiles will have the following
meanings:

  $(obj)     - directory in the object tree
  $(src)     - directory in the source tree  (changed by this commit)
  $(objtree) - the top of the kernel object tree
  $(srctree) - the top of the kernel source tree

Consequently, $(srctree)/$(src) in upstream Makefiles need to be replaced
with $(src).

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2024-05-10 04:34:52 +09:00
Dr. David Alan Gilbert
57f6d0aed7 x86/microcode: Remove unused struct cpu_info_ctx
This looks unused since

  2071c0aeda ("x86/microcode: Simplify init path even more")

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240506004300.770564-1-linux@treblig.org
2024-05-06 11:00:57 +02:00
Thomas Gleixner
720a22fd6c x86/apic: Don't access the APIC when disabling x2APIC
With 'iommu=off' on the kernel command line and x2APIC enabled by the BIOS
the code which disables the x2APIC triggers an unchecked MSR access error:

  RDMSR from 0x802 at rIP: 0xffffffff94079992 (native_apic_msr_read+0x12/0x50)

This is happens because default_acpi_madt_oem_check() selects an x2APIC
driver before the x2APIC is disabled.

When the x2APIC is disabled because interrupt remapping cannot be enabled
due to 'iommu=off' on the command line, x2apic_disable() invokes
apic_set_fixmap() which in turn tries to read the APIC ID. This triggers
the MSR warning because x2APIC is disabled, but the APIC driver is still
x2APIC based.

Prevent that by adding an argument to apic_set_fixmap() which makes the
APIC ID read out conditional and set it to false from the x2APIC disable
path. That's correct as the APIC ID has already been read out during early
discovery.

Fixes: d10a904435 ("x86/apic: Consolidate boot_cpu_physical_apicid initialization sites")
Reported-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/875xw5t6r7.ffs@tglx
2024-04-30 07:51:34 +02:00
Jacob Pan
fef05a078b x86/irq: Factor out common code for checking pending interrupts
Use a common function for checking pending interrupt vector in APIC IRR
instead of duplicated open coding them.

Additional checks for posted MSI vectors can then be contained in this
function.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240423174114.526704-10-jacob.jun.pan@linux.intel.com
2024-04-30 00:54:43 +02:00
Jacob Pan
1b03d82ba1 x86/irq: Install posted MSI notification handler
All MSI vectors are multiplexed into a single notification vector when
posted MSI is enabled. It is the responsibility of the notification vector
handler to demultiplex MSI vectors. In the handler the MSI vector handlers
are dispatched without IDT delivery for each pending MSI interrupt.

For example, the interrupt flow will change as follows:
(3 MSIs of different vectors arrive in a a high frequency burst)

BEFORE:
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()

AFTER:
interrupt /* Posted MSI notification vector */
    irq_enter()
	atomic_xchg(PIR)
	handler()
	handler()
	handler()
	pi_clear_on()
    apic_eoi()
    irq_exit()
        process_softirq()

Except for the leading MSI, CPU notifications are skipped/coalesced.

For MSIs which arrive at a low frequency, the demultiplexing loop does not
wait for more interrupts to coalesce. Therefore, there's no additional
latency other than the processing time.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240423174114.526704-9-jacob.jun.pan@linux.intel.com
2024-04-30 00:54:42 +02:00
Jacob Pan
6087c7f36a x86/irq: Factor out handler invocation from common_interrupt()
Prepare for calling external interrupt handlers directly from the posted
MSI demultiplexing loop. Extract the common code from common_interrupt() to
avoid code duplication.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240423174114.526704-8-jacob.jun.pan@linux.intel.com
2024-04-30 00:54:42 +02:00
Jacob Pan
43650dcf6d x86/irq: Set up per host CPU posted interrupt descriptors
To support posted MSIs, create a posted interrupt descriptor (PID) for each
host CPU. Later on, when setting up interrupt affinity, the IOMMU's
interrupt remapping table entry (IRTE) will point to the physical address
of the matching CPU's PID.

Each PID is initialized with the owner CPU's physical APICID as the
destination.

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240423174114.526704-7-jacob.jun.pan@linux.intel.com
2024-04-30 00:54:42 +02:00
Daniel J Blueman
455f9075f1 x86/tsc: Trust initial offset in architectural TSC-adjust MSRs
When the BIOS configures the architectural TSC-adjust MSRs on secondary
sockets to correct a constant inter-chassis offset, after Linux brings the
cores online, the TSC sync check later resets the core-local MSR to 0,
triggering HPET fallback and leading to performance loss.

Fix this by unconditionally using the initial adjust values read from the
MSRs. Trusting the initial offsets in this architectural mechanism is a
better approach than special-casing workarounds for specific platforms.

Signed-off-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steffen Persvold <sp@numascale.com>
Reviewed-by: James Cleverdon <james.cleverdon.external@eviden.com>
Reviewed-by: Dimitri Sivanich <sivanich@hpe.com>
Reviewed-by: Prarit Bhargava <prarit@redhat.com>
Link: https://lore.kernel.org/r/20240419085146.175665-1-daniel@quora.org
2024-04-29 23:27:16 +02:00
Ashish Kalra
d6d85ac15c x86/e820: Add a new e820 table update helper
Add a new API helper e820__range_update_table() with which to update an
arbitrary e820 table. Move all current users of
e820__range_update_kexec() to this new helper.

  [ bp: Massage. ]

Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/b726af213ad55053f8a7a1e793b01bb3f1ca9dd5.1714090302.git.ashish.kalra@amd.com
2024-04-29 11:15:31 +02:00
Tony Luck
d9b6886cd7 x86/tsc_msr: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240424181518.41927-1-tony.luck%40intel.com
2024-04-29 10:31:34 +02:00
Tony Luck
f21b075b67 x86/tsc: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240424181517.41907-1-tony.luck%40intel.com
2024-04-29 10:31:32 +02:00
Tony Luck
4db64279bc x86/cpu: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

[ dhansen: vertically align macro and remove stray subject / ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240424181516.41887-1-tony.luck%40intel.com
2024-04-29 10:31:30 +02:00
Tony Luck
db99675e43 x86/resctrl: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

  [ bp: Squash two resctrl patches into one. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240424181514.41848-1-tony.luck%40intel.com
2024-04-29 10:31:28 +02:00
Tony Luck
375a756448 x86/microcode/intel: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240424181513.41829-1-tony.luck%40intel.com
2024-04-29 10:31:26 +02:00
Tony Luck
4a5f2dd162 x86/mce: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

  [ bp: Squash *three* mce patches into one, fold in fix:
    https://lore.kernel.org/r/20240429022051.63360-1-tony.luck@intel.com ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240424181511.41772-1-tony.luck%40intel.com
2024-04-29 10:31:23 +02:00
Tony Luck
c73cd37221 x86/cpu: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240424181511.41753-1-tony.luck%40intel.com
2024-04-29 10:31:19 +02:00
Tony Luck
fe3edc524d x86/cpu/intel_epb: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240424181510.41733-1-tony.luck%40intel.com
2024-04-29 10:31:16 +02:00
Tony Luck
d32bc2111c x86/aperfmperf: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240424181505.41654-1-tony.luck%40intel.com
2024-04-29 10:31:12 +02:00
Tony Luck
8fb5f44e5d x86/apic: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240424181504.41634-1-tony.luck%40intel.com
2024-04-29 10:31:08 +02:00
Rick Edgecombe
c44357c2e7 x86/mm: care about shadow stack guard gap during placement
When memory is being placed, mmap() will take care to respect the guard
gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and
VM_GROWSDOWN).  In order to ensure guard gaps between mappings, mmap()
needs to consider two things:

 1. That the new mapping isn't placed in an any existing mappings guard
    gaps.
 2. That the new mapping isn't placed such that any existing mappings
    are not in *its* guard gaps.

The longstanding behavior of mmap() is to ensure 1, but not take any care
around 2.  So for example, if there is a PAGE_SIZE free area, and a mmap()
with a PAGE_SIZE size, and a type that has a guard gap is being placed,
mmap() may place the shadow stack in the PAGE_SIZE free area.  Then the
mapping that is supposed to have a guard gap will not have a gap to the
adjacent VMA.

Now that the vm_flags is passed into the arch get_unmapped_area()'s, and
vm_unmapped_area() is ready to consider it, have VM_SHADOW_STACK's get
guard gap consideration for scenario 2.

Link: https://lkml.kernel.org/r/20240326021656.202649-14-rick.p.edgecombe@intel.com
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:28 -07:00
Rick Edgecombe
c5ecd8eb8c x86/mm: implement HAVE_ARCH_UNMAPPED_AREA_VMFLAGS
When memory is being placed, mmap() will take care to respect the guard
gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and
VM_GROWSDOWN).  In order to ensure guard gaps between mappings, mmap()
needs to consider two things:

 1. That the new mapping isn't placed in an any existing mappings guard
    gaps.
 2. That the new mapping isn't placed such that any existing mappings
    are not in *its* guard gaps.

The longstanding behavior of mmap() is to ensure 1, but not take any care
around 2.  So for example, if there is a PAGE_SIZE free area, and a mmap()
with a PAGE_SIZE size, and a type that has a guard gap is being placed,
mmap() may place the shadow stack in the PAGE_SIZE free area.  Then the
mapping that is supposed to have a guard gap will not have a gap to the
adjacent VMA.

Add x86 arch implementations of arch_get_unmapped_area_vmflags/_topdown()
so future changes can allow the guard gap of type of vma being placed to
be taken into account.  This will be used for shadow stack memory.

Link: https://lkml.kernel.org/r/20240326021656.202649-13-rick.p.edgecombe@intel.com
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:28 -07:00
Rick Edgecombe
b80fa3cbb7 treewide: use initializer for struct vm_unmapped_area_info
Future changes will need to add a new member to struct
vm_unmapped_area_info.  This would cause trouble for any call site that
doesn't initialize the struct.  Currently every caller sets each member
manually, so if new ones are added they will be uninitialized and the core
code parsing the struct will see garbage in the new member.

It could be possible to initialize the new member manually to 0 at each
call site.  This and a couple other options were discussed.  Having some
struct vm_unmapped_area_info instances not zero initialized will put those
sites at risk of feeding garbage into vm_unmapped_area(), if the
convention is to zero initialize the struct and any new field addition
missed a call site that initializes each field manually.  So it is useful
to do things similar across the kernel.

The consensus (see links) was that in general the best way to accomplish
taking into account both code cleanliness and minimizing the chance of
introducing bugs, was to do C99 static initialization.  As in: struct
vm_unmapped_area_info info = {};

With this method of initialization, the whole struct will be zero
initialized, and any statements setting fields to zero will be unneeded. 
The change should not leave cleanup at the call sides.

While iterating though the possible solutions a few archs kindly acked
other variations that still zero initialized the struct.  These sites have
been modified in previous changes using the pattern acked by the
respective arch.

So to be reduce the chance of bugs via uninitialized fields, perform a
tree wide change using the consensus for the best general way to do this
change.  Use C99 static initializing to zero the struct and remove and
statements that simply set members to zero.

Link: https://lkml.kernel.org/r/20240326021656.202649-11-rick.p.edgecombe@intel.com
Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t
Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/
Link: https://lore.kernel.org/lkml/ec3e377a-c0a0-4dd3-9cb9-96517e54d17e@csgroup.eu/
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:27 -07:00
Rick Edgecombe
529ce23a76 mm: switch mm->get_unmapped_area() to a flag
The mm_struct contains a function pointer *get_unmapped_area(), which is
set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown()
during the initialization of the mm.

Since the function pointer only ever points to two functions that are
named the same across all arch's, a function pointer is not really
required.  In addition future changes will want to add versions of the
functions that take additional arguments.  So to save a pointers worth of
bytes in mm_struct, and prevent adding additional function pointers to
mm_struct in future changes, remove it and keep the information about
which get_unmapped_area() to use in a flag.

Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by
mmf_init_flags().  Most MM flags get clobbered on fork.  In the
pre-existing behavior mm->get_unmapped_area() would get copied to the new
mm in dup_mm(), so not clobbering the flag preserves the existing behavior
around inheriting the topdown-ness.

Introduce a helper, mm_get_unmapped_area(), to easily convert code that
refers to the old function pointer to instead select and call either
arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the
flag.  Then drop the mm->get_unmapped_area() function pointer.  Leave the
get_unmapped_area() pointer in struct file_operations alone.  The main
purpose of this change is to reorganize in preparation for future changes,
but it also converts the calls of mm->get_unmapped_area() from indirect
branches into a direct ones.

The stress-ng bigheap benchmark calls realloc a lot, which calls through
get_unmapped_area() in the kernel.  On x86, the change yielded a ~1%
improvement there on a retpoline config.

In testing a few x86 configs, removing the pointer unfortunately didn't
result in any actual size reductions in the compiled layout of mm_struct. 
But depending on compiler or arch alignment requirements, the change could
shrink the size of mm_struct.

Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:25 -07:00
Baoquan He
fdb022f6e9 x86: remove unneeded memblock_find_dma_reserve()
Patch series "mm/mm_init.c: refactor free_area_init_core()".

In function free_area_init_core(), the code calculating
zone->managed_pages and the subtracting dma_reserve from DMA zone looks
very confusing.

From git history, the code calculating zone->managed_pages was for
zone->present_pages originally.  The early rough assignment is for
optimize zone's pcp and water mark setting.  Later, managed_pages was
introduced into zone to represent the number of managed pages by buddy. 
Now, zone->managed_pages is zeroed out and reset in mem_init() when
calling memblock_free_all().  zone's pcp and wmark setting relying on
actual zone->managed_pages are done later than mem_init() invocation.  So
we don't need rush to early calculate and set zone->managed_pages, just
set it as zone->present_pages, will adjust it in mem_init().

And also add a new function calc_nr_kernel_pages() to count up free but
not reserved pages in memblock, then assign it to nr_all_pages and
nr_kernel_pages after memmap pages are allocated.


This patch (of 6):

Variable dma_reserve and its usage was introduced in commit 0e0b864e06
("[PATCH] Account for memmap and optionally the kernel image as holes"). 
Its original purpose was to accounting for the reserved pages in DMA zone
to make DMA zone's watermarks calculation more accurate on x86.

However, currently there's zone->managed_pages to account for all
available pages for buddy, zone->present_pages to account for all present
physical pages in zone.  What is more important, on x86, calculating and
setting the zone->managed_pages is a temporary move, all zone's
managed_pages will be zeroed out and reset to the actual value according
to how many pages are added to buddy allocator in mem_init().  Before
mem_init(), no buddy alloction is requested.  And zone's pcp and watermark
setting are all done after mem_init().  So, no need to worry about the DMA
zone's setting accuracy during free_area_init().

Hence, remove memblock_find_dma_reserve() to stop calculating and
setting dma_reserve.

Link: https://lkml.kernel.org/r/20240325145646.1044760-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20240325145646.1044760-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:10 -07:00
Suren Baghdasaryan
8a2f118787 change alloc_pages name in dma_map_ops to avoid name conflicts
After redefining alloc_pages, all uses of that name are being replaced. 
Change the conflicting names to prevent preprocessor from replacing them
when it's not intended.

Link: https://lkml.kernel.org/r/20240321163705.3067592-18-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Gary Guo <gary@garyguo.net>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:55:53 -07:00
Kent Overstreet
0069455bcb fix missing vmalloc.h includes
Patch series "Memory allocation profiling", v6.

Overview:
Low overhead [1] per-callsite memory allocation profiling. Not just for
debug kernels, overhead low enough to be deployed in production.

Example output:
  root@moria-kvm:~# sort -rn /proc/allocinfo
   127664128    31168 mm/page_ext.c:270 func:alloc_page_ext
    56373248     4737 mm/slub.c:2259 func:alloc_slab_page
    14880768     3633 mm/readahead.c:247 func:page_cache_ra_unbounded
    14417920     3520 mm/mm_init.c:2530 func:alloc_large_system_hash
    13377536      234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
    11718656     2861 mm/filemap.c:1919 func:__filemap_get_folio
     9192960     2800 kernel/fork.c:307 func:alloc_thread_stack_node
     4206592        4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
     4136960     1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
     3940352      962 mm/memory.c:4214 func:alloc_anon_folio
     2894464    22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
     ...

Usage:
kconfig options:
 - CONFIG_MEM_ALLOC_PROFILING
 - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
 - CONFIG_MEM_ALLOC_PROFILING_DEBUG
   adds warnings for allocations that weren't accounted because of a
   missing annotation

sysctl:
  /proc/sys/vm/mem_profiling

Runtime info:
  /proc/allocinfo

Notes:

[1]: Overhead
To measure the overhead we are comparing the following configurations:
(1) Baseline with CONFIG_MEMCG_KMEM=n
(2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n)
(3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y)
(4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
(5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
(6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n)  && CONFIG_MEMCG_KMEM=y
(7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y

Performance overhead:
To evaluate performance we implemented an in-kernel test executing
multiple get_free_page/free_page and kmalloc/kfree calls with allocation
sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
affinity set to a specific CPU to minimize the noise. Below are results
from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
56 core Intel Xeon:

                        kmalloc                 pgalloc
(1 baseline)            6.764s                  16.902s
(2 default disabled)    6.793s  (+0.43%)        17.007s (+0.62%)
(3 default enabled)     7.197s  (+6.40%)        23.666s (+40.02%)
(4 runtime enabled)     7.405s  (+9.48%)        23.901s (+41.41%)
(5 memcg)               13.388s (+97.94%)       48.460s (+186.71%)
(6 def disabled+memcg)  13.332s (+97.10%)       48.105s (+184.61%)
(7 def enabled+memcg)   13.446s (+98.78%)       54.963s (+225.18%)

Memory overhead:
Kernel size:

   text           data        bss         dec         diff
(1) 26515311	      18890222    17018880    62424413
(2) 26524728	      19423818    16740352    62688898    264485
(3) 26524724	      19423818    16740352    62688894    264481
(4) 26524728	      19423818    16740352    62688898    264485
(5) 26541782	      18964374    16957440    62463596    39183

Memory consumption on a 56 core Intel CPU with 125GB of memory:
Code tags:           192 kB
PageExts:         262144 kB (256MB)
SlabExts:           9876 kB (9.6MB)
PcpuExts:            512 kB (0.5MB)

Total overhead is 0.2% of total memory.

Benchmarks:

Hackbench tests run 100 times:
hackbench -s 512 -l 200 -g 15 -f 25 -P
      baseline       disabled profiling           enabled profiling
avg   0.3543         0.3559 (+0.0016)             0.3566 (+0.0023)
stdev 0.0137         0.0188                       0.0077


hackbench -l 10000
      baseline       disabled profiling           enabled profiling
avg   6.4218         6.4306 (+0.0088)             6.5077 (+0.0859)
stdev 0.0933         0.0286                       0.0489

stress-ng tests:
stress-ng --class memory --seq 4 -t 60
stress-ng --class cpu --seq 4 -t 60
Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/

[2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/


This patch (of 37):

The next patch drops vmalloc.h from a system header in order to fix a
circular dependency; this adds it to all the files that were pulling it in
implicitly.

[kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c]
  Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev
[surenb@google.com: fix arch/x86/mm/numa_32.c]
  Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com
[kent.overstreet@linux.dev: a few places were depending on sizes.h]
  Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev
[arnd@arndb.de: fix mm/kasan/hw_tags.c]
  Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org
[surenb@google.com: fix arc build]
  Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com
Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com
Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Gary Guo <gary@garyguo.net>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:55:49 -07:00
Tom Lendacky
1e52550729 x86/sev: Shorten struct name snp_secrets_page_layout to snp_secrets_page
Ending a struct name with "layout" is a little redundant, so shorten the
snp_secrets_page_layout name to just snp_secrets_page.

No functional change.

  [ bp: Rename the local pointer to "secrets" too for more clarity. ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/bc8d58302c6ab66c3beeab50cce3ec2c6bd72d6c.1713974291.git.thomas.lendacky@amd.com
2024-04-25 16:13:51 +02:00
Tony Luck
b24e466abf x86/bugs: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240424181507.41693-1-tony.luck@intel.com
2024-04-25 12:42:13 +02:00
Tony Luck
8a28b02202 x86/bugs: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240424181506.41673-1-tony.luck@intel.com
2024-04-25 12:27:25 +02:00
David Kaplan
b53c6bd5d2 x86/cpu: Fix check for RDPKRU in __show_regs()
cpu_feature_enabled(X86_FEATURE_OSPKE) does not necessarily reflect
whether CR4.PKE is set on the CPU.  In particular, they may differ on
non-BSP CPUs before setup_pku() is executed.  In this scenario, RDPKRU
will #UD causing the system to hang.

Fix by checking CR4 for PKE enablement which is always correct for the
current CPU.

The scenario happens by inserting a WARN* before setup_pku() in
identiy_cpu() or some other diagnostic which would lead to calling
__show_regs().

  [ bp: Massage commit message. ]

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240421191728.32239-1-bp@kernel.org
2024-04-24 14:30:21 +02:00
Haifeng Xu
931be446c6 x86/resctrl: Add tracepoint for llc_occupancy tracking
In our production environment, after removing monitor groups, those
unused RMIDs get stuck in the limbo list forever because their
llc_occupancy is always larger than the threshold. But the unused RMIDs
can be successfully freed by turning up the threshold.

In order to know how much the threshold should be, perf can be used to
acquire the llc_occupancy of RMIDs in each rdt domain.

Instead of using perf tool to track llc_occupancy and filter the log
manually, it is more convenient for users to use tracepoint to do this
work. So add a new tracepoint that shows the llc_occupancy of busy RMIDs
when scanning the limbo list.

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Suggested-by: James Morse <james.morse@arm.com>
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240408092303.26413-3-haifeng.xu@shopee.com
2024-04-24 14:24:48 +02:00
Haifeng Xu
8773922948 x86/resctrl: Rename pseudo_lock_event.h to trace.h
Now only the pseudo-locking part uses tracepoints to do event tracking,
but other parts of resctrl may need new tracepoints. It is unnecessary
to create separate header files and define CREATE_TRACE_POINTS in
different c files which fragments the resctrl tracing.

Therefore, give the resctrl tracepoint header file a generic name to
support its use for tracepoints that are not specific to pseudo-locking.

No functional change.

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240408092303.26413-2-haifeng.xu@shopee.com
2024-04-24 14:21:52 +02:00
Wenkuan Wang
2718a7fdf2 x86/CPU/AMD: Add models 0x10-0x1f to the Zen5 range
Add some more Zen5 models.

Fixes: 3e4147f33f ("x86/CPU/AMD: Add X86_FEATURE_ZEN5")
Signed-off-by: Wenkuan Wang <Wenkuan.Wang@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240423144111.1362-1-bp@kernel.org
2024-04-24 14:05:25 +02:00
Tony Luck
bd4955d4bc x86/resctrl: Simplify call convention for MSR update functions
The per-resource MSR update functions cat_wrmsr(), mba_wrmsr_intel(),
and mba_wrmsr_amd() all take three arguments:

  (struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)

struct msr_param contains pointers to both struct rdt_resource and struct
rdt_domain, thus only struct msr_param is necessary.

Pass struct msr_param as a single parameter. Clean up formatting and
fix some fir tree declaration ordering.

No functional change.

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Link: https://lore.kernel.org/r/20240308213846.77075-3-tony.luck@intel.com
2024-04-24 13:47:00 +02:00
Tony Luck
e3ca96e479 x86/resctrl: Pass domain to target CPU
reset_all_ctrls() and resctrl_arch_update_domains() use on_each_cpu_mask()
to call rdt_ctrl_update() on potentially one CPU from each domain.

But this means rdt_ctrl_update() needs to figure out which domain to
apply changes to. Doing so requires a search of all domains in a resource,
which can only be done safely if cpus_lock is held. Both callers do hold
this lock, but there isn't a way for a function called on another CPU
via IPI to verify this.

Commit

  c0d848fcb0 ("x86/resctrl: Remove lockdep annotation that triggers
  false positive")

removed the incorrect assertions.

Add the target domain to the msr_param structure and call
rdt_ctrl_update() for each domain separately using
smp_call_function_single(). This means that rdt_ctrl_update() doesn't
need to search for the domain and get_domain_from_cpu() can safely
assert that the cpus_lock is held since the remaining callers do not use
IPI.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Link: https://lore.kernel.org/r/20240308213846.77075-2-tony.luck@intel.com
2024-04-24 13:41:41 +02:00
Sourabh Jain
79365026f8 crash: add a new kexec flag for hotplug support
Commit a72bbec70d ("crash: hotplug support for kexec_load()")
introduced a new kexec flag, `KEXEC_UPDATE_ELFCOREHDR`. Kexec tool uses
this flag to indicate to the kernel that it is safe to modify the
elfcorehdr of the kdump image loaded using the kexec_load system call.

However, it is possible that architectures may need to update kexec
segments other then elfcorehdr. For example, FDT (Flatten Device Tree)
on PowerPC. Introducing a new kexec flag for every new kexec segment
may not be a good solution. Hence, a generic kexec flag bit,
`KEXEC_CRASH_HOTPLUG_SUPPORT`, is introduced to share the CPU/Memory
hotplug support intent between the kexec tool and the kernel for the
kexec_load system call.

Now we have two kexec flags that enables crash hotplug support for
kexec_load system call. First is KEXEC_UPDATE_ELFCOREHDR (only used in
x86), and second is KEXEC_CRASH_HOTPLUG_SUPPORT (for all architectures).

To simplify the process of finding and reporting the crash hotplug
support the following changes are introduced.

1. Define arch specific function to process the kexec flags and
   determine crash hotplug support

2. Rename the @update_elfcorehdr member of struct kimage to
   @hotplug_support and populate it for both kexec_load and
   kexec_file_load syscalls, because architecture can update more than
   one kexec segment

3. Let generic function crash_check_hotplug_support report hotplug
   support for loaded kdump image based on value of @hotplug_support

To bring the x86 crash hotplug support in line with the above points,
the following changes have been made:

- Introduce the arch_crash_hotplug_support function to process kexec
  flags and determine crash hotplug support

- Remove the arch_crash_hotplug_[cpu|memory]_support functions

Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240326055413.186534-3-sourabhjain@linux.ibm.com
2024-04-23 14:59:01 +10:00
Sourabh Jain
118005713e crash: forward memory_notify arg to arch crash hotplug handler
In the event of memory hotplug or online/offline events, the crash
memory hotplug notifier `crash_memhp_notifier()` receives a
`memory_notify` object but doesn't forward that object to the
generic and architecture-specific crash hotplug handler.

The `memory_notify` object contains the starting PFN (Page Frame Number)
and the number of pages in the hot-removed memory. This information is
necessary for architectures like PowerPC to update/recreate the kdump
image, specifically `elfcorehdr`.

So update the function signature of `crash_handle_hotplug_event()` and
`arch_crash_handle_hotplug_event()` to accept the `memory_notify` object
as an argument from crash memory hotplug notifier.

Since no such object is available in the case of CPU hotplug event, the
crash CPU hotplug notifier `crash_cpuhp_online()` passes NULL to the
crash hotplug handler.

Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240326055413.186534-2-sourabhjain@linux.ibm.com
2024-04-23 14:59:01 +10:00
Tom Lendacky
e70316d17f x86/sev: Check for MWAITX and MONITORX opcodes in the #VC handler
The MWAITX and MONITORX instructions generate the same #VC error code as
the MWAIT and MONITOR instructions, respectively. Update the #VC handler
opcode checking to also support the MWAITX and MONITORX opcodes.

Fixes: e3ef461af3 ("x86/sev: Harden #VC instruction emulation somewhat")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/453d5a7cfb4b9fe818b6fb67f93ae25468bc9e23.1713793161.git.thomas.lendacky@amd.com
2024-04-22 18:38:28 +02:00
Tony Luck
f055b6260e x86/cpu/vfm: Update arch/x86/include/asm/intel-family.h
New CPU #defines encode vendor and family as well as model.

Update the example usage comment in arch/x86/kernel/cpu/match.c

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240416211941.9369-4-tony.luck@intel.com
2024-04-22 11:44:00 +02:00
Paolo Bonzini
a96cb3bf39 Merge x86 bugfixes from Linux 6.9-rc3
Pull fix for SEV-SNP late disable bugs.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19 09:02:22 -04:00
Eric Biggers
9543f6e266 x86/cpufeatures: Fix dependencies for GFNI, VAES, and VPCLMULQDQ
Fix cpuid_deps[] to list the correct dependencies for GFNI, VAES, and
VPCLMULQDQ.  These features don't depend on AVX512, and there exist CPUs
that support these features but not AVX512.  GFNI actually doesn't even
depend on AVX.

This prevents GFNI from being unnecessarily disabled if AVX is disabled
to mitigate the GDS vulnerability.

This also prevents all three features from being unnecessarily disabled
if AVX512VL (or its dependency AVX512F) were to be disabled, but it
looks like there isn't any case where this happens anyway.

Fixes: c128dbfa0f ("x86/cpufeatures: Enable new SSE/AVX/AVX512 CPU features")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20240417060434.47101-1-ebiggers@kernel.org
2024-04-18 17:27:52 +02:00
Josh Poimboeuf
69129794d9 x86/bugs: Fix BHI retpoline check
Confusingly, X86_FEATURE_RETPOLINE doesn't mean retpolines are enabled,
as it also includes the original "AMD retpoline" which isn't a retpoline
at all.

Also replace cpu_feature_enabled() with boot_cpu_has() because this is
before alternatives are patched and cpu_feature_enabled()'s fallback
path is slower than plain old boot_cpu_has().

Fixes: ec9404e40e ("x86/bhi: Add BHI mitigation knob")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/ad3807424a3953f0323c011a643405619f2a4927.1712944776.git.jpoimboe@kernel.org
2024-04-14 11:10:05 +02:00
Li RongQing
90167e9658 x86/sev: Take NUMA node into account when allocating memory for per-CPU SEV data
per-CPU SEV data is dominantly accessed from their own local CPUs,
so allocate them node-local to improve performance.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Nikunj A Dadhania <nikunj@amd.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20240412030130.49704-1-lirongqing@baidu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-12 12:12:11 +02:00
Ingo Molnar
21f546a43a Merge branch 'x86/urgent' into x86/cpu, to resolve conflict
There's a new conflict between this commit pending in x86/cpu:

  63edbaa48a x86/cpu/topology: Add support for the AMD 0x80000026 leaf

And these fixes in x86/urgent:

  c064b536a8 x86/cpu/amd: Make the NODEID_MSR union actually work
  1b3108f689 x86/cpu/amd: Make the CPUID 0x80000008 parser correct

Resolve them.

 Conflicts:
	arch/x86/kernel/cpu/topology_amd.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-12 12:11:45 +02:00
Thomas Gleixner
7211274fe0 x86/cpu/amd: Move TOPOEXT enablement into the topology parser
The topology rework missed that early_init_amd() tries to re-enable the
Topology Extensions when the BIOS disabled them.

The new parser is invoked before early_init_amd() so the re-enable attempt
happens too late.

Move it into the AMD specific topology parser code where it belongs.

Fixes: f7fb3b2dd9 ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/878r1j260l.ffs@tglx
2024-04-12 12:05:54 +02:00
Thomas Gleixner
c064b536a8 x86/cpu/amd: Make the NODEID_MSR union actually work
A system with NODEID_MSR was reported to crash during early boot without
any output.

The reason is that the union which is used for accessing the bitfields in
the MSR is written wrongly and the resulting executable code accesses the
wrong part of the MSR data.

As a consequence a later division by that value results in 0 and that
result is used for another division as divisor, which obviously does not
work well.

The magic world of C, unions and bitfields:

    union {
    	  u64   bita : 3,
	        bitb : 3;
	  u64   all;
    } x;

    x.all = foo();

    a = x.bita;
    b = x.bitb;

results in the effective executable code of:

   a = b = x.bita;

because bita and bitb are treated as union members and therefore both end
up at bit offset 0.

Wrapping the bitfield into an anonymous struct:

    union {
    	  struct {
    	     u64  bita : 3,
	          bitb : 3;
          };
	  u64	  all;
    } x;

works like expected.

Rework the NODEID_MSR union in exactly that way to cure the problem.

Fixes: f7fb3b2dd9 ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Reported-by: Laura Nao <laura.nao@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Laura Nao <laura.nao@collabora.com>
Link: https://lore.kernel.org/r/20240410194311.596282919@linutronix.de
Closes: https://lore.kernel.org/all/20240322175210.124416-1-laura.nao@collabora.com/
2024-04-12 12:05:54 +02:00
Thomas Gleixner
1b3108f689 x86/cpu/amd: Make the CPUID 0x80000008 parser correct
CPUID 0x80000008 ECX.cpu_nthreads describes the number of threads in the
package. The parser uses this value to initialize the SMT domain level.

That's wrong because cpu_nthreads does not describe the number of threads
per physical core. So this needs to set the CORE domain level and let the
later parsers set the SMT shift if available.

Preset the SMT domain level with the assumption of one thread per core,
which is correct ifrt here are no other CPUID leafs to parse, and propagate
cpu_nthreads and the core level APIC bitwidth into the CORE domain.

Fixes: f7fb3b2dd9 ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Reported-by: Laura Nao <laura.nao@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Laura Nao <laura.nao@collabora.com>
Link: https://lore.kernel.org/r/20240410194311.535206450@linutronix.de
2024-04-12 12:05:54 +02:00
Josh Poimboeuf
4f511739c5 x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
For consistency with the other CONFIG_MITIGATION_* options, replace the
CONFIG_SPECTRE_BHI_{ON,OFF} options with a single
CONFIG_MITIGATION_SPECTRE_BHI option.

[ mingo: Fix ]

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/3833812ea63e7fdbe36bf8b932e63f70d18e2a2a.1712813475.git.jpoimboe@kernel.org
2024-04-12 12:05:54 +02:00
Josh Poimboeuf
36d4fe147c x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
Unlike most other mitigations' "auto" options, spectre_bhi=auto only
mitigates newer systems, which is confusing and not particularly useful.

Remove it.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/412e9dc87971b622bbbaf64740ebc1f140bff343.1712813475.git.jpoimboe@kernel.org
2024-04-12 12:05:54 +02:00
Paolo Bonzini
eb4441864e KVM: SEV: sync FPU and AVX state at LAUNCH_UPDATE_VMSA time
SEV-ES allows passing custom contents for x87, SSE and AVX state into the VMSA.
Allow userspace to do that with the usual KVM_SET_XSAVE API and only mark
FPU contents as confidential after it has been copied and encrypted into
the VMSA.

Since the XSAVE state for AVX is the first, it does not need the
compacted-state handling of get_xsave_addr().  However, there are other
parts of XSAVE state in the VMSA that currently are not handled, and
the validation logic of get_xsave_addr() is pointless to duplicate
in KVM, so move get_xsave_addr() to public FPU API; it is really just
a facility to operate on XSAVE state and does not expose any internal
details of arch/x86/kernel/fpu.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-12-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:25 -04:00
Ingo Molnar
5b9b2e6baf Linux 6.9-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmYTAJYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG2bEH/jOBXd0ZCz+s9+F4
 TbSvDEE8UjitQdEJ5WSBY9CEvFI8OuVQr23gYPUn+gfgqLX0Vsp8HfxL6bBP5Tj6
 DSzAkwF/mvIfa6VLFmO1GmvyhYtmWkmbM825tNqKHSNTBc9cCLH3H+780wNtTMwQ
 VkB8O3KS/wZBGKSbFfiXW+fT3SkWIMLtdBAaox+vcxHXpiluXxSbxANRD5kTbdG0
 UAW9S4+3A0jNk/KeXEvJDqkf7C3ASsjtNPbK+gFDfOXxdNYFTC2IUf93rL61VB4s
 C2rtUklcLE8gFDtvkQ8JtGWmDj4pWPEDIyhICKlzP/aKCjXcNzLaoM0GJQYJS+PN
 aNevw24=
 =318J
 -----END PGP SIGNATURE-----

Merge tag 'v6.9-rc3' into x86/boot, to pick up fixes before queueing up more changes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-11 15:36:23 +02:00
Josh Poimboeuf
5f882f3b0a x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
While syscall hardening helps prevent some BHI attacks, there's still
other low-hanging fruit remaining.  Don't classify it as a mitigation
and make it clear that the system may still be vulnerable if it doesn't
have a HW or SW mitigation enabled.

Fixes: ec9404e40e ("x86/bhi: Add BHI mitigation knob")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/b5951dae3fdee7f1520d5136a27be3bdfe95f88b.1712813475.git.jpoimboe@kernel.org
2024-04-11 10:30:33 +02:00
Josh Poimboeuf
1cea8a280d x86/bugs: Fix BHI handling of RRSBA
The ARCH_CAP_RRSBA check isn't correct: RRSBA may have already been
disabled by the Spectre v2 mitigation (or can otherwise be disabled by
the BHI mitigation itself if needed).  In that case retpolines are fine.

Fixes: ec9404e40e ("x86/bhi: Add BHI mitigation knob")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/6f56f13da34a0834b69163467449be7f58f253dc.1712813475.git.jpoimboe@kernel.org
2024-04-11 10:30:33 +02:00
Ingo Molnar
d0485730d2 x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
So we are using the 'ia32_cap' value in a number of places,
which got its name from MSR_IA32_ARCH_CAPABILITIES MSR register.

But there's very little 'IA32' about it - this isn't 32-bit only
code, nor does it originate from there, it's just a historic
quirk that many Intel MSR names are prefixed with IA32_.

This is already clear from the helper method around the MSR:
x86_read_arch_cap_msr(), which doesn't have the IA32 prefix.

So rename 'ia32_cap' to 'x86_arch_cap_msr' to be consistent with
its role and with the naming of the helper function.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Nikolay Borisov <nik.borisov@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/9592a18a814368e75f8f4b9d74d3883aa4fd1eaf.1712813475.git.jpoimboe@kernel.org
2024-04-11 10:30:33 +02:00
Josh Poimboeuf
cb2db5bb04 x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
There's no need to keep reading MSR_IA32_ARCH_CAPABILITIES over and
over.  It's even read in the BHI sysfs function which is a big no-no.
Just read it once and cache it.

Fixes: ec9404e40e ("x86/bhi: Add BHI mitigation knob")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/9592a18a814368e75f8f4b9d74d3883aa4fd1eaf.1712813475.git.jpoimboe@kernel.org
2024-04-11 10:30:33 +02:00
Thomas Gleixner
a9025cd1c6 x86/topology: Don't update cpu_possible_map in topo_set_cpuids()
topo_set_cpuids() updates cpu_present_map and cpu_possible map. It is
invoked during enumeration and "physical hotplug" operations. In the
latter case this results in a kernel crash because cpu_possible_map is
marked read only after init completes.

There is no reason to update cpu_possible_map in that function. During
enumeration cpu_possible_map is not relevant and gets fully initialized
after enumeration completed. On "physical hotplug" the bit is already set
because the kernel allows only CPUs to be plugged which have been
enumerated and associated to a CPU number during early boot.

Remove the bogus update of cpu_possible_map.

Fixes: 0e53e7b656 ("x86/cpu/topology: Sanitize the APIC admission logic")
Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87ttkc6kwx.ffs@tglx
2024-04-10 15:31:38 +02:00
Daniel Sneddon
04f4230e2f x86/bugs: Fix return type of spectre_bhi_state()
The definition of spectre_bhi_state() incorrectly returns a const char
* const. This causes the a compiler warning when building with W=1:

 warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
 2812 | static const char * const spectre_bhi_state(void)

Remove the const qualifier from the pointer.

Fixes: ec9404e40e ("x86/bhi: Add BHI mitigation knob")
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240409230806.1545822-1-daniel.sneddon@linux.intel.com
2024-04-10 07:05:04 +02:00
Ingo Molnar
dbbe13a6f6 x86/cpu: Improve readability of per-CPU cpumask initialization code
In smp_prepare_cpus_common() and x2apic_prepare_cpu():

 - use 'cpu' instead of 'i'
 - use 'node' instead of 'n'
 - use vertical alignment to improve readability
 - better structure basic blocks
 - reduce col80 checkpatch damage

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
2024-04-10 07:02:33 +02:00
Li RongQing
e0a9ac192f x86/cpu: Take NUMA node into account when allocating per-CPU cpumasks
per-CPU cpumasks are dominantly accessed from their own local CPUs,
so allocate them node-local to improve performance.

[ mingo: Rewrote the changelog. ]

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240410030114.6201-1-lirongqing@baidu.com
2024-04-10 06:55:31 +02:00
Borislav Petkov (AMD)
05d277c9a9 x86/alternatives: Sort local vars in apply_alternatives()
In a reverse x-mas tree.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240130105941.19707-5-bp@alien8.de
2024-04-09 18:16:57 +02:00
Borislav Petkov (AMD)
c3a3cb5c3d x86/alternatives: Optimize optimize_nops()
Return early if NOPs have already been optimized.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240130105941.19707-4-bp@alien8.de
2024-04-09 18:15:03 +02:00
Borislav Petkov (AMD)
da8f9cf7e7 x86/alternatives: Get rid of __optimize_nops()
There's no need to carve out bits of the NOP optimization functionality
and look at JMP opcodes - simply do one more NOPs optimization pass
at the end of patching.

A lot simpler code.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240130105941.19707-3-bp@alien8.de
2024-04-09 18:12:53 +02:00
Borislav Petkov (AMD)
f796c75837 x86/alternatives: Use a temporary buffer when optimizing NOPs
Instead of optimizing NOPs in-place, use a temporary buffer like the
usual alternatives patching flow does. This obviates the need to grab
locks when patching, see

  6778977590da ("x86/alternatives: Disable interrupts and sync when optimizing NOPs in place")

While at it, add nomenclature definitions clarifying and simplifying the
naming of function-local variables in the alternatives code.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240130105941.19707-2-bp@alien8.de
2024-04-09 18:08:11 +02:00
Borislav Petkov (AMD)
ee8962082a x86/alternatives: Catch late X86_FEATURE modifiers
After alternatives have been patched, changes to the X86_FEATURE flags
won't take effect and could potentially even be wrong.

Warn about it.

This is something which has been long overdue.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Link: https://lore.kernel.org/r/20240327154317.29909-3-bp@alien8.de
2024-04-09 18:03:53 +02:00
Ingo Molnar
d1eec383a8 Linux 6.9-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmYTAJYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG2bEH/jOBXd0ZCz+s9+F4
 TbSvDEE8UjitQdEJ5WSBY9CEvFI8OuVQr23gYPUn+gfgqLX0Vsp8HfxL6bBP5Tj6
 DSzAkwF/mvIfa6VLFmO1GmvyhYtmWkmbM825tNqKHSNTBc9cCLH3H+780wNtTMwQ
 VkB8O3KS/wZBGKSbFfiXW+fT3SkWIMLtdBAaox+vcxHXpiluXxSbxANRD5kTbdG0
 UAW9S4+3A0jNk/KeXEvJDqkf7C3ASsjtNPbK+gFDfOXxdNYFTC2IUf93rL61VB4s
 C2rtUklcLE8gFDtvkQ8JtGWmDj4pWPEDIyhICKlzP/aKCjXcNzLaoM0GJQYJS+PN
 aNevw24=
 =318J
 -----END PGP SIGNATURE-----

Merge tag 'v6.9-rc3' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-09 09:48:09 +02:00
Tony Luck
7911f145de x86/mce: Implement recovery for errors in TDX/SEAM non-root mode
Machine check SMIs (MSMI) signaled during SEAM operation (typically
inside TDX guests), on a system with Intel eMCA enabled, might eventually
be reported to the kernel #MC handler with the saved RIP on the stack
pointing to the instruction in kernel code after the SEAMCALL instruction
that entered the SEAM operation. Linux currently says that is a fatal
error and shuts down.

There is a new bit in IA32_MCG_STATUS that, when set to 1, indicates
that the machine check didn't originally occur at that saved RIP, but
during SEAM non-root operation.

Add new entries to the severity table to detect this for both data load
and instruction fetch that set the severity to "AR" (action required).

Increase the width of the mcgmask/mcgres fields in "struct severity"
from unsigned char to unsigned short since the new bit is in position 12.

Action required for these errors is just mark the page as poisoned and
return from the machine check handler.

HW ABI notes:
=============

The SEAM_NR bit in IA32_MCG_STATUS hasn't yet made it into the Intel
Software Developers' Manual. But it is described in section 16.5.2
of "Intel(R) Trust Domain Extensions (Intel(R) TDX) Module Base
Architecture Specification" downloadable from:

  https://cdrdv2.intel.com/v1/dl/getContent/733575

Backport notes:
===============

Little value in backporting this patch to stable or LTS kernels as
this is only relevant with support for TDX, which I assume won't be
backported. But for anyone taking this to v6.1 or older, you also
need commit:

  a51cbd0d86 ("x86/mce: Use severity table to handle uncorrected errors in kernel")

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240408180944.44638-1-tony.luck@intel.com
2024-04-09 09:30:36 +02:00
Ingo Molnar
0e6ebfd163 Linux 6.9-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmYTAJYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG2bEH/jOBXd0ZCz+s9+F4
 TbSvDEE8UjitQdEJ5WSBY9CEvFI8OuVQr23gYPUn+gfgqLX0Vsp8HfxL6bBP5Tj6
 DSzAkwF/mvIfa6VLFmO1GmvyhYtmWkmbM825tNqKHSNTBc9cCLH3H+780wNtTMwQ
 VkB8O3KS/wZBGKSbFfiXW+fT3SkWIMLtdBAaox+vcxHXpiluXxSbxANRD5kTbdG0
 UAW9S4+3A0jNk/KeXEvJDqkf7C3ASsjtNPbK+gFDfOXxdNYFTC2IUf93rL61VB4s
 C2rtUklcLE8gFDtvkQ8JtGWmDj4pWPEDIyhICKlzP/aKCjXcNzLaoM0GJQYJS+PN
 aNevw24=
 =318J
 -----END PGP SIGNATURE-----

Merge tag 'v6.9-rc3' into x86/cpu, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-09 09:28:41 +02:00
Linus Torvalds
2bb69f5fc7 x86 mitigations for the native BHI hardware vulnerabilty:
Branch History Injection (BHI) attacks may allow a malicious application to
 influence indirect branch prediction in kernel by poisoning the branch
 history. eIBRS isolates indirect branch targets in ring0.  The BHB can
 still influence the choice of indirect branch predictor entry, and although
 branch predictor entries are isolated between modes when eIBRS is enabled,
 the BHB itself is not isolated between modes.
 
 Add mitigations against it either with the help of microcode or with
 software sequences for the affected CPUs.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmYUKPMTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofT8EACJJix+GzGUcJjOvfWFZcxwziY152hO
 5XSzHOZZL6oz5Yk/Rye/S9RVTN7aDjn1CEvI0cD/ULxaTP869sS9dDdUcHhEJ//5
 6hjqWsWiKc1QmLjBy3Pcb97GZHQXM5a9D1f6jXnJD+0FMLbQHpzSEBit0H4tv/TC
 75myGgYihvUbhN9/bL10M5fz+UADU42nChvPWDMr9ukljjCqa46tPTmKUIAW5TWj
 /xsyf+Nk+4kZpdaidKGhpof6KCV2rNeevvzUGN8Pv5y13iAmvlyplqTcQ6dlubnZ
 CuDX5Ji9spNF9WmhKpLgy5N+Ocb64oVHov98N2zw1sT1N8XOYcSM0fBj7SQIFURs
 L7T4jBZS+1c3ZGJPPFWIaGjV8w1ZMhelglwJxjY7ZgRD6fK3mwRx/ks54J8H4HjE
 FbirXaZLeKlscDIOKtnxxKoIGwpdGwLKQYi/wEw7F9NhCLSj9wMia+j3uYIUEEHr
 6xEiYEtyjcV3ocxagH7eiHyrasOKG64vjx2h1XodusBA2Wrvgm/jXlchUu+wb6B4
 LiiZJt+DmOdQ1h5j3r2rt3hw7+nWa7kyq34qfN6NSUCHiedp6q7BClueSaKiOCGk
 RoNibNiS+CqaxwGxj/RGuvajEJeEMCsLuCxzT3aeaDBsqscW6Ka/HkGA76Tpb5nJ
 E3JyjYE7AlG4rw==
 =W0W3
 -----END PGP SIGNATURE-----

Merge tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mitigations from Thomas Gleixner:
 "Mitigations for the native BHI hardware vulnerabilty:

  Branch History Injection (BHI) attacks may allow a malicious
  application to influence indirect branch prediction in kernel by
  poisoning the branch history. eIBRS isolates indirect branch targets
  in ring0. The BHB can still influence the choice of indirect branch
  predictor entry, and although branch predictor entries are isolated
  between modes when eIBRS is enabled, the BHB itself is not isolated
  between modes.

  Add mitigations against it either with the help of microcode or with
  software sequences for the affected CPUs"

[ This also ends up enabling the full mitigation by default despite the
  system call hardening, because apparently there are other indirect
  calls that are still sufficiently reachable, and the 'auto' case just
  isn't hardened enough.

  We'll have some more inevitable tweaking in the future    - Linus ]

* tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM: x86: Add BHI_NO
  x86/bhi: Mitigate KVM by default
  x86/bhi: Add BHI mitigation knob
  x86/bhi: Enumerate Branch History Injection (BHI) bug
  x86/bhi: Define SPEC_CTRL_BHI_DIS_S
  x86/bhi: Add support for clearing branch history at syscall entry
  x86/syscall: Don't force use of indirect calls for system calls
  x86/bugs: Change commas to semicolons in 'spectre_v2' sysfs file
2024-04-08 20:07:51 -07:00
Pawan Gupta
95a6ccbdc7 x86/bhi: Mitigate KVM by default
BHI mitigation mode spectre_bhi=auto does not deploy the software
mitigation by default. In a cloud environment, it is a likely scenario
where userspace is trusted but the guests are not trusted. Deploying
system wide mitigation in such cases is not desirable.

Update the auto mode to unconditionally mitigate against malicious
guests. Deploy the software sequence at VMexit in auto mode also, when
hardware mitigation is not available. Unlike the force =on mode,
software sequence is not deployed at syscalls in auto mode.

Suggested-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:06 +02:00
Pawan Gupta
ec9404e40e x86/bhi: Add BHI mitigation knob
Branch history clearing software sequences and hardware control
BHI_DIS_S were defined to mitigate Branch History Injection (BHI).

Add cmdline spectre_bhi={on|off|auto} to control BHI mitigation:

 auto - Deploy the hardware mitigation BHI_DIS_S, if available.
 on   - Deploy the hardware mitigation BHI_DIS_S, if available,
        otherwise deploy the software sequence at syscall entry and
	VMexit.
 off  - Turn off BHI mitigation.

The default is auto mode which does not deploy the software sequence
mitigation.  This is because of the hardening done in the syscall
dispatch path, which is the likely target of BHI.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:05 +02:00
Pawan Gupta
be482ff950 x86/bhi: Enumerate Branch History Injection (BHI) bug
Mitigation for BHI is selected based on the bug enumeration. Add bits
needed to enumerate BHI bug.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:05 +02:00
Daniel Sneddon
0f4a837615 x86/bhi: Define SPEC_CTRL_BHI_DIS_S
Newer processors supports a hardware control BHI_DIS_S to mitigate
Branch History Injection (BHI). Setting BHI_DIS_S protects the kernel
from userspace BHI attacks without having to manually overwrite the
branch history.

Define MSR_SPEC_CTRL bit BHI_DIS_S and its enumeration CPUID.BHI_CTRL.
Mitigation is enabled later.

Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08 19:27:05 +02:00
Josh Poimboeuf
0cd01ac5dc x86/bugs: Change commas to semicolons in 'spectre_v2' sysfs file
Change the format of the 'spectre_v2' vulnerabilities sysfs file
slightly by converting the commas to semicolons, so that mitigations for
future variants can be grouped together and separated by commas.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2024-04-08 19:27:05 +02:00
Ingo Molnar
5f2ca44ed2 Merge branch 'linus' into x86/urgent, to pick up dependent commit
We want to fix:

  0e11073247 ("x86/retpoline: Do the necessary fixup to the Zen3/4 srso return thunk for !SRSO")

So merge in Linus's latest into x86/urgent to have it available.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-06 13:00:32 +02:00
Borislav Petkov (AMD)
3287c22957 x86/microcode/AMD: Remove unused PATCH_MAX_SIZE macro
Orphaned after

  05e91e7211 ("x86/microcode/AMD: Rip out static buffers")

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-04-05 23:33:08 +02:00
Arnd Bergmann
9e11fc78e2 x86/microcode/AMD: Avoid -Wformat warning with clang-15
Older versions of clang show a warning for amd.c after a fix for a gcc
warning:

  arch/x86/kernel/cpu/microcode/amd.c:478:47: error: format specifies type \
    'unsigned char' but the argument has type 'u16' (aka 'unsigned short') [-Werror,-Wformat]
                           "amd-ucode/microcode_amd_fam%02hhxh.bin", family);
                                                       ~~~~~~        ^~~~~~
                                                       %02hx

In clang-16 and higher, this warning is disabled by default, but clang-15 is
still supported, and it's trivial to avoid by adapting the types according
to the range of the passed data and the format string.

  [ bp: Massage commit message. ]

Fixes: 2e9064facc ("x86/microcode/amd: Fix snprintf() format string warning in W=1 build")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240405204919.1003409-1-arnd@kernel.org
2024-04-05 23:26:18 +02:00
Linus Torvalds
c88b9b4cde Including fixes from netfilter, bluetooth and bpf.
Fairly usual collection of driver and core fixes. The large selftest
 accompanying one of the fixes is also becoming a common occurrence.
 
 Current release - regressions:
 
  - ipv6: fix infinite recursion in fib6_dump_done()
 
  - net/rds: fix possible null-deref in newly added error path
 
 Current release - new code bugs:
 
  - net: do not consume a full cacheline for system_page_pool
 
  - bpf: fix bpf_arena-related file descriptor leaks in the verifier
 
  - drv: ice: fix freeing uninitialized pointers, fixing misuse of
    the newfangled __free() auto-cleanup
 
 Previous releases - regressions:
 
  - x86/bpf: fixes the BPF JIT with retbleed=stuff
 
  - xen-netfront: add missing skb_mark_for_recycle, fix page pool
    accounting leaks, revealed by recently added explicit warning
 
  - tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
    non-wildcard addresses
 
  - Bluetooth:
    - replace "hci_qca: Set BDA quirk bit if fwnode exists in DT"
      with better workarounds to un-break some buggy Qualcomm devices
    - set conn encrypted before conn establishes, fix re-connecting
      to some headsets which use slightly unusual sequence of msgs
 
  - mptcp:
    - prevent BPF accessing lowat from a subflow socket
    - don't account accept() of non-MPC client as fallback to TCP
 
  - drv: mana: fix Rx DMA datasize and skb_over_panic
 
  - drv: i40e: fix VF MAC filter removal
 
 Previous releases - always broken:
 
  - gro: various fixes related to UDP tunnels - netns crossing problems,
    incorrect checksum conversions, and incorrect packet transformations
    which may lead to panics
 
  - bpf: support deferring bpf_link dealloc to after RCU grace period
 
  - nf_tables:
    - release batch on table validation from abort path
    - release mutex after nft_gc_seq_end from abort path
    - flush pending destroy work before exit_net release
 
  - drv: r8169: skip DASH fw status checks when DASH is disabled
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmYO91wACgkQMUZtbf5S
 IrvHBQ/+PH/hobI+o3aLqwtdVlyxhmA31bVQ0I3aTIZV7c3ideMBcfgYa8TiZM2g
 pLiBiWoJXCN0h33wgUmlUee+sBvpoPCdPjGD/g99OJyKWjVt2D7ObnSwxMfjHUoq
 dtcN2JupqHP0SHz6wPPCmnWtTLxSGUsDdKjmkHQcCRhQIGTYFkYyHcOmPgNbBjaB
 6jvmH1kE9WQTFD8QcOMaZmXQ5omoafpxxQLsgundtOWxPWHL7XNvk0B5k/ESDRG1
 ujbxwtNnOESzpxZMQ6OyZlsnN/1tWfnEvLJFYVwf9BMrOlahJT/f5b/EJ9/Xy4dC
 zkAp7Tul3uAvNRKhBNhVBTWQbnIykmiNMp1VBFmiScQAy8hcnX+6d4LKTIHxbXZK
 V3AqcUS6YU2nyMdLRkhvq9f3uxD6hcY19gQdyqgCUPOtyUAs/JPv7lXQjCuuEqkq
 urEZkigUApnEqPIrIqANJ7nXUy3U0K8qU6evOZoGZ5OdiKeNKC3+tIr+g2f1ZUZq
 a7Dkat7JH9WQ7IG8Geody6Z30K9EpSqYMTKzB5wTfmuqw6cV8bl9OAW9UOSRK0GL
 pyG8GwpkpFPkNiZdu9Zt44Pno5xdLIa1+C3QZR0r5CJWYAzCbI80MppP5veF9Mw+
 v+2v8iBWuh9iv0AUj9KJOwG5QQ+EXLUuSlhtx/DFnmn2CJ9plXI=
 =6bQI
 -----END PGP SIGNATURE-----

Merge tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter, bluetooth and bpf.

  Fairly usual collection of driver and core fixes. The large selftest
  accompanying one of the fixes is also becoming a common occurrence.

  Current release - regressions:

   - ipv6: fix infinite recursion in fib6_dump_done()

   - net/rds: fix possible null-deref in newly added error path

  Current release - new code bugs:

   - net: do not consume a full cacheline for system_page_pool

   - bpf: fix bpf_arena-related file descriptor leaks in the verifier

   - drv: ice: fix freeing uninitialized pointers, fixing misuse of the
     newfangled __free() auto-cleanup

  Previous releases - regressions:

   - x86/bpf: fixes the BPF JIT with retbleed=stuff

   - xen-netfront: add missing skb_mark_for_recycle, fix page pool
     accounting leaks, revealed by recently added explicit warning

   - tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
     non-wildcard addresses

   - Bluetooth:
      - replace "hci_qca: Set BDA quirk bit if fwnode exists in DT" with
        better workarounds to un-break some buggy Qualcomm devices
      - set conn encrypted before conn establishes, fix re-connecting to
        some headsets which use slightly unusual sequence of msgs

   - mptcp:
      - prevent BPF accessing lowat from a subflow socket
      - don't account accept() of non-MPC client as fallback to TCP

   - drv: mana: fix Rx DMA datasize and skb_over_panic

   - drv: i40e: fix VF MAC filter removal

  Previous releases - always broken:

   - gro: various fixes related to UDP tunnels - netns crossing
     problems, incorrect checksum conversions, and incorrect packet
     transformations which may lead to panics

   - bpf: support deferring bpf_link dealloc to after RCU grace period

   - nf_tables:
      - release batch on table validation from abort path
      - release mutex after nft_gc_seq_end from abort path
      - flush pending destroy work before exit_net release

   - drv: r8169: skip DASH fw status checks when DASH is disabled"

* tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  netfilter: validate user input for expected length
  net/sched: act_skbmod: prevent kernel-infoleak
  net: usb: ax88179_178a: avoid the interface always configured as random address
  net: dsa: sja1105: Fix parameters order in sja1110_pcs_mdio_write_c45()
  net: ravb: Always update error counters
  net: ravb: Always process TX descriptor ring
  netfilter: nf_tables: discard table flag update with pending basechain deletion
  netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
  netfilter: nf_tables: reject new basechain after table flag update
  netfilter: nf_tables: flush pending destroy work before exit_net release
  netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
  netfilter: nf_tables: release batch on table validation from abort path
  Revert "tg3: Remove residual error handling in tg3_suspend"
  tg3: Remove residual error handling in tg3_suspend
  net: mana: Fix Rx DMA datasize and skb_over_panic
  net/sched: fix lockdep splat in qdisc_tree_reduce_backlog()
  net: phy: micrel: lan8814: Fix when enabling/disabling 1-step timestamping
  net: stmmac: fix rx queue priority assignment
  net: txgbe: fix i2c dev name cannot match clkdev
  net: fec: Set mac_managed_pm during probe
  ...
2024-04-04 14:49:10 -07:00
Borislav Petkov (AMD)
3ddf944b32 x86/mce: Make sure to grab mce_sysfs_mutex in set_bank()
Modifying a MCA bank's MCA_CTL bits which control which error types to
be reported is done over

  /sys/devices/system/machinecheck/
  ├── machinecheck0
  │   ├── bank0
  │   ├── bank1
  │   ├── bank10
  │   ├── bank11
  ...

sysfs nodes by writing the new bit mask of events to enable.

When the write is accepted, the kernel deletes all current timers and
reinits all banks.

Doing that in parallel can lead to initializing a timer which is already
armed and in the timer wheel, i.e., in use already:

  ODEBUG: init active (active state 0) object: ffff888063a28000 object
  type: timer_list hint: mce_timer_fn+0x0/0x240 arch/x86/kernel/cpu/mce/core.c:2642
  WARNING: CPU: 0 PID: 8120 at lib/debugobjects.c:514
  debug_print_object+0x1a0/0x2a0 lib/debugobjects.c:514

Fix that by grabbing the sysfs mutex as the rest of the MCA sysfs code
does.

Reported by: Yue Sun <samsun1006219@gmail.com>
Reported by: xingwei lee <xrivendell7@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/CAEkJfYNiENwQY8yV1LYJ9LjJs%2Bx_-PqMv98gKig55=2vbzffRw@mail.gmail.com
2024-04-04 17:25:15 +02:00
Tong Tiangen
cb517619f9 x86/extable: Remove unused fixup type EX_TYPE_COPY
After

  034ff37d34 ("x86: rewrite '__copy_user_nocache' function")

rewrote __copy_user_nocache() to use EX_TYPE_UACCESS instead of the
EX_TYPE_COPY exception type, there are no more EX_TYPE_COPY users, so
remove it.

  [ bp: Massage commit message. ]

Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240204082627.3892816-2-tongtiangen@huawei.com
2024-04-04 17:01:40 +02:00
Borislav Petkov (AMD)
0ecaefb303 x86/CPU/AMD: Track SNP host status with cc_platform_*()
The host SNP worthiness can determined later, after alternatives have
been patched, in snp_rmptable_init() depending on cmdline options like
iommu=pt which is incompatible with SNP, for example.

Which means that one cannot use X86_FEATURE_SEV_SNP and will need to
have a special flag for that control.

Use that newly added CC_ATTR_HOST_SEV_SNP in the appropriate places.

Move kdump_sev_callback() to its rightful place, while at it.

Fixes: 216d106c7f ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Link: https://lore.kernel.org/r/20240327154317.29909-6-bp@alien8.de
2024-04-04 10:40:30 +02:00
Jason A. Donenfeld
99485c4c02 x86/coco: Require seeding RNG with RDRAND on CoCo systems
There are few uses of CoCo that don't rely on working cryptography and
hence a working RNG. Unfortunately, the CoCo threat model means that the
VM host cannot be trusted and may actively work against guests to
extract secrets or manipulate computation. Since a malicious host can
modify or observe nearly all inputs to guests, the only remaining source
of entropy for CoCo guests is RDRAND.

If RDRAND is broken -- due to CPU hardware fault -- the RNG as a whole
is meant to gracefully continue on gathering entropy from other sources,
but since there aren't other sources on CoCo, this is catastrophic.
This is mostly a concern at boot time when initially seeding the RNG, as
after that the consequences of a broken RDRAND are much more
theoretical.

So, try at boot to seed the RNG using 256 bits of RDRAND output. If this
fails, panic(). This will also trigger if the system is booted without
RDRAND, as RDRAND is essential for a safe CoCo boot.

Add this deliberately to be "just a CoCo x86 driver feature" and not
part of the RNG itself. Many device drivers and platforms have some
desire to contribute something to the RNG, and add_device_randomness()
is specifically meant for this purpose.

Any driver can call it with seed data of any quality, or even garbage
quality, and it can only possibly make the quality of the RNG better or
have no effect, but can never make it worse.

Rather than trying to build something into the core of the RNG, consider
the particular CoCo issue just a CoCo issue, and therefore separate it
all out into driver (well, arch/platform) code.

  [ bp: Massage commit message. ]

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240326160735.73531-1-Jason@zx2c4.com
2024-04-04 10:40:19 +02:00
Li RongQing
af813acf8c x86/fpu: Update fpu_swap_kvm_fpu() uses in comments as well
The following commit:

  d69c1382e1 ("x86/kvm: Convert FPU handling to a single swap buffer")

reworked KVM FPU handling, but forgot to update the comments
in xstate_op_valid(): fpu_swap_kvm_fpu() doesn't exist anymore,
fpu_swap_kvm_fpstate() is used instead.

Update the comments accordingly.

[ mingo: Improved the changelog. ]

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240403091803.818-1-lirongqing@baidu.com
2024-04-04 09:49:42 +02:00
Linus Torvalds
0f099dc9d1 ARM:
- Ensure perf events programmed to count during guest execution
   are actually enabled before entering the guest in the nVHE
   configuration.
 
 - Restore out-of-range handler for stage-2 translation faults.
 
 - Several fixes to stage-2 TLB invalidations to avoid stale
   translations, possibly including partial walk caches.
 
 - Fix early handling of architectural VHE-only systems to ensure E2H is
   appropriately set.
 
 - Correct a format specifier warning in the arch_timer selftest.
 
 - Make the KVM banner message correctly handle all of the possible
   configurations.
 
 RISC-V:
 
 - Remove redundant semicolon in num_isa_ext_regs().
 
 - Fix APLIC setipnum_le/be write emulation.
 
 - Fix APLIC in_clrip[x] read emulation.
 
 x86:
 
 - Fix a bug in KVM_SET_CPUID{2,} where KVM looks at the wrong CPUID entries (old
   vs. new) and ultimately neglects to clear PV_UNHALT from vCPUs with HLT-exiting
   disabled.
 
 - Documentation fixes for SEV.
 
 - Fix compat ABI for KVM_MEMORY_ENCRYPT_OP.
 
 - Fix a 14-year-old goof in a declaration shared by host and guest; the enabled
   field used by Linux when running as a guest pushes the size of "struct
   kvm_vcpu_pv_apf_data" from 64 to 68 bytes.  This is really unconsequential
   because KVM never consumes anything beyond the first 64 bytes, but the
   resulting struct does not match the documentation.
 
 Selftests:
 
 - Fix spelling mistake in arch_timer selftest.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmYMOJYUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroP2zAf/Z7/cK0+yFSvm7/tsbWtjnWofad/p
 82puu0V+8lZSjGVs3AydiDCV+FahvLS0QIwgrffVr4XA10Km5ZZMjZyJ3uH4xki/
 VFFsDnZPdKuj55T0wwN7JFn0YVOMdtgcP0b+F8aMbkL0uoJXjutOMKNhssuW12kw
 9cmPjaBWm/bfrfoTUUB9mCh0Ub3HKpguYwTLQuf6Fyn2FK7oORpt87Zi+oIKUn6H
 pFXFtZYduLg6M2LXvZqsXZLXnvABPjANNWEhiiwrvuF/wmXXTwTpvRXlYXhCvpAN
 q0AhxPhPm3NnsmRhEB6SmoMjXyZIByezcEiqAspBrUvEqs/2u6VyzFMrXw==
 =PlsI
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "ARM:

   - Ensure perf events programmed to count during guest execution are
     actually enabled before entering the guest in the nVHE
     configuration

   - Restore out-of-range handler for stage-2 translation faults

   - Several fixes to stage-2 TLB invalidations to avoid stale
     translations, possibly including partial walk caches

   - Fix early handling of architectural VHE-only systems to ensure E2H
     is appropriately set

   - Correct a format specifier warning in the arch_timer selftest

   - Make the KVM banner message correctly handle all of the possible
     configurations

  RISC-V:

   - Remove redundant semicolon in num_isa_ext_regs()

   - Fix APLIC setipnum_le/be write emulation

   - Fix APLIC in_clrip[x] read emulation

  x86:

   - Fix a bug in KVM_SET_CPUID{2,} where KVM looks at the wrong CPUID
     entries (old vs. new) and ultimately neglects to clear PV_UNHALT
     from vCPUs with HLT-exiting disabled

   - Documentation fixes for SEV

   - Fix compat ABI for KVM_MEMORY_ENCRYPT_OP

   - Fix a 14-year-old goof in a declaration shared by host and guest;
     the enabled field used by Linux when running as a guest pushes the
     size of "struct kvm_vcpu_pv_apf_data" from 64 to 68 bytes. This is
     really unconsequential because KVM never consumes anything beyond
     the first 64 bytes, but the resulting struct does not match the
     documentation

  Selftests:

   - Fix spelling mistake in arch_timer selftest"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
  KVM: arm64: Rationalise KVM banner output
  arm64: Fix early handling of FEAT_E2H0 not being implemented
  KVM: arm64: Ensure target address is granule-aligned for range TLBI
  KVM: arm64: Use TLBI_TTL_UNKNOWN in __kvm_tlb_flush_vmid_range()
  KVM: arm64: Don't pass a TLBI level hint when zapping table entries
  KVM: arm64: Don't defer TLB invalidation when zapping table entries
  KVM: selftests: Fix __GUEST_ASSERT() format warnings in ARM's arch timer test
  KVM: arm64: Fix out-of-IPA space translation fault handling
  KVM: arm64: Fix host-programmed guest events in nVHE
  RISC-V: KVM: Fix APLIC in_clrip[x] read emulation
  RISC-V: KVM: Fix APLIC setipnum_le/be write emulation
  RISC-V: KVM: Remove second semicolon
  KVM: selftests: Fix spelling mistake "trigged" -> "triggered"
  Documentation: kvm/sev: clarify usage of KVM_MEMORY_ENCRYPT_OP
  Documentation: kvm/sev: separate description of firmware
  KVM: SEV: fix compat ABI for KVM_MEMORY_ENCRYPT_OP
  KVM: selftests: Check that PV_UNHALT is cleared when HLT exiting is disabled
  KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT
  KVM: x86: Introduce __kvm_get_hypervisor_cpuid() helper
  KVM: SVM: Return -EINVAL instead of -EBUSY on attempt to re-init SEV/SEV-ES
  ...
2024-04-03 10:26:37 -07:00
Thorsten Blum
0049f04c7d x86/apic: Improve data types to fix Coccinelle warnings
Given that acpi_pm_read_early() returns a u32 (masked to 24 bits), several
variables that store its return value are improved by adjusting their data
types from unsigned long to u32. Specifically, change deltapm's type from
long to u32 because its value fits into 32 bits and it cannot be negative.

These data type improvements resolve the following two Coccinelle/
coccicheck warnings reported by do_div.cocci:

arch/x86/kernel/apic/apic.c:734:1-7: WARNING: do_div() does a 64-by-32
division, please consider using div64_long instead.

arch/x86/kernel/apic/apic.c:742:2-8: WARNING: do_div() does a 64-by-32
division, please consider using div64_long instead.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20240318104721.117741-3-thorsten.blum%40toblux.com
2024-04-03 08:32:04 -07:00
Andy Shevchenko
62fbc013c1 x86/rtc: Remove unused intel-mid.h
The rtc driver used to be disabled with a direct check for Intel MID
platforms.  But that direct check was replaced long ago (see second
link).  Remove the (unused since 2016) include.

[ dhansen: rewrite changelog to include some history ]

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20240305161024.1364098-1-andriy.shevchenko%40linux.intel.com
Link: https://lore.kernel.org/all/1460592286-300-5-git-send-email-mcgrof@kernel.org
2024-04-03 08:24:48 -07:00
Reinette Chatre
c3eeb1ffc6 x86/resctrl: Fix uninitialized memory read when last CPU of domain goes offline
Tony encountered this OOPS when the last CPU of a domain goes
offline while running a kernel built with CONFIG_NO_HZ_FULL:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0
    Oops: 0000 [#1] PREEMPT SMP NOPTI
    ...
    RIP: 0010:__find_nth_andnot_bit+0x66/0x110
    ...
    Call Trace:
     <TASK>
     ? __die()
     ? page_fault_oops()
     ? exc_page_fault()
     ? asm_exc_page_fault()
     cpumask_any_housekeeping()
     mbm_setup_overflow_handler()
     resctrl_offline_cpu()
     resctrl_arch_offline_cpu()
     cpuhp_invoke_callback()
     cpuhp_thread_fun()
     smpboot_thread_fn()
     kthread()
     ret_from_fork()
     ret_from_fork_asm()
     </TASK>

The NULL pointer dereference is encountered while searching for another
online CPU in the domain (of which there are none) that can be used to
run the MBM overflow handler.

Because the kernel is configured with CONFIG_NO_HZ_FULL the search for
another CPU (in its effort to prefer those CPUs that aren't marked
nohz_full) consults the mask representing the nohz_full CPUs,
tick_nohz_full_mask. On a kernel with CONFIG_CPUMASK_OFFSTACK=y
tick_nohz_full_mask is not allocated unless the kernel is booted with
the "nohz_full=" parameter and because of that any access to
tick_nohz_full_mask needs to be guarded with tick_nohz_full_enabled().

Replace the IS_ENABLED(CONFIG_NO_HZ_FULL) with tick_nohz_full_enabled().
The latter ensures tick_nohz_full_mask can be accessed safely and can be
used whether kernel is built with CONFIG_NO_HZ_FULL enabled or not.

[ Use Ingo's suggestion that combines the two NO_HZ checks into one. ]

Fixes: a4846aaf39 ("x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow")
Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/ff8dfc8d3dcb04b236d523d1e0de13d2ef585223.1711993956.git.reinette.chatre@intel.com
Closes: https://lore.kernel.org/lkml/ZgIFT5gZgIQ9A9G7@agluck-desk3/
2024-04-03 09:30:01 +02:00
Saurabh Sengar
f87136c057 x86/of: Change x86_dtb_parse_smp_config() to static
x86_dtb_parse_smp_config() is called locally only, change it to static.

Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/1712068830-4513-5-git-send-email-ssengar@linux.microsoft.com
2024-04-03 08:49:56 +02:00
Saurabh Sengar
85900d0618 x86/of: Map NUMA node to CPUs as per DeviceTree
Currently for DeviceTree bootup, x86 code does the default mapping of
CPUs to NUMA, which is wrong. This can cause incorrect mapping and WARNs
on SMT enabled systems:

  CPU #1's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
  WARNING: CPU: 1 PID: 0 at topology_sane.isra.0+0x5c/0x6d
  match_smt+0xf6/0xfc
  set_cpu_sibling_map.cold+0x24f/0x512
  start_secondary+0x5c/0x110

Call the set_apicid_to_node() function in dtb_cpu_setup() for setting
the NUMA to CPU mapping for DeviceTree platforms.

Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/1712068830-4513-4-git-send-email-ssengar@linux.microsoft.com
2024-04-03 08:49:15 +02:00
Saurabh Sengar
222408cde4 x86/of: Set the parse_smp_cfg for all the DeviceTree platforms by default
x86_dtb_parse_smp_config() must be set by DeviceTree platform for
parsing SMP configuration. Set the parse_smp_cfg pointer to
x86_dtb_parse_smp_config() by default so that all the dtb platforms
need not to assign it explicitly. Today there are only two platforms
using DeviceTree in x86, ce4100 and hv_vtl. Remove the explicit
assignment of x86_dtb_parse_smp_config() function from these.

Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/1712068830-4513-3-git-send-email-ssengar@linux.microsoft.com
2024-04-03 08:46:08 +02:00
Paolo Bonzini
52b761b48f KVM/arm64 fixes for 6.9, part #1
- Ensure perf events programmed to count during guest execution
    are actually enabled before entering the guest in the nVHE
    configuration.
 
  - Restore out-of-range handler for stage-2 translation faults.
 
  - Several fixes to stage-2 TLB invalidations to avoid stale
    translations, possibly including partial walk caches.
 
  - Fix early handling of architectural VHE-only systems to ensure E2H is
    appropriately set.
 
  - Correct a format specifier warning in the arch_timer selftest.
 
  - Make the KVM banner message correctly handle all of the possible
    configurations.
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZgtpWBccb2xpdmVyLnVw
 dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFoilAQCQk6kLIeuih5QOe50fK4XkNsyg
 PGcxw0a0BP8cfjtJsgEArwLlfHQOTE4tRWtXyEHvapJfe/bE1hjLmzUJx7BwLQ4=
 =6hNq
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.9, part #1

 - Ensure perf events programmed to count during guest execution
   are actually enabled before entering the guest in the nVHE
   configuration.

 - Restore out-of-range handler for stage-2 translation faults.

 - Several fixes to stage-2 TLB invalidations to avoid stale
   translations, possibly including partial walk caches.

 - Fix early handling of architectural VHE-only systems to ensure E2H is
   appropriately set.

 - Correct a format specifier warning in the arch_timer selftest.

 - Make the KVM banner message correctly handle all of the possible
   configurations.
2024-04-02 12:26:15 -04:00
Joan Bruguera Micó
6a53745300 x86/bpf: Fix IP for relocating call depth accounting
The commit:

  59bec00ace ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()")

made PER_CPU_VAR() to use rip-relative addressing, hence
INCREMENT_CALL_DEPTH macro and skl_call_thunk_template got rip-relative
asm code inside of it. A follow up commit:

  17bce3b2ae ("x86/callthunks: Handle %rip-relative relocations in call thunk template")

changed x86_call_depth_emit_accounting() to use apply_relocation(),
but mistakenly assumed that the code is being patched in-place (where
the destination of the relocation matches the address of the code),
using *pprog as the destination ip. This is not true for the call depth
accounting, emitted by the BPF JIT, so the calculated address was wrong,
JIT-ed BPF progs on kernels with call depth tracking got broken and
usually caused a page fault.

Pass the destination IP when the BPF JIT emits call depth accounting.

Fixes: 17bce3b2ae ("x86/callthunks: Handle %rip-relative relocations in call thunk template")
Signed-off-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20240401185821.224068-3-ubizjak@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-01 20:37:56 -07:00
Linus Torvalds
448f828feb - Define the correct set of default hw events on AMD Zen4
- Use the correct stalled cycles PMCs on AMD Zen2 and newer
 
 - Fix detection of the LBR freeze feature on AMD
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmYJR3oACgkQEsHwGGHe
 VUo/eRAAnGUq0rSi4ZUqsTtbu/jNepKbaeS7jR3/p3V6iSwfmjEoJ2xE6uIdN5vD
 fnL6UkeDRMc8LaKHIdLD4ZbN8NRa3hOyzf5K7wwVp5bwle0NeyrcG5wVK8LgT/X/
 rPSk7YxoR5frkYcA6zZwezJOv3HGYt8RMr5bKMD3YiJ35/XCdPsKnbHJTHb+F23Y
 tYFBeyzRzOebQu0fFKP8ML9LbqvELESqJ5Smwu/jQ25aBW7sFsUNAxseGU2tYahX
 c6pm8ytIlpZFwmi1HzXmMICF7lWugFO/KkP/ndCM1IpmujVGy56hrpLEy5gT3gzh
 NE/nZDoqJAO2zhg2FuKybh3akdT+IgXUTjxYMYGUOkJIChzie3o4p9OqichgTIv5
 +ngAq5qzjAHfC7cZ5nA96XWkw1fFU6BqlA3KPs1mzQU9uTDz7tSkyxIitp3C8L0B
 JlilTr6yHUprJzFwCDk4hb+hfP5A9qYnrNeacMlldZmbH1jLYHEzB9FudK82MeM+
 tIKFnM2jyRaRs/s8n+/UdrOVFNGk/+scX8GQllEBF451a8J5x1CYeHB7dGW+4pf/
 cx5TupHg8dDRgNMsbaeEvwERoPu4h/VRozfBi6r1WjjskVm24lIdFFKTSm3BDbLk
 EH3cflv/h8KE19cr0XLb7aYYw/9jb4cpnb0WBMw1gQOSvUMXzxU=
 =gmta
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 perf fixes from Borislav Petkov:

 - Define the correct set of default hw events on AMD Zen4

 - Use the correct stalled cycles PMCs on AMD Zen2 and newer

 - Fix detection of the LBR freeze feature on AMD

* tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd/core: Define a proper ref-cycles event for Zen 4 and later
  perf/x86/amd/core: Update and fix stalled-cycles-* events for Zen 2 and later
  perf/x86/amd/lbr: Use freeze based on availability
  x86/cpufeatures: Add new word for scattered features
2024-03-31 10:43:11 -07:00
Julian Stecklina
4faa0e5d6d x86/boot: Move kernel cmdline setup earlier in the boot process (again)
When split_lock_detect=off (or similar) is specified in
CONFIG_CMDLINE, its effect is lost. The flow is currently this:

	setup_arch():
	  -> early_cpu_init()
	    -> early_identify_cpu()
	      -> sld_setup()
		-> sld_state_setup()
		  -> Looks for split_lock_detect in boot_command_line

	  -> e820__memory_setup()

	  -> Assemble final command line:
	     boot_command_line = builtin_cmdline + boot_cmdline

	  -> parse_early_param()

There were earlier attempts at fixing this in:

  8d48bf8206 ("x86/boot: Pull up cmdline preparation and early param parsing")

later reverted in:

  fbe6183998 ("Revert "x86/boot: Pull up cmdline preparation and early param parsing"")

... because parse_early_param() can't be called before
e820__memory_setup().

In this patch, we just move the command line concatenation to the
beginning of early_cpu_init(). This should fix sld_state_setup(), while
not running in the same issues as the earlier attempt.

The order is now:

	setup_arch():
	  -> Assemble final command line:
	     boot_command_line = builtin_cmdline + boot_cmdline

	  -> early_cpu_init()
	    -> early_identify_cpu()
	      -> sld_setup()
		-> sld_state_setup()
		  -> Looks for split_lock_detect in boot_command_line

	  -> e820__memory_setup()

	  -> parse_early_param()

Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-kernel@vger.kernel.org
2024-03-29 08:19:12 +01:00
Alex Shi
f9f62a877d x86/dumpstack: Use uniform "Oops: " prefix for die() messages
panic() prints a uniform prompt: "Kernel panic - not syncing:",
but die() messages don't have any of that, the message is the
raw user-defined message with no prefix.

There's companies that collect thousands of die() messages per week,
but w/o a prompt in dmesg, it's hard to write scripts to collect and
analize the reasons.

Add a uniform "Oops:" prefix like other architectures.

[ mingo: Rewrote changelog. ]

Signed-off-by: Alex Shi <alexs@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240327024419.471433-1-alexs@kernel.org
2024-03-27 08:45:19 +01:00
Kevin Loughlin
0f4a1e8098 x86/sev: Skip ROM range scans and validation for SEV-SNP guests
SEV-SNP requires encrypted memory to be validated before access.
Because the ROM memory range is not part of the e820 table, it is not
pre-validated by the BIOS. Therefore, if a SEV-SNP guest kernel wishes
to access this range, the guest must first validate the range.

The current SEV-SNP code does indeed scan the ROM range during early
boot and thus attempts to validate the ROM range in probe_roms().
However, this behavior is neither sufficient nor necessary for the
following reasons:

* With regards to sufficiency, if EFI_CONFIG_TABLES are not enabled and
  CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK is set, the kernel will
  attempt to access the memory at SMBIOS_ENTRY_POINT_SCAN_START (which
  falls in the ROM range) prior to validation.

  For example, Project Oak Stage 0 provides a minimal guest firmware
  that currently meets these configuration conditions, meaning guests
  booting atop Oak Stage 0 firmware encounter a problematic call chain
  during dmi_setup() -> dmi_scan_machine() that results in a crash
  during boot if SEV-SNP is enabled.

* With regards to necessity, SEV-SNP guests generally read garbage
  (which changes across boots) from the ROM range, meaning these scans
  are unnecessary. The guest reads garbage because the legacy ROM range
  is unencrypted data but is accessed via an encrypted PMD during early
  boot (where the PMD is marked as encrypted due to potentially mapping
  actually-encrypted data in other PMD-contained ranges).

In one exceptional case, EISA probing treats the ROM range as
unencrypted data, which is inconsistent with other probing.

Continuing to allow SEV-SNP guests to use garbage and to inconsistently
classify ROM range encryption status can trigger undesirable behavior.
For instance, if garbage bytes appear to be a valid signature, memory
may be unnecessarily reserved for the ROM range. Future code or other
use cases may result in more problematic (arbitrary) behavior that
should be avoided.

While one solution would be to overhaul the early PMD mapping to always
treat the ROM region of the PMD as unencrypted, SEV-SNP guests do not
currently rely on data from the ROM region during early boot (and even
if they did, they would be mostly relying on garbage data anyways).

As a simpler solution, skip the ROM range scans (and the otherwise-
necessary range validation) during SEV-SNP guest early boot. The
potential SEV-SNP guest crash due to lack of ROM range validation is
thus avoided by simply not accessing the ROM range.

In most cases, skip the scans by overriding problematic x86_init
functions during sme_early_init() to SNP-safe variants, which can be
likened to x86_init overrides done for other platforms (ex: Xen); such
overrides also avoid the spread of cc_platform_has() checks throughout
the tree.

In the exceptional EISA case, still use cc_platform_has() for the
simplest change, given (1) checks for guest type (ex: Xen domain status)
are already performed here, and (2) these checks occur in a subsys
initcall instead of an x86_init function.

  [ bp: Massage commit message, remove "we"s. ]

Fixes: 9704c07bf9 ("x86/kernel: Validate ROM memory before accessing when SEV-SNP is active")
Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20240313121546.2964854-1-kevinloughlin@google.com
2024-03-26 15:22:35 +01:00
Tony Luck
108c6494bd x86/mce: Dynamically size space for machine check records
Systems with a large number of CPUs may generate a large number of
machine check records when things go seriously wrong. But Linux has
a fixed-size buffer that can only capture a few dozen errors.

Allocate space based on the number of CPUs (with a minimum value based
on the historical fixed buffer that could store 80 records).

  [ bp: Rename local var from tmpp to something more telling: gpool. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Avadhut Naik <avadhut.naik@amd.com>
Link: https://lore.kernel.org/r/20240307192704.37213-1-tony.luck@intel.com
2024-03-26 12:40:42 +01:00
Paul E. McKenney
3186b61812 x86/nmi: Upgrade NMI backtrace stall checks & messages
The commit to improve NMI stall debuggability:

  344da544f1 ("x86/nmi: Print reasons why backtrace NMIs are ignored")

... has shown value, but widespread use has also identified a few
opportunities for improvement.

The systems have (as usual) shown far more creativity than that commit's
author, demonstrating yet again that failing CPUs can do whatever they want.

In addition, the current message format is less friendly than one might
like to those attempting to use these messages to identify failing CPUs.

Therefore, separately flag CPUs that, during the full time that the
stack-backtrace request was waiting, were always in an NMI handler,
were never in an NMI handler, or exited one NMI handler.

Also, split the message identifying the CPU and the time since that CPU's
last NMI-related activity so that a single line identifies the CPU without
any other variable information, greatly reducing the processing overhead
required to identify repeat-offender CPUs.

Co-developed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/ab4d70c8-c874-42dc-b206-643018922393@paulmck-laptop
2024-03-26 10:07:59 +01:00
Bingsong Si
cd2236c2f4 x86/cpu: Clear TME feature flag if TME is not enabled by BIOS
When TME is disabled by BIOS, the dmesg output is:

  x86/tme: not enabled by BIOS

... and TME functionality is not enabled by the kernel, but the TME feature
is still shown in /proc/cpuinfo.

Clear it.

[ mingo: Clarified changelog ]

Signed-off-by: Bingsong Si <sibs@chinatelecom.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "Huang, Kai" <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240311071938.13247-1-sibs@chinatelecom.cn
2024-03-26 09:49:32 +01:00
Yuntao Wang
c3262d3d19 x86/head: Simplify relative include path to xen-head.S
Fix the relative path specification in the include directives adding
xen-head.S to the kernel's head_*.S files since they both have
"arch/x86/" as prefix.

  [ bp: Rewrite commit message. ]

Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231231121904.24622-1-ytcoode@gmail.com
2024-03-25 15:24:10 +01:00
Borislav Petkov (AMD)
29ba89f189 x86/CPU/AMD: Improve the erratum 1386 workaround
Disable XSAVES only on machines which haven't loaded the microcode
revision containing the erratum fix.

This will come in handy when running archaic OSes as guests. OSes whose
brilliant programmers thought that CPUID is overrated and one should not
query it but use features directly, ala shoot first, ask questions
later... but only if you're alive after the shooting.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20240324200525.GBZgCHhYFsBj12PrKv@fat_crate.local
2024-03-25 14:05:24 +01:00
Sandipan Das
598c2fafc0 perf/x86/amd/lbr: Use freeze based on availability
Currently, the LBR code assumes that LBR Freeze is supported on all processors
when X86_FEATURE_AMD_LBR_V2 is available i.e. CPUID leaf 0x80000022[EAX]
bit 1 is set. This is incorrect as the availability of the feature is
additionally dependent on CPUID leaf 0x80000022[EAX] bit 2 being set,
which may not be set for all Zen 4 processors.

Define a new feature bit for LBR and PMC freeze and set the freeze enable bit
(FLBRI) in DebugCtl (MSR 0x1d9) conditionally.

It should still be possible to use LBR without freeze for profile-guided
optimization of user programs by using an user-only branch filter during
profiling. When the user-only filter is enabled, branches are no longer
recorded after the transition to CPL 0 upon PMI arrival. When branch
entries are read in the PMI handler, the branch stack does not change.

E.g.

  $ perf record -j any,u -e ex_ret_brn_tkn ./workload

Since the feature bit is visible under flags in /proc/cpuinfo, it can be
used to determine the feasibility of use-cases which require LBR Freeze
to be supported by the hardware such as profile-guided optimization of
kernels.

Fixes: ca5b7c0d96 ("perf/x86/amd/lbr: Add LbrExtV2 branch record support")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/69a453c97cfd11c6f2584b19f937fe6df741510f.1711091584.git.sandipan.das@amd.com
2024-03-25 11:16:55 +01:00
Linus Torvalds
5e74df2f8f A set of x86 fixes:
- Ensure that the encryption mask at boot is properly propagated on
     5-level page tables, otherwise the PGD entry is incorrectly set to
     non-encrypted, which causes system crashes during boot.
 
   - Undo the deferred 5-level page table setup as it cannot work with
     memory encryption enabled.
 
   - Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset
     to the default value but the cached variable is not, so subsequent
     comparisons might yield the wrong result and as a consequence the
     result prevents updating the MSR.
 
   - Register the local APIC address only once in the MPPARSE enumeration to
     prevent triggering the related WARN_ONs() in the APIC and topology code.
 
   - Handle the case where no APIC is found gracefully by registering a fake
     APIC in the topology code. That makes all related topology functions
     work correctly and does not affect the actual APIC driver code at all.
 
   - Don't evaluate logical IDs during early boot as the local APIC IDs are
     not yet enumerated and the invoked function returns an error
     code. Nothing requires the logical IDs before the final CPUID
     enumeration takes place, which happens after the enumeration.
 
   - Cure the fallout of the per CPU rework on UP which misplaced the
     copying of boot_cpu_data to per CPU data so that the final update to
     boot_cpu_data got lost which caused inconsistent state and boot
     crashes.
 
   - Use copy_from_kernel_nofault() in the kprobes setup as there is no
     guarantee that the address can be safely accessed.
 
   - Reorder struct members in struct saved_context to work around another
     kmemleak false positive
 
   - Remove the buggy code which tries to update the E820 kexec table for
     setup_data as that is never passed to the kexec kernel.
 
   - Update the resource control documentation to use the proper units.
 
   - Fix a Kconfig warning observed with tinyconfig
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmYAUH4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXzREAC/HVB7yzUEbjbh7dyYRBEgFU19bcyC
 JKf9HVmEHj03HstUxF1dxguUhwfHVPNTWpjmy/fRwxqgM9JG+QpV6T4DIldWqchv
 AUYFrQBMvql8hTKxRa/Ny75d2IqKPgEEGUuyU+ZHAzEEPwhKrbtVRDPuEiMxpd5I
 9B1Pya4EzUyOv1UhPIg7PRoya1msimBZ0mCw4In6ri6xVRm1uC3Ln4LZPylxn96l
 f77rz5UToUw0gfgDaezF0z4ml1phGEdSX0Z3hhD0PX12wbJGEdvPzL0qTgEq72Ad
 AeLmHx4K8z2zoHMHK7iTEwjoplQxGsWLoezh22cVEEJX0dtzHz6R0ftBCa6uzATJ
 C8FF1oDDHAhTL94YmVSTZHr6AdJ6LwgYHO3zXZUhxuB7PNXAT4FmT0zgU1fU3sC1
 U/1mIFdgOEUOlGll2Ra5uTUKc0K/dc+yC9dcbz37Kwj3KlfqTN+5BWocjySkHomr
 gcv37aU1TJGSC/D1lYWTDWGKVbbP5lk+KIGICT5SBKn0METa/wOo8dE6+T1kIwvS
 t2QTlJdzilLcWGVQ8GiNjjRxFtRKY5i9Shi4K+wUvCee4/XJzRrpxrCEY8w/qceV
 hc3kfUIon3TCv8+rnlSuNRZBvmFhXMYwMt0gQv4YywB+aOITKTzbGUOazLtRNKAH
 lFCnBRS55AB8mg==
 =WyQ2
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:

 - Ensure that the encryption mask at boot is properly propagated on
   5-level page tables, otherwise the PGD entry is incorrectly set to
   non-encrypted, which causes system crashes during boot.

 - Undo the deferred 5-level page table setup as it cannot work with
   memory encryption enabled.

 - Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset
   to the default value but the cached variable is not, so subsequent
   comparisons might yield the wrong result and as a consequence the
   result prevents updating the MSR.

 - Register the local APIC address only once in the MPPARSE enumeration
   to prevent triggering the related WARN_ONs() in the APIC and topology
   code.

 - Handle the case where no APIC is found gracefully by registering a
   fake APIC in the topology code. That makes all related topology
   functions work correctly and does not affect the actual APIC driver
   code at all.

 - Don't evaluate logical IDs during early boot as the local APIC IDs
   are not yet enumerated and the invoked function returns an error
   code. Nothing requires the logical IDs before the final CPUID
   enumeration takes place, which happens after the enumeration.

 - Cure the fallout of the per CPU rework on UP which misplaced the
   copying of boot_cpu_data to per CPU data so that the final update to
   boot_cpu_data got lost which caused inconsistent state and boot
   crashes.

 - Use copy_from_kernel_nofault() in the kprobes setup as there is no
   guarantee that the address can be safely accessed.

 - Reorder struct members in struct saved_context to work around another
   kmemleak false positive

 - Remove the buggy code which tries to update the E820 kexec table for
   setup_data as that is never passed to the kexec kernel.

 - Update the resource control documentation to use the proper units.

 - Fix a Kconfig warning observed with tinyconfig

* tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot/64: Move 5-level paging global variable assignments back
  x86/boot/64: Apply encryption mask to 5-level pagetable update
  x86/cpu: Add model number for another Intel Arrow Lake mobile processor
  x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
  Documentation/x86: Document that resctrl bandwidth control units are MiB
  x86/mpparse: Register APIC address only once
  x86/topology: Handle the !APIC case gracefully
  x86/topology: Don't evaluate logical IDs during early boot
  x86/cpu: Ensure that CPU info updates are propagated on UP
  kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address
  x86/pm: Work around false positive kmemleak report in msr_build_context()
  x86/kexec: Do not update E820 kexec table for setup_data
  x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'
2024-03-24 11:13:56 -07:00
Tom Lendacky
9843231c97 x86/boot/64: Move 5-level paging global variable assignments back
Commit 63bed96604 ("x86/startup_64: Defer assignment of 5-level paging
global variables") moved assignment of 5-level global variables to later
in the boot in order to avoid having to use RIP relative addressing in
order to set them. However, when running with 5-level paging and SME
active (mem_encrypt=on), the variables are needed as part of the page
table setup needed to encrypt the kernel (using pgd_none(), p4d_offset(),
etc.). Since the variables haven't been set, the page table manipulation
is done as if 4-level paging is active, causing the system to crash on
boot.

While only a subset of the assignments that were moved need to be set
early, move all of the assignments back into check_la57_support() so that
these assignments aren't spread between two locations. Instead of just
reverting the fix, this uses the new RIP_REL_REF() macro when assigning
the variables.

Fixes: 63bed96604 ("x86/startup_64: Defer assignment of 5-level paging global variables")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/2ca419f4d0de719926fd82353f6751f717590a86.1711122067.git.thomas.lendacky@amd.com
2024-03-24 05:00:36 +01:00
Tom Lendacky
4d0d7e7852 x86/boot/64: Apply encryption mask to 5-level pagetable update
When running with 5-level page tables, the kernel mapping PGD entry is
updated to point to the P4D table. The assignment uses _PAGE_TABLE_NOENC,
which, when SME is active (mem_encrypt=on), results in a page table
entry without the encryption mask set, causing the system to crash on
boot.

Change the assignment to use _PAGE_TABLE instead of _PAGE_TABLE_NOENC so
that the encryption mask is set for the PGD entry.

Fixes: 533568e06b ("x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/8f20345cda7dbba2cf748b286e1bc00816fe649a.1711122067.git.thomas.lendacky@amd.com
2024-03-24 05:00:35 +01:00
Adamos Ttofari
10e4b5166d x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
Commit 672365477a ("x86/fpu: Update XFD state where required") and
commit 8bf26758ca ("x86/fpu: Add XFD state to fpstate") introduced a
per CPU variable xfd_state to keep the MSR_IA32_XFD value cached, in
order to avoid unnecessary writes to the MSR.

On CPU hotplug MSR_IA32_XFD is reset to the init_fpstate.xfd, which
wipes out any stale state. But the per CPU cached xfd value is not
reset, which brings them out of sync.

As a consequence a subsequent xfd_update_state() might fail to update
the MSR which in turn can result in XRSTOR raising a #NM in kernel
space, which crashes the kernel.

To fix this, introduce xfd_set_state() to write xfd_state together
with MSR_IA32_XFD, and use it in all places that set MSR_IA32_XFD.

Fixes: 672365477a ("x86/fpu: Update XFD state where required")
Signed-off-by: Adamos Ttofari <attofari@amazon.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240322230439.456571-1-chang.seok.bae@intel.com

Closes: https://lore.kernel.org/lkml/20230511152818.13839-1-attofari@amazon.de
2024-03-24 04:03:54 +01:00
Thomas Gleixner
f2208aa12c x86/mpparse: Register APIC address only once
The APIC address is registered twice. First during the early detection and
afterwards when actually scanning the table for APIC IDs. The APIC and
topology core warn about the second attempt.

Restrict it to the early detection call.

Fixes: 81287ad65d ("x86/apic: Sanitize APIC address setup")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20240322185305.297774848@linutronix.de
2024-03-23 12:41:48 +01:00
Thomas Gleixner
5e25eb25da x86/topology: Handle the !APIC case gracefully
If there is no local APIC enumerated and registered then the topology
bitmaps are empty. Therefore, topology_init_possible_cpus() will die with
a division by zero exception.

Prevent this by registering a fake APIC id to populate the topology
bitmap. This also allows to use all topology query interfaces
unconditionally. It does not affect the actual APIC code because either
the local APIC address was not registered or no local APIC could be
detected.

Fixes: f1f758a805 ("x86/topology: Add a mechanism to track topology via APIC IDs")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20240322185305.242709302@linutronix.de
2024-03-23 12:35:56 +01:00
Thomas Gleixner
7af541cee1 x86/topology: Don't evaluate logical IDs during early boot
The local APICs have not yet been enumerated so the logical ID evaluation
from the topology bitmaps does not work and would return an error code.

Skip the evaluation during the early boot CPUID evaluation and only apply
it on the final run.

Fixes: 380414be78 ("x86/cpu/topology: Use topology logical mapping mechanism")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20240322185305.186943142@linutronix.de
2024-03-23 12:28:06 +01:00
Thomas Gleixner
c90399fbd7 x86/cpu: Ensure that CPU info updates are propagated on UP
The boot sequence evaluates CPUID information twice:

  1) During early boot

  2) When finalizing the early setup right before
     mitigations are selected and alternatives are patched.

In both cases the evaluation is stored in boot_cpu_data, but on UP the
copying of boot_cpu_data to the per CPU info of the boot CPU happens
between #1 and #2. So any update which happens in #2 is never propagated to
the per CPU info instance.

Consolidate the whole logic and copy boot_cpu_data right before applying
alternatives as that's the point where boot_cpu_data is in it's final
state and not supposed to change anymore.

This also removes the voodoo mb() from smp_prepare_cpus_common() which
had absolutely no purpose.

Fixes: 71eb4893cf ("x86/percpu: Cure per CPU madness on UP")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20240322185305.127642785@linutronix.de
2024-03-23 12:22:04 +01:00
Masami Hiramatsu (Google)
4e51653d5d kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address
Read from an unsafe address with copy_from_kernel_nofault() in
arch_adjust_kprobe_addr() because this function is used before checking
the address is in text or not. Syzcaller bot found a bug and reported
the case if user specifies inaccessible data area,
arch_adjust_kprobe_addr() will cause a kernel panic.

[ mingo: Clarified the comment. ]

Fixes: cc66bb9145 ("x86/ibt,kprobes: Cure sym+0 equals fentry woes")
Reported-by: Qiang Zhang <zzqq0103.hey@gmail.com>
Tested-by: Jinghao Jia <jinghao7@illinois.edu>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/171042945004.154897.2221804961882915806.stgit@devnote2
2024-03-22 11:40:56 +01:00
Thomas Gleixner
63edbaa48a x86/cpu/topology: Add support for the AMD 0x80000026 leaf
On AMD processors that support extended CPUID leaf 0x80000026, use the
extended leaf to parse the topology information. In case of a failure,
fall back to parsing the information from CPUID leaf 0xb.

CPUID leaf 0x80000026 exposes the "CCX" and "CCD (Die)" information on
AMD processors which have been mapped to TOPO_TILE_DOMAIN and
TOPO_DIE_DOMAIN respectively.

Since this information was previously not available via CPUID leaf 0xb
or 0x8000001e, the "die_id", "logical_die_id", "max_die_per_pkg",
"die_cpus", and "die_cpus_list" will differ with this addition on
AMD processors that support extended CPUID leaf 0x80000026 and contain
more than one "CCD (Die)" on the package.

For example, following are the changes in the values reported by
"/sys/kernel/debug/x86/topo/cpus/16" after applying this patch on a 4th
Generation AMD EPYC System (1 x 128C/256T):

  (CPU16 is the first CPU of the second CCD on the package)

		   tip:x86/apic      tip:x86/apic
				     + this patch

  online:              1                  1
  initial_apicid:      80                 80
  apicid:              80                 80
  pkg_id:              0                  0
  die_id:              0                  4       *
  cu_id:               255                255
  core_id:             64                 64
  logical_pkg_id:      0                  0
  logical_die_id:      0                  4       *
  llc_id:              8                  8
  l2c_id:              65535              65535
  amd_node_id:         0                  0
  amd_nodes_per_pkg:   1                  1
  num_threads:         256                256
  num_cores:           128                128
  max_dies_per_pkg:    1                  8       *
  max_threads_per_core:2                  2

[ prateek: commit log, updated comment in topoext_amd.c, changed has_0xb
  to has_topoext, rebased the changes on tip:x86/apic, tested the
  changes on 4th Gen AMD EPYC system ]

[ mingo: tidy up the changelog a bit more ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240314050432.1710-1-kprateek.nayak@amd.com
2024-03-22 11:22:14 +01:00
Valentin Schneider
79a4567b2e x86/tsc: Make __use_tsc __ro_after_init
__use_tsc is only ever enabled in __init tsc_enable_sched_clock(), so mark
it as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240313180106.2917308-5-vschneid@redhat.com
2024-03-22 11:18:20 +01:00
Valentin Schneider
ddd8afacc4 x86/kvm: Make kvm_async_pf_enabled __ro_after_init
kvm_async_pf_enabled is only ever enabled in __init kvm_guest_init(), so
mark it as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240313180106.2917308-4-vschneid@redhat.com
2024-03-22 11:18:19 +01:00
H.J. Lu
2883f01ec3 x86/shstk: Enable shadow stacks for x32
1. Add shadow stack support to x32 signal.
2. Use the 64-bit map_shadow_stack syscall for x32.
3. Set up shadow stack for x32.

Tested with shadow stack enabled x32 glibc on Intel Tiger Lake:

I configured x32 glibc with --enable-cet, build glibc and
run all glibc tests with shadow stack enabled.  There are
no regressions.  I verified that shadow stack is enabled
via /proc/pid/status.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: H.J. Lu <hjl.tools@gmail.com>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20240315140433.1966543-1-hjl.tools@gmail.com
2024-03-22 10:17:11 +01:00
Dave Young
fc7f27cda8 x86/kexec: Do not update E820 kexec table for setup_data
crashkernel reservation failed on a Thinkpad t440s laptop recently.
Actually the memblock reservation succeeded, but later insert_resource()
failed.

Test steps:
  kexec load -> /* make sure add crashkernel param eg. crashkernel=160M */
    kexec reboot ->
        dmesg|grep "crashkernel reserved";
            crashkernel memory range like below reserved successfully:
              0x00000000d0000000 - 0x00000000da000000
        But no such "Crash kernel" region in /proc/iomem

The background story:

Currently the E820 code reserves setup_data regions for both the current
kernel and the kexec kernel, and it inserts them into the resources list.

Before the kexec kernel reboots nobody passes the old setup_data, and
kexec only passes fresh SETUP_EFI/SETUP_IMA/SETUP_RNG_SEED if needed.
Thus the old setup data memory is not used at all.

Due to old kernel updates the kexec e820 table as well so kexec kernel
sees them as E820_TYPE_RESERVED_KERN regions, and later the old setup_data
regions are inserted into resources list in the kexec kernel by
e820__reserve_resources().

Note, due to no setup_data is passed in for those old regions they are not
early reserved (by function early_reserve_memory), and the crashkernel
memblock reservation will just treat them as usable memory and it could
reserve the crashkernel region which overlaps with the old setup_data
regions. And just like the bug I noticed here, kdump insert_resource
failed because e820__reserve_resources has added the overlapped chunks
in /proc/iomem already.

Finally, looking at the code, the old setup_data regions are not used
at all as no setup_data is passed in by the kexec boot loader. Although
something like SETUP_PCI etc could be needed, kexec should pass
the info as new setup_data so that kexec kernel can take care of them.
This should be taken care of in other separate patches if needed.

Thus drop the useless buggy code here.

Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Eric DeVolder <eric.devolder@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/Zf0T3HCG-790K-pZ@darkstar.users.ipa.redhat.com
2024-03-22 10:07:45 +01:00
Brian Gerst
e2d168328e x86/syscall/compat: Remove ia32_unistd.h
This header is now just a wrapper for unistd_32_ia32.h.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240321211847.132473-3-brgerst@gmail.com
2024-03-22 09:37:09 +01:00
Xin Li (Intel)
8f69cba096 x86: Rename __{start,end}_init_task to __{start,end}_init_stack
The stack of a task has been separated from the memory of a task_struct
struture for a long time on x86, as a result __{start,end}_init_task no
longer mark the start and end of the init_task structure, but its stack
only.

Rename __{start,end}_init_task to __{start,end}_init_stack.

Note other architectures are not affected because __{start,end}_init_task
are used on x86 only.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20240322081616.3346181-1-xin@zytor.com
2024-03-22 09:32:41 +01:00
Borislav Petkov (AMD)
95bfb35269 x86/cpu: Get rid of an unnecessary local variable in get_cpu_address_sizes()
Drop 'vp_bits_from_cpuid' as it is not really needed.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240316120706.4352-1-bp@alien8.de
2024-03-21 21:13:56 +01:00
Rafael J. Wysocki
edf66a3c76 x86/cpu: Move leftover contents of topology.c to setup.c
The only useful piece of arch/x86/kernel/topology.c is the definition
of arch_cpu_is_hotpluggable() that can be moved elsewhere (other
architectures tend to put it into setup.c), so do that and delete
the rest of the file.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/12422874.O9o76ZdvQC@kreacher
2024-03-21 20:47:40 +01:00
Brian Gerst
2cb16181a1 x86/boot: Simplify boot stack setup
Define the symbol __top_init_kernel_stack instead of duplicating
the offset from __end_init_task in multiple places.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20240321180506.89030-1-brgerst@gmail.com
2024-03-21 20:17:54 +01:00
Linus Torvalds
cfce216e14 hyperv-next for v6.9
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmX7sYwTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXiMeCADAUfjuJyU1jrQxjXv0U9u0tng77FAt
 iT3+YFLR2Y4l8KRjD6Tpyk4fl/VN5VbJv1zPtSdNaViyri15gJjV7iMPujkx/pqO
 pxNfbOVZG7VeKMrudJzP2BHN2mAf8N0qyuVTFyMwLO5EtJrY44t4PtkA1r5cO6Pc
 eyoJWBofxH7XjkhOAMk4I3LXZMrq+hmtJ31G3eek6v/VjD1PtxU4f6/gJiqK9fz6
 ssvSfII0aCIKman5sYlhl11TO8omz/68L4db25ZLDSCdOrE5ZlQykmUshluuoesw
 eTUiuUZEh1O42Lsq7/hdUh+dSVGdTLHa9NKRQyWcruZiZ1idoZIA74ZW
 =4vOw
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-next-signed-20240320' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

 - Use Hyper-V entropy to seed guest random number generator (Michael
   Kelley)

 - Convert to platform remove callback returning void for vmbus (Uwe
   Kleine-König)

 - Introduce hv_get_hypervisor_version function (Nuno Das Neves)

 - Rename some HV_REGISTER_* defines for consistency (Nuno Das Neves)

 - Change prefix of generic HV_REGISTER_* MSRs to HV_MSR_* (Nuno Das
   Neves)

 - Cosmetic changes for hv_spinlock.c (Purna Pavan Chandra Aekkaladevi)

 - Use per cpu initial stack for vtl context (Saurabh Sengar)

* tag 'hyperv-next-signed-20240320' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  x86/hyperv: Use Hyper-V entropy to seed guest random number generator
  x86/hyperv: Cosmetic changes for hv_spinlock.c
  hyperv-tlfs: Rename some HV_REGISTER_* defines for consistency
  hv: vmbus: Convert to platform remove callback returning void
  mshyperv: Introduce hv_get_hypervisor_version function
  x86/hyperv: Use per cpu initial stack for vtl context
  hyperv-tlfs: Change prefix of generic HV_REGISTER_* MSRs to HV_MSR_*
2024-03-21 10:01:02 -07:00
Uros Bizjak
5d31174f3c x86/fpu: Fix AMD X86_BUG_FXSAVE_LEAK fixup
The assembly snippet in restore_fpregs_from_fpstate() that implements
X86_BUG_FXSAVE_LEAK fixup loads the value from a random variable,
preferably the one that is already in the L1 cache.

However, the access to fpinit_state via *fpstate pointer is not
implemented correctly. The "m" asm constraint requires dereferenced
pointer variable, otherwise the compiler just reloads the value
via temporary stack slot. The current asm code reflects this:

     mov    %rdi,(%rsp)
     ...
     fildl  (%rsp)

With dereferenced pointer variable, the code does what the
comment above the asm snippet says:

     fildl  (%rdi)

Also, remove the pointless %P operand modifier. The modifier is
ineffective on non-symbolic references - it was used to prevent
%rip-relative addresses in .altinstr sections, but FILDL in the
.text section can use %rip-relative addresses without problems.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240315081849.5187-1-ubizjak@gmail.com
2024-03-19 14:02:29 +01:00
Paolo Bonzini
c822a075ab Guest-side KVM async #PF ABI cleanup for 6.9
Delete kvm_vcpu_pv_apf_data.enabled to fix a goof in KVM's async #PF ABI where
 the enabled field pushes the size of "struct kvm_vcpu_pv_apf_data" from 64 to
 68 bytes, i.e. beyond a single cache line.
 
 The enabled field is purely a guest-side flag that Linux-as-a-guest uses to
 track whether or not the guest has enabled async #PF support.  The actual flag
 that is passed to the host, i.e. to KVM proper, is a single bit in a synthetic
 MSR, MSR_KVM_ASYNC_PF_EN, i.e. is in a location completely unrelated to the
 shared kvm_vcpu_pv_apf_data structure.
 
 Simply drop the the field and use a dedicated guest-side per-CPU variable to
 fix the ABI, as opposed to fixing the documentation to match reality.  KVM has
 never consumed kvm_vcpu_pv_apf_data.enabled, so the odds of the ABI change
 breaking anything are extremely low.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXZBsMACgkQOlYIJqCj
 N/1H5g/8CgK81MpaTI4CsCf0rwD4orhmghAnJmllJHi676dteUm7gYzbDE8wajym
 rS7gtJwqe6cnK7hJt7SH31sfDEhYds43wD7o6VrLewjWCgaZ7YilYb+qJhzGOUt5
 OxQwzZu/57hOhXFFS7P7ZamgkiQu05IYLuK5BSWQbsuMLaGkA+uWoNKopr5588VW
 MQhR4jVCQSEdgYakgpy+TjWVi4/usiHHCFhcGV54ErKAKL/nCjyUOrgApINTzawQ
 Czh3ZAKMo6UanHOB6lZACc3MdSOTooDnIItzWOFDMJSLW376tmC70OGI42qi3ht6
 CB5zoUN9p4WyQkb7BluJ40PTmpNPEQQVglmU0bjVAKuGmDZ6YgkQ1OWAap6mH+q1
 JOzuFgXMXP+aCYXfeZYHedmPsqW+BJ4dd9vOtnoFE7sgCMye26gFb45wbuTWPFpX
 LcjykG6YUJJI/LcIc3i68onHPn7RI9XXOIVCyAh39zclCPkIKrlI8RKMlg2yBIdv
 pkLYHUsXRJ+02GHd7YQGFe6ph1rHs3P5LsNoUh8cLetGharww2fqpuAVDwftMvAg
 MG3zgA6BGv4bpHDNjGPEh+3g36d9C6hOheek2Wgjwy7zF6JxQme4UsXzecqETT5o
 j7LxLfjUaPzAvfTlGA9jZYO3X7tqpJomj1YxQQEd2p/36nGR+3k=
 =3ujw
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-asyncpf_abi-6.9' of https://github.com/kvm-x86/linux into HEAD

Guest-side KVM async #PF ABI cleanup for 6.9

Delete kvm_vcpu_pv_apf_data.enabled to fix a goof in KVM's async #PF ABI where
the enabled field pushes the size of "struct kvm_vcpu_pv_apf_data" from 64 to
68 bytes, i.e. beyond a single cache line.

The enabled field is purely a guest-side flag that Linux-as-a-guest uses to
track whether or not the guest has enabled async #PF support.  The actual flag
that is passed to the host, i.e. to KVM proper, is a single bit in a synthetic
MSR, MSR_KVM_ASYNC_PF_EN, i.e. is in a location completely unrelated to the
shared kvm_vcpu_pv_apf_data structure.

Simply drop the the field and use a dedicated guest-side per-CPU variable to
fix the ABI, as opposed to fixing the documentation to match reality.  KVM has
never consumed kvm_vcpu_pv_apf_data.enabled, so the odds of the ABI change
breaking anything are extremely low.
2024-03-18 19:03:42 -04:00
Michael Kelley
f2580a907e x86/hyperv: Use Hyper-V entropy to seed guest random number generator
A Hyper-V host provides its guest VMs with entropy in a custom ACPI
table named "OEM0".  The entropy bits are updated each time Hyper-V
boots the VM, and are suitable for seeding the Linux guest random
number generator (rng). See a brief description of OEM0 in [1].

Generation 2 VMs on Hyper-V use UEFI to boot. Existing EFI code in
Linux seeds the rng with entropy bits from the EFI_RNG_PROTOCOL.
Via this path, the rng is seeded very early during boot with good
entropy. The ACPI OEM0 table provided in such VMs is an additional
source of entropy.

Generation 1 VMs on Hyper-V boot from BIOS. For these VMs, Linux
doesn't currently get any entropy from the Hyper-V host. While this
is not fundamentally broken because Linux can generate its own entropy,
using the Hyper-V host provided entropy would get the rng off to a
better start and would do so earlier in the boot process.

Improve the rng seeding for Generation 1 VMs by having Hyper-V specific
code in Linux take advantage of the OEM0 table to seed the rng. For
Generation 2 VMs, use the OEM0 table to provide additional entropy
beyond the EFI_RNG_PROTOCOL. Because the OEM0 table is custom to
Hyper-V, parse it directly in the Hyper-V code in the Linux kernel
and use add_bootloader_randomness() to add it to the rng. Once the
entropy bits are read from OEM0, zero them out in the table so
they don't appear in /sys/firmware/acpi/tables/OEM0 in the running
VM. The zero'ing is done out of an abundance of caution to avoid
potential security risks to the rng. Also set the OEM0 data length
to zero so a kexec or other subsequent use of the table won't try
to use the zero'ed bits.

[1] https://download.microsoft.com/download/1/c/9/1c9813b8-089c-4fef-b2ad-ad80e79403ba/Whitepaper%20-%20The%20Windows%2010%20random%20number%20generation%20infrastructure.pdf

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://lore.kernel.org/r/20240318155408.216851-1-mhklinux@outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20240318155408.216851-1-mhklinux@outlook.com>
2024-03-18 22:01:52 +00:00
Nuno Das Neves
b967df6293 hyperv-tlfs: Rename some HV_REGISTER_* defines for consistency
Rename HV_REGISTER_GUEST_OSID to HV_REGISTER_GUEST_OS_ID. This matches
the existing HV_X64_MSR_GUEST_OS_ID.

Rename HV_REGISTER_CRASH_* to HV_REGISTER_GUEST_CRASH_*. Including
GUEST_ is consistent with other #defines such as
HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE. The new names also match the TLFS
document more accurately, i.e. HvRegisterGuestCrash*.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Link: https://lore.kernel.org/r/1710285687-9160-1-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1710285687-9160-1-git-send-email-nunodasneves@linux.microsoft.com>
2024-03-18 04:58:49 +00:00
Borislav Petkov (AMD)
5c84b051bd x86/CPU/AMD: Update the Zenbleed microcode revisions
Update them to the correct revision numbers.

Fixes: 522b1d6921 ("x86/cpu/amd: Add a Zenbleed fix")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-16 09:04:09 -07:00
Linus Torvalds
4f712ee0cb S390:
* Changes to FPU handling came in via the main s390 pull request
 
 * Only deliver to the guest the SCLP events that userspace has
   requested.
 
 * More virtual vs physical address fixes (only a cleanup since
   virtual and physical address spaces are currently the same).
 
 * Fix selftests undefined behavior.
 
 x86:
 
 * Fix a restriction that the guest can't program a PMU event whose
   encoding matches an architectural event that isn't included in the
   guest CPUID.  The enumeration of an architectural event only says
   that if a CPU supports an architectural event, then the event can be
   programmed *using the architectural encoding*.  The enumeration does
   NOT say anything about the encoding when the CPU doesn't report support
   the event *in general*.  It might support it, and it might support it
   using the same encoding that made it into the architectural PMU spec.
 
 * Fix a variety of bugs in KVM's emulation of RDPMC (more details on
   individual commits) and add a selftest to verify KVM correctly emulates
   RDMPC, counter availability, and a variety of other PMC-related
   behaviors that depend on guest CPUID and therefore are easier to
   validate with selftests than with custom guests (aka kvm-unit-tests).
 
 * Zero out PMU state on AMD if the virtual PMU is disabled, it does not
   cause any bug but it wastes time in various cases where KVM would check
   if a PMC event needs to be synthesized.
 
 * Optimize triggering of emulated events, with a nice ~10% performance
   improvement in VM-Exit microbenchmarks when a vPMU is exposed to the
   guest.
 
 * Tighten the check for "PMI in guest" to reduce false positives if an NMI
   arrives in the host while KVM is handling an IRQ VM-Exit.
 
 * Fix a bug where KVM would report stale/bogus exit qualification information
   when exiting to userspace with an internal error exit code.
 
 * Add a VMX flag in /proc/cpuinfo to report 5-level EPT support.
 
 * Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
   read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
 
 * Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB).  KVM
   doesn't support yielding in the middle of processing a zap, and 1GiB
   granularity resulted in multi-millisecond lags that are quite impolite
   for CONFIG_PREEMPT kernels.
 
 * Allocate write-tracking metadata on-demand to avoid the memory overhead when
   a kernel is built with i915 virtualization support but the workloads use
   neither shadow paging nor i915 virtualization.
 
 * Explicitly initialize a variety of on-stack variables in the emulator that
   triggered KMSAN false positives.
 
 * Fix the debugregs ABI for 32-bit KVM.
 
 * Rework the "force immediate exit" code so that vendor code ultimately decides
   how and when to force the exit, which allowed some optimization for both
   Intel and AMD.
 
 * Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
   vCPU creation ultimately failed, causing extra unnecessary work.
 
 * Cleanup the logic for checking if the currently loaded vCPU is in-kernel.
 
 * Harden against underflowing the active mmu_notifier invalidation
   count, so that "bad" invalidations (usually due to bugs elsehwere in the
   kernel) are detected earlier and are less likely to hang the kernel.
 
 x86 Xen emulation:
 
 * Overlay pages can now be cached based on host virtual address,
   instead of guest physical addresses.  This removes the need to
   reconfigure and invalidate the cache if the guest changes the
   gpa but the underlying host virtual address remains the same.
 
 * When possible, use a single host TSC value when computing the deadline for
   Xen timers in order to improve the accuracy of the timer emulation.
 
 * Inject pending upcall events when the vCPU software-enables its APIC to fix
   a bug where an upcall can be lost (and to follow Xen's behavior).
 
 * Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
   events fails, e.g. if the guest has aliased xAPIC IDs.
 
 RISC-V:
 
 * Support exception and interrupt handling in selftests
 
 * New self test for RISC-V architectural timer (Sstc extension)
 
 * New extension support (Ztso, Zacas)
 
 * Support userspace emulation of random number seed CSRs.
 
 ARM:
 
 * Infrastructure for building KVM's trap configuration based on the
   architectural features (or lack thereof) advertised in the VM's ID
   registers
 
 * Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
   x86's WC) at stage-2, improving the performance of interacting with
   assigned devices that can tolerate it
 
 * Conversion of KVM's representation of LPIs to an xarray, utilized to
   address serialization some of the serialization on the LPI injection
   path
 
 * Support for _architectural_ VHE-only systems, advertised through the
   absence of FEAT_E2H0 in the CPU's ID register
 
 * Miscellaneous cleanups, fixes, and spelling corrections to KVM and
   selftests
 
 LoongArch:
 
 * Set reserved bits as zero in CPUCFG.
 
 * Start SW timer only when vcpu is blocking.
 
 * Do not restart SW timer when it is expired.
 
 * Remove unnecessary CSR register saving during enter guest.
 
 * Misc cleanups and fixes as usual.
 
 Generic:
 
 * cleanup Kconfig by removing CONFIG_HAVE_KVM, which was basically always
   true on all architectures except MIPS (where Kconfig determines the
   available depending on CPU capabilities).  It is replaced either by
   an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM)
   everywhere else.
 
 * Factor common "select" statements in common code instead of requiring
   each architecture to specify it
 
 * Remove thoroughly obsolete APIs from the uapi headers.
 
 * Move architecture-dependent stuff to uapi/asm/kvm.h
 
 * Always flush the async page fault workqueue when a work item is being
   removed, especially during vCPU destruction, to ensure that there are no
   workers running in KVM code when all references to KVM-the-module are gone,
   i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded.
 
 * Grab a reference to the VM's mm_struct in the async #PF worker itself instead
   of gifting the worker a reference, so that there's no need to remember
   to *conditionally* clean up after the worker.
 
 Selftests:
 
 * Reduce boilerplate especially when utilize selftest TAP infrastructure.
 
 * Add basic smoke tests for SEV and SEV-ES, along with a pile of library
   support for handling private/encrypted/protected memory.
 
 * Fix benign bugs where tests neglect to close() guest_memfd files.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmX0iP8UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroND7wf+JZoNvwZ+bmwWe/4jn/YwNoYi/C5z
 eypn8M1gsWEccpCpqPBwznVm9T29rF4uOlcMvqLEkHfTpaL1EKUUjP1lXPz/ileP
 6a2RdOGxAhyTiFC9fjy+wkkjtLbn1kZf6YsS0hjphP9+w0chNbdn0w81dFVnXryd
 j7XYI8R/bFAthNsJOuZXSEjCfIHxvTTG74OrTf1B1FEBB+arPmrgUeJftMVhffQK
 Sowgg8L/Ii/x6fgV5NZQVSIyVf1rp8z7c6UaHT4Fwb0+RAMW8p9pYv9Qp1YkKp8y
 5j0V9UzOHP7FRaYimZ5BtwQoqiZXYylQ+VuU/Y2f4X85cvlLzSqxaEMAPA==
 =mqOV
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "S390:

   - Changes to FPU handling came in via the main s390 pull request

   - Only deliver to the guest the SCLP events that userspace has
     requested

   - More virtual vs physical address fixes (only a cleanup since
     virtual and physical address spaces are currently the same)

   - Fix selftests undefined behavior

  x86:

   - Fix a restriction that the guest can't program a PMU event whose
     encoding matches an architectural event that isn't included in the
     guest CPUID. The enumeration of an architectural event only says
     that if a CPU supports an architectural event, then the event can
     be programmed *using the architectural encoding*. The enumeration
     does NOT say anything about the encoding when the CPU doesn't
     report support the event *in general*. It might support it, and it
     might support it using the same encoding that made it into the
     architectural PMU spec

   - Fix a variety of bugs in KVM's emulation of RDPMC (more details on
     individual commits) and add a selftest to verify KVM correctly
     emulates RDMPC, counter availability, and a variety of other
     PMC-related behaviors that depend on guest CPUID and therefore are
     easier to validate with selftests than with custom guests (aka
     kvm-unit-tests)

   - Zero out PMU state on AMD if the virtual PMU is disabled, it does
     not cause any bug but it wastes time in various cases where KVM
     would check if a PMC event needs to be synthesized

   - Optimize triggering of emulated events, with a nice ~10%
     performance improvement in VM-Exit microbenchmarks when a vPMU is
     exposed to the guest

   - Tighten the check for "PMI in guest" to reduce false positives if
     an NMI arrives in the host while KVM is handling an IRQ VM-Exit

   - Fix a bug where KVM would report stale/bogus exit qualification
     information when exiting to userspace with an internal error exit
     code

   - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support

   - Rework TDP MMU root unload, free, and alloc to run with mmu_lock
     held for read, e.g. to avoid serializing vCPUs when userspace
     deletes a memslot

   - Tear down TDP MMU page tables at 4KiB granularity (used to be
     1GiB). KVM doesn't support yielding in the middle of processing a
     zap, and 1GiB granularity resulted in multi-millisecond lags that
     are quite impolite for CONFIG_PREEMPT kernels

   - Allocate write-tracking metadata on-demand to avoid the memory
     overhead when a kernel is built with i915 virtualization support
     but the workloads use neither shadow paging nor i915 virtualization

   - Explicitly initialize a variety of on-stack variables in the
     emulator that triggered KMSAN false positives

   - Fix the debugregs ABI for 32-bit KVM

   - Rework the "force immediate exit" code so that vendor code
     ultimately decides how and when to force the exit, which allowed
     some optimization for both Intel and AMD

   - Fix a long-standing bug where kvm_has_noapic_vcpu could be left
     elevated if vCPU creation ultimately failed, causing extra
     unnecessary work

   - Cleanup the logic for checking if the currently loaded vCPU is
     in-kernel

   - Harden against underflowing the active mmu_notifier invalidation
     count, so that "bad" invalidations (usually due to bugs elsehwere
     in the kernel) are detected earlier and are less likely to hang the
     kernel

  x86 Xen emulation:

   - Overlay pages can now be cached based on host virtual address,
     instead of guest physical addresses. This removes the need to
     reconfigure and invalidate the cache if the guest changes the gpa
     but the underlying host virtual address remains the same

   - When possible, use a single host TSC value when computing the
     deadline for Xen timers in order to improve the accuracy of the
     timer emulation

   - Inject pending upcall events when the vCPU software-enables its
     APIC to fix a bug where an upcall can be lost (and to follow Xen's
     behavior)

   - Fall back to the slow path instead of warning if "fast" IRQ
     delivery of Xen events fails, e.g. if the guest has aliased xAPIC
     IDs

  RISC-V:

   - Support exception and interrupt handling in selftests

   - New self test for RISC-V architectural timer (Sstc extension)

   - New extension support (Ztso, Zacas)

   - Support userspace emulation of random number seed CSRs

  ARM:

   - Infrastructure for building KVM's trap configuration based on the
     architectural features (or lack thereof) advertised in the VM's ID
     registers

   - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
     x86's WC) at stage-2, improving the performance of interacting with
     assigned devices that can tolerate it

   - Conversion of KVM's representation of LPIs to an xarray, utilized
     to address serialization some of the serialization on the LPI
     injection path

   - Support for _architectural_ VHE-only systems, advertised through
     the absence of FEAT_E2H0 in the CPU's ID register

   - Miscellaneous cleanups, fixes, and spelling corrections to KVM and
     selftests

  LoongArch:

   - Set reserved bits as zero in CPUCFG

   - Start SW timer only when vcpu is blocking

   - Do not restart SW timer when it is expired

   - Remove unnecessary CSR register saving during enter guest

   - Misc cleanups and fixes as usual

  Generic:

   - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically
     always true on all architectures except MIPS (where Kconfig
     determines the available depending on CPU capabilities). It is
     replaced either by an architecture-dependent symbol for MIPS, and
     IS_ENABLED(CONFIG_KVM) everywhere else

   - Factor common "select" statements in common code instead of
     requiring each architecture to specify it

   - Remove thoroughly obsolete APIs from the uapi headers

   - Move architecture-dependent stuff to uapi/asm/kvm.h

   - Always flush the async page fault workqueue when a work item is
     being removed, especially during vCPU destruction, to ensure that
     there are no workers running in KVM code when all references to
     KVM-the-module are gone, i.e. to prevent a very unlikely
     use-after-free if kvm.ko is unloaded

   - Grab a reference to the VM's mm_struct in the async #PF worker
     itself instead of gifting the worker a reference, so that there's
     no need to remember to *conditionally* clean up after the worker

  Selftests:

   - Reduce boilerplate especially when utilize selftest TAP
     infrastructure

   - Add basic smoke tests for SEV and SEV-ES, along with a pile of
     library support for handling private/encrypted/protected memory

   - Fix benign bugs where tests neglect to close() guest_memfd files"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
  selftests: kvm: remove meaningless assignments in Makefiles
  KVM: riscv: selftests: Add Zacas extension to get-reg-list test
  RISC-V: KVM: Allow Zacas extension for Guest/VM
  KVM: riscv: selftests: Add Ztso extension to get-reg-list test
  RISC-V: KVM: Allow Ztso extension for Guest/VM
  RISC-V: KVM: Forward SEED CSR access to user space
  KVM: riscv: selftests: Add sstc timer test
  KVM: riscv: selftests: Change vcpu_has_ext to a common function
  KVM: riscv: selftests: Add guest helper to get vcpu id
  KVM: riscv: selftests: Add exception handling support
  LoongArch: KVM: Remove unnecessary CSR register saving during enter guest
  LoongArch: KVM: Do not restart SW timer when it is expired
  LoongArch: KVM: Start SW timer only when vcpu is blocking
  LoongArch: KVM: Set reserved bits as zero in CPUCFG
  KVM: selftests: Explicitly close guest_memfd files in some gmem tests
  KVM: x86/xen: fix recursive deadlock in timer injection
  KVM: pfncache: simplify locking and make more self-contained
  KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
  KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
  KVM: x86/xen: improve accuracy of Xen timers
  ...
2024-03-15 13:03:13 -07:00
Linus Torvalds
ab522e1478 Devicetree updates for v6.9:
DT core:
 
 - Add cleanup.h based auto release of struct device_node pointers via
   __free marking and new for_each_child_of_node_scoped() iterator to use
   it.
 
 - Always create a base skeleton DT when CONFIG_OF is enabled. This
   supports several usecases of adding DT data on non-DT booted systems.
 
 - Move around some /reserved-memory code in preparation for further
   improvements
 
 - Add a stub for_each_property_of_node() for !OF
 
 - Adjust the printk levels on some messages
 
 - Fix __be32 sparse warning
 
 - Drop RESERVEDMEM_OF_DECLARE usage from Freescale qbman driver
   (currently orphaned)
 
 - Add Saravana Kannan and drop Frank Rowand as DT maintainers
 
 DT bindings:
 
 - Convert Mediatek timer, Mediatek sysirq, fsl,imx6ul-tsc,
   fsl,imx6ul-pinctrl, Atmel AIC, Atmel HLCDC, FPGA region, and
   xlnx,sd-fec to DT schemas
 
 - Add existing, but undocumented fsl,imx-anatop binding
 
 - Add bunch of undocumented vendor prefixes used in compatible strings
 
 - Drop obsolete brcm,bcm2835-pm-wdt binding
 
 - Drop obsolete i2c.txt which as been replaced with schema in dtschema
 
 - Add DPS310 device and sort trivial-devices.yaml
 
 - Enable undocumented compatible checks on DT binding examples
 
 - More QCom maintainer fixes/updates
 
 - Updates to writing-schema.rst and DT submitting-patches.rst to cover
   some frequent review comments
 
 - Clean-up SPDX tags to use 'OR' rather than 'or'
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEktVUI4SxYhzZyEuo+vtdtY28YcMFAmX0foEACgkQ+vtdtY28
 YcOkUg//T5Q+ZudVn/oJGre3crfPU4O/RHbG+brbwpBZEdiwTGlIjI8ceThjumCO
 MY25yRewCIZtS8MLlRb/lNPUjQxPeyYWnpO3KZHbOJhU8bJCl2M5P0CQOYJNp0fl
 fMFhFU5bKVoXyK6y3qx7ivZTXSBCz9KzB1HxY3LueMHVgWiO1Oi++XjLfcos86Mh
 7dKZKNbpcnBFkXiESMksQS+asZkoRtZloFg4iFjniSLa8AgYJLsZXd7iW4s0IXy+
 Xj+5IcIRcPv2xQoXfCvlcKMheJyePDA1coYpO8pmOYOpjCQzsCnnbzoNERW6hc9u
 0DF2IWnq9WLlQ8RVijbECRPgwW6zuU+aklUZLz2q0AiwCVySHaMdC9iYe+KK/7GH
 m0F21x5mpfK0LVfOMWLsmuqKWn9J164VAeTY9zHqcWuvCohD5ulftvQgRBEiSDtv
 V3l668t6v67iMkYa8SncbuMkV/NSShWPGne+yP3smvL0pe0P0MJYb1XSstlbNXuK
 whTDaCydEHx3JPJ6VS/1aJnELFm+uZVl8wjhfrgbWo2hIC83qjN3k0yV+vFNdFzT
 5PUfI858fvgYOrGsswYCCJXmb/s37NImCnIF/sjqvj50BA468261KYAFtapa2Vlj
 uvpKgIZHJEDOK6TPlk5n7+aaOwoLMYzm+yov/0gyRpRKqsXu52U=
 =YzNN
 -----END PGP SIGNATURE-----

Merge tag 'devicetree-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull devicetree updates from Rob Herring:
 "DT core:

   - Add cleanup.h based auto release of struct device_node pointers via
     __free marking and new for_each_child_of_node_scoped() iterator to
     use it.

   - Always create a base skeleton DT when CONFIG_OF is enabled. This
     supports several usecases of adding DT data on non-DT booted
     systems.

   - Move around some /reserved-memory code in preparation for further
     improvements

   - Add a stub for_each_property_of_node() for !OF

   - Adjust the printk levels on some messages

   - Fix __be32 sparse warning

   - Drop RESERVEDMEM_OF_DECLARE usage from Freescale qbman driver
     (currently orphaned)

   - Add Saravana Kannan and drop Frank Rowand as DT maintainers

  DT bindings:

   - Convert Mediatek timer, Mediatek sysirq, fsl,imx6ul-tsc,
     fsl,imx6ul-pinctrl, Atmel AIC, Atmel HLCDC, FPGA region, and
     xlnx,sd-fec to DT schemas

   - Add existing, but undocumented fsl,imx-anatop binding

   - Add bunch of undocumented vendor prefixes used in compatible
     strings

   - Drop obsolete brcm,bcm2835-pm-wdt binding

   - Drop obsolete i2c.txt which as been replaced with schema in
     dtschema

   - Add DPS310 device and sort trivial-devices.yaml

   - Enable undocumented compatible checks on DT binding examples

   - More QCom maintainer fixes/updates

   - Updates to writing-schema.rst and DT submitting-patches.rst to
     cover some frequent review comments

   - Clean-up SPDX tags to use 'OR' rather than 'or'"

* tag 'devicetree-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (56 commits)
  dt-bindings: soc: imx: fsl,imx-anatop: add imx6q regulators
  of: unittest: Use for_each_child_of_node_scoped()
  of: Introduce for_each_*_child_of_node_scoped() to automate of_node_put() handling
  of: Add cleanup.h based auto release via __free(device_node) markings
  of: Move all FDT reserved-memory handling into of_reserved_mem.c
  of: Add KUnit test to confirm DTB is loaded
  of: unittest: treat missing of_root as error instead of fixing up
  x86/of: Unconditionally call unflatten_and_copy_device_tree()
  um: Unconditionally call unflatten_device_tree()
  of: Create of_root if no dtb provided by firmware
  of: Always unflatten in unflatten_and_copy_device_tree()
  dt-bindings: timer: mediatek: Convert to json-schema
  dt-bindings: interrupt-controller: fsl,intmux: Include power-domains support
  soc: fsl: qbman: Remove RESERVEDMEM_OF_DECLARE usage
  dt-bindings: fsl-imx-sdma: fix HDMI audio index
  dt-bindings: soc: imx: fsl,imx-iomuxc-gpr: add imx6
  dt-bindings: soc: imx: fsl,imx-anatop: add binding
  dt-bindings: input: touchscreen: fsl,imx6ul-tsc convert to YAML
  dt-bindings: pinctrl: fsl,imx6ul-pinctrl: convert to YAML
  of: make for_each_property_of_node() available to to !OF
  ...
2024-03-15 12:37:59 -07:00
Linus Torvalds
902861e34c - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory.  Series
   "implement "memmap on memory" feature on s390".
 
 - More folio conversions from Matthew Wilcox in the series
 
 	"Convert memcontrol charge moving to use folios"
 	"mm: convert mm counter to take a folio"
 
 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes.  The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".
 
 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.
 
 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools".  Measured improvements are modest.
 
 - zswap cleanups and simplifications from Yosry Ahmed in the series "mm:
   zswap: simplify zswap_swapoff()".
 
 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is hotplugged
   as system memory.
 
 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.
 
 - More DAMON work from SeongJae Park in the series
 
 	"mm/damon: make DAMON debugfs interface deprecation unignorable"
 	"selftests/damon: add more tests for core functionalities and corner cases"
 	"Docs/mm/damon: misc readability improvements"
 	"mm/damon: let DAMOS feeds and tame/auto-tune itself"
 
 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving policy
   wherein we allocate memory across nodes in a weighted fashion rather
   than uniformly.  This is beneficial in heterogeneous memory environments
   appearing with CXL.
 
 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
 
 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".
 
 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format.  Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.
 
 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP".  Mainly
   targeted at arm64, this significantly speeds up fork() when the process
   has a large number of pte-mapped folios.
 
 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP".  It
   implements batching during munmap() and other pte teardown situations.
   The microbenchmark improvements are nice.
 
 - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan
   Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings").  Kernel build times on arm64 improved nicely.  Ryan's series
   "Address some contpte nits" provides some followup work.
 
 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page faults.
   He has also added a reproducer under the selftest code.
 
 - In the series "selftests/mm: Output cleanups for the compaction test",
   Mark Brown did what the title claims.
 
 - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring".
 
 - Even more zswap material from Nhat Pham.  The series "fix and extend
   zswap kselftests" does as claimed.
 
 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in
   our handling of DAX on archiecctures which have virtually aliasing data
   caches.  The arm architecture is the main beneficiary.
 
 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic
   improvements in worst-case mmap_lock hold times during certain
   userfaultfd operations.
 
 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series
 
 	"page_owner: print stacks and their outstanding allocations"
 	"page_owner: Fixup and cleanup"
 
 - Uladzislau Rezki has contributed some vmalloc scalability improvements
   in his series "Mitigate a vmap lock contention".  It realizes a 12x
   improvement for a certain microbenchmark.
 
 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".
 
 - Some zsmalloc maintenance work from Chengming Zhou in the series
 
 	"mm/zsmalloc: fix and optimize objects/page migration"
 	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
 
 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0.  This a step along the path to implementaton of the merging of
   large anonymous folios.  The series is named "Enable >0 order folio
   memory compaction".
 
 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages() to
   an iterator".
 
 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".
 
 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios.  The
   series is named "Split a folio to any lower order folios".
 
 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.
 
 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".
 
 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which are
   configured to use large numbers of hugetlb pages.
 
 - Matthew Wilcox's series "PageFlags cleanups" does that.
 
 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also.  S390 is affected.
 
 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".
 
 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM Selftests".
 
 - Also, of course, many singleton patches to many things.  Please see
   the individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA
 joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx
 TMNhHfyiHYDTx/GAV9NXW84tasJSDgA=
 =TG55
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
   from hotplugged memory rather than only from main memory. Series
   "implement "memmap on memory" feature on s390".

 - More folio conversions from Matthew Wilcox in the series

	"Convert memcontrol charge moving to use folios"
	"mm: convert mm counter to take a folio"

 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes. The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".

 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.

 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools". Measured improvements are modest.

 - zswap cleanups and simplifications from Yosry Ahmed in the series
   "mm: zswap: simplify zswap_swapoff()".

 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is
   hotplugged as system memory.

 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.

 - More DAMON work from SeongJae Park in the series

	"mm/damon: make DAMON debugfs interface deprecation unignorable"
	"selftests/damon: add more tests for core functionalities and corner cases"
	"Docs/mm/damon: misc readability improvements"
	"mm/damon: let DAMOS feeds and tame/auto-tune itself"

 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving
   policy wherein we allocate memory across nodes in a weighted fashion
   rather than uniformly. This is beneficial in heterogeneous memory
   environments appearing with CXL.

 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".

 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".

 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format. Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.

 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
   targeted at arm64, this significantly speeds up fork() when the
   process has a large number of pte-mapped folios.

 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
   implements batching during munmap() and other pte teardown
   situations. The microbenchmark improvements are nice.

 - And in the series "Transparent Contiguous PTEs for User Mappings"
   Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings"). Kernel build times on arm64 improved nicely. Ryan's
   series "Address some contpte nits" provides some followup work.

 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page
   faults. He has also added a reproducer under the selftest code.

 - In the series "selftests/mm: Output cleanups for the compaction
   test", Mark Brown did what the title claims.

 - Kinsey Ho has added the series "mm/mglru: code cleanup and
   refactoring".

 - Even more zswap material from Nhat Pham. The series "fix and extend
   zswap kselftests" does as claimed.

 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
   in our handling of DAX on archiecctures which have virtually aliasing
   data caches. The arm architecture is the main beneficiary.

 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides
   dramatic improvements in worst-case mmap_lock hold times during
   certain userfaultfd operations.

 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series

	"page_owner: print stacks and their outstanding allocations"
	"page_owner: Fixup and cleanup"

 - Uladzislau Rezki has contributed some vmalloc scalability
   improvements in his series "Mitigate a vmap lock contention". It
   realizes a 12x improvement for a certain microbenchmark.

 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".

 - Some zsmalloc maintenance work from Chengming Zhou in the series

	"mm/zsmalloc: fix and optimize objects/page migration"
	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"

 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0. This a step along the path to implementaton of the merging
   of large anonymous folios. The series is named "Enable >0 order folio
   memory compaction".

 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages()
   to an iterator".

 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".

 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios. The
   series is named "Split a folio to any lower order folios".

 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.

 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".

 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which
   are configured to use large numbers of hugetlb pages.

 - Matthew Wilcox's series "PageFlags cleanups" does that.

 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also. S390 is affected.

 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".

 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM
   Selftests".

 - Also, of course, many singleton patches to many things. Please see
   the individual changelogs for details.

* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
  mm/zswap: remove the memcpy if acomp is not sleepable
  crypto: introduce: acomp_is_async to expose if comp drivers might sleep
  memtest: use {READ,WRITE}_ONCE in memory scanning
  mm: prohibit the last subpage from reusing the entire large folio
  mm: recover pud_leaf() definitions in nopmd case
  selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
  selftests/mm: skip uffd hugetlb tests with insufficient hugepages
  selftests/mm: dont fail testsuite due to a lack of hugepages
  mm/huge_memory: skip invalid debugfs new_order input for folio split
  mm/huge_memory: check new folio order when split a folio
  mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
  mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
  mm: fix list corruption in put_pages_list
  mm: remove folio from deferred split list before uncharging it
  filemap: avoid unnecessary major faults in filemap_fault()
  mm,page_owner: drop unnecessary check
  mm,page_owner: check for null stack_record before bumping its refcount
  mm: swap: fix race between free_swap_and_cache() and swapoff()
  mm/treewide: align up pXd_leaf() retval across archs
  mm/treewide: drop pXd_large()
  ...
2024-03-14 17:43:30 -07:00
Linus Torvalds
01732755ee Probes updates for v6.9:
- x96/kprobes: Use boolean for some function return instead of 0 and 1.
  - x86/kprobes: Prohibit probing on INT/UD. This prevents user to put kprobe on
     INTn/INT1/INT3/INTO and UD0/UD1/UD2 because these are used for a special
     purpose in the kernel.
  - x86/kprobes: Boost Grp instructions. Because a few percent of kernel
     instructions are Grp 2/3/4/5 and those are safe to be executed without
     ip register fixup, allow those to be boosted (direct execution on the
     trampoline buffer with a JMP).
 
  - tracing/probes: Add function argument access from return events (kretprobe
     and fprobe). This allows user to compare how a data structure field is
     changed after executing a function. With BTF, return event also accepts
     function argument access by name. This also includes below patches;
   . Fix a wrong comment (using "Kretprobe" in fprobe)
   . Cleanup a big probe argument parser function into three parts, type
      parser, post-processing function, and main parser.
   . Cleanup to set nr_args field when initializing trace_probe instead of
      counting up it while parsing.
   . Cleanup a redundant #else block from tracefs/README source code.
   . Update selftests to check entry argument access from return probes.
   . Documentation update about entry argument access from return probes.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmXwW4kbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bH80H/3H6JENlDAjaSLi4vYrP
 Qyw/cOGIuGu8cDEzkkOaFMol3TY23M7tQZH1lFefvV92gebZ0ttXnrQhSsKeO5XT
 PCZ6Eoift5rwJCY967W4V6O0DrAkOGHlPtlKs47APJnTXwn8RcFTqWlQmhWg1AfD
 g/FCWV7cs3eewZgV9iQcLydOoLLgRMr3G3rtPYQbCXhPzze0WTu4dSOXxCTjFe04
 riHQy7R+ut6Cur8njpoqZl6bCMkQqAylByXf6wK96HjcS0+ZI7Ivi8Ey3l2aAFen
 EeIViMU2Bl02XzBszj7Xq2cT/ebYAgDonFW3/5ZKD1YMO6F7wPoVH5OHrQ518Xuw
 hQ8=
 =O6l5
 -----END PGP SIGNATURE-----

Merge tag 'probes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:
 "x86 kprobes:

   - Use boolean for some function return instead of 0 and 1

   - Prohibit probing on INT/UD. This prevents user to put kprobe on
     INTn/INT1/INT3/INTO and UD0/UD1/UD2 because these are used for a
     special purpose in the kernel

   - Boost Grp instructions. Because a few percent of kernel
     instructions are Grp 2/3/4/5 and those are safe to be executed
     without ip register fixup, allow those to be boosted (direct
     execution on the trampoline buffer with a JMP)

  tracing:

   - Add function argument access from return events (kretprobe and
     fprobe). This allows user to compare how a data structure field is
     changed after executing a function. With BTF, return event also
     accepts function argument access by name.

   - Fix a wrong comment (using "Kretprobe" in fprobe)

   - Cleanup a big probe argument parser function into three parts, type
     parser, post-processing function, and main parser

   - Cleanup to set nr_args field when initializing trace_probe instead
     of counting up it while parsing

   - Cleanup a redundant #else block from tracefs/README source code

   - Update selftests to check entry argument access from return probes

   - Documentation update about entry argument access from return
     probes"

* tag 'probes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  Documentation: tracing: Add entry argument access at function exit
  selftests/ftrace: Add test cases for entry args at function exit
  tracing/probes: Support $argN in return probe (kprobe and fprobe)
  tracing: Remove redundant #else block for BTF args from README
  tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_init
  tracing/probes: Cleanup probe argument parser
  tracing/fprobe-event: cleanup: Fix a wrong comment in fprobe event
  x86/kprobes: Boost more instructions from grp2/3/4/5
  x86/kprobes: Prohibit kprobing on INT and UD
  x86/kprobes: Refactor can_{probe,boost} return type to bool
2024-03-14 16:16:33 -07:00
Linus Torvalds
9434467959 ACPI updates for 6.9-rc1
- Rearrange Device Check and Bus Check notification handling in the
    ACPI device hotplug code to make it get the "enabled" _STA bit into
    account (Rafael Wysocki).
 
  - Modify acpi_processor_add() to skip processors with the "enabled"
    _STA bit clear, as per the specification (Rafael Wysocki).
 
  - Stop failing Device Check notification handling without a valid
    reason (Rafael Wysocki).
 
  - Defer enumeration of devices that depend on a device with an ACPI
    device ID equalt to INTC10CF to address probe ordering issues on
    some platforms (Wentong Wu).
 
  - Constify acpi_bus_type (Ricardo Marliere).
 
  - Make the ACPI-specific suspend-to-idle code take the Low-Power S0
    Idle MSFT UUID into account on non-AMD systems (Rafael Wysocki).
 
  - Add ACPI IRQ override quirks for some new platforms (Sergey
    Kalinichev, Maxim Kudinov, Alexey Froloff, Sviatoslav Harasymchuk,
    Nicolas Haye).
 
  - Make the NFIT parsing code use acpi_evaluate_dsm_typed() (Andy
    Shevchenko).
 
  - Fix a memory leak in acpi_processor_power_exit() (Armin Wolf).
 
  - Make it possible to quirk the CSI-2 and MIPI DisCo for Imaging
    properties parsing and add a quirk for Dell XPS 9315 (Sakari Ailus).
 
  - Prevent false-positive static checker warnings from triggering by
    intializing some variables in the ACPI thermal code to zero (Colin
    Ian King).
 
  - Add DELL0501 handling to acpi_quirk_skip_serdev_enumeration() and
    make that function generic (Hans de Goede).
 
  - Make the ACPI backlight code handle fetching EDID that is longer than
    256 bytes (Mario Limonciello).
 
  - Skip initialization of GHES_ASSIST structures for Machine Check
    Architecture in APEI (Avadhut Naik).
 
  - Convert several plaform drivers in the ACPI subsystem to using a
    remove callback that returns void (Uwe Kleine-König).
 
  - Drop the long-deprecated custom_method debugfs interface that is
    problematic from the security standpoint (Rafael Wysocki).
 
  - Use %pe in a couple of places in the ACPI code for easier error
    decoding (Onkarnath).
 
  - Fix register width information handling during system memory
    accesses in the ACPI CPPC library (Jarred White).
 
  - Add AMD CPPC V2 support for family 17h processors to the ACPI CPPC
    library (Perry Yuan).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmXvJFISHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxtWMP/0IySac8M0jkQJrlJeIqXHhV/akwpe5H
 kWtuT8bZ8iiQmB+orDW+hpLhdATc2vv1XPRApnkkd9QtrLEVBBIoLrDxzUJy/tzv
 0qgj2s0A1pc6gRpX1y3Wc94U3PnTzOodVpjZq6L9rGQ3enBkIU7H3CWs2LIq3VfB
 lUSY1dmVB2rv1MHfsoxTM6eiRYEZSAc3v8b0jpIjfaHsxdUPsqZLMPXTVzgGWfu4
 ePCN1oDKX9xQkVV9/hnxBYwTSO+ySBq1jgtG2PaqRmJXIaZR/24A1tHTr2UQvTss
 x39WnjWIrKOwO6TfwoX8KlsdBaJeQo0biH42QQV08UIYUWfQmoUVVffZMPGrlGvV
 F3vPMLZADMROJhtfoD4hXIcj+Asa/uqi6lKN1mctb0akozOGlHbX4yNXq4MZS9Hj
 sXO9gMXgWjh5cnC/NSekcdVbLbARj2g7JWdpq1dZmgh9eaXP07/D6DcrVbcdZSHs
 ySb6DNNAEXPL0PgSU+cLiwRRH43C+ilVz/OMrdHxb4jSuAVDluD4hwi1IwwB4feG
 k0s9i0OQKAkC/9UXcJrlTFs1fE4ftZ0gYVZDiSeDDy9FUo1ZYCRhOP7yfrjoCHhH
 Wc7sllUKHsy1TPi3Wh3ANxUaZhviNn59rL3JPAKeX1Qjx2LB+qHS6j08/v/F3m6W
 Srp8VJsItb/D
 =Yngi
 -----END PGP SIGNATURE-----

Merge tag 'acpi-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI updates from Rafael Wysocki:
 "These modify the ACPI device events and processor enumeration code to
  take the 'enabled' _STA bit into account as mandated by the ACPI
  specification, convert several platform drivers to using a remove
  callback that returns void, add some new quirks for ACPI IRQ override
  and other things, address assorted issues and clean up code.

  Specifics:

   - Rearrange Device Check and Bus Check notification handling in the
     ACPI device hotplug code to make it get the "enabled" _STA bit into
     account (Rafael Wysocki)

   - Modify acpi_processor_add() to skip processors with the "enabled"
     _STA bit clear, as per the specification (Rafael Wysocki)

   - Stop failing Device Check notification handling without a valid
     reason (Rafael Wysocki)

   - Defer enumeration of devices that depend on a device with an ACPI
     device ID equalt to INTC10CF to address probe ordering issues on
     some platforms (Wentong Wu)

   - Constify acpi_bus_type (Ricardo Marliere)

   - Make the ACPI-specific suspend-to-idle code take the Low-Power S0
     Idle MSFT UUID into account on non-AMD systems (Rafael Wysocki)

   - Add ACPI IRQ override quirks for some new platforms (Sergey
     Kalinichev, Maxim Kudinov, Alexey Froloff, Sviatoslav Harasymchuk,
     Nicolas Haye)

   - Make the NFIT parsing code use acpi_evaluate_dsm_typed() (Andy
     Shevchenko)

   - Fix a memory leak in acpi_processor_power_exit() (Armin Wolf)

   - Make it possible to quirk the CSI-2 and MIPI DisCo for Imaging
     properties parsing and add a quirk for Dell XPS 9315 (Sakari Ailus)

   - Prevent false-positive static checker warnings from triggering by
     intializing some variables in the ACPI thermal code to zero (Colin
     Ian King)

   - Add DELL0501 handling to acpi_quirk_skip_serdev_enumeration() and
     make that function generic (Hans de Goede)

   - Make the ACPI backlight code handle fetching EDID that is longer
     than 256 bytes (Mario Limonciello)

   - Skip initialization of GHES_ASSIST structures for Machine Check
     Architecture in APEI (Avadhut Naik)

   - Convert several plaform drivers in the ACPI subsystem to using a
     remove callback that returns void (Uwe Kleine-König)

   - Drop the long-deprecated custom_method debugfs interface that is
     problematic from the security standpoint (Rafael Wysocki)

   - Use %pe in a couple of places in the ACPI code for easier error
     decoding (Onkarnath)

   - Fix register width information handling during system memory
     accesses in the ACPI CPPC library (Jarred White)

   - Add AMD CPPC V2 support for family 17h processors to the ACPI CPPC
     library (Perry Yuan)"

* tag 'acpi-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (35 commits)
  ACPI: resource: Use IRQ override on Maibenben X565
  ACPI: CPPC: Use access_width over bit_width for system memory accesses
  ACPI: CPPC: enable AMD CPPC V2 support for family 17h processors
  ACPI: APEI: Skip initialization of GHES_ASSIST structures for Machine Check Architecture
  ACPI: scan: Consolidate Device Check and Bus Check notification handling
  ACPI: scan: Rework Device Check and Bus Check notification handling
  ACPI: scan: Make acpi_processor_add() check the device enabled bit
  ACPI: scan: Relocate acpi_bus_trim_one()
  ACPI: scan: Fix device check notification handling
  ACPI: resource: Add MAIBENBEN X577 to irq1_edge_low_force_override
  ACPI: pfr_update: Convert to platform remove callback returning void
  ACPI: pfr_telemetry: Convert to platform remove callback returning void
  ACPI: fan: Convert to platform remove callback returning void
  ACPI: GED: Convert to platform remove callback returning void
  ACPI: DPTF: Convert to platform remove callback returning void
  ACPI: AGDI: Convert to platform remove callback returning void
  ACPI: TAD: Convert to platform remove callback returning void
  ACPI: APEI: GHES: Convert to platform remove callback returning void
  ACPI: property: Polish ignoring bad data nodes
  ACPI: thermal_lib: Initialize temp_decik to zero
  ...
2024-03-13 11:54:05 -07:00
Linus Torvalds
07abb19a9b Power management updates for 6.9-rc1
- Allow the Energy Model to be updated dynamically (Lukasz Luba).
 
  - Add support for LZ4 compression algorithm to the hibernation image
    creation and loading code (Nikhil V).
 
  - Fix and clean up system suspend statistics collection (Rafael
    Wysocki).
 
  - Simplify device suspend and resume handling in the power management
    core code (Rafael Wysocki).
 
  - Fix PCI hibernation support description (Yiwei Lin).
 
  - Make hibernation take set_memory_ro() return values into account as
    appropriate (Christophe Leroy).
 
  - Set mem_sleep_current during kernel command line setup to avoid an
    ordering issue with handling it (Maulik Shah).
 
  - Fix wake IRQs handling when pm_runtime_force_suspend() is used as a
    driver's system suspend callback (Qingliang Li).
 
  - Simplify pm_runtime_get_if_active() usage and add a replacement for
    pm_runtime_put_autosuspend() (Sakari Ailus).
 
  - Add a tracepoint for runtime_status changes tracking (Vilas Bhat).
 
  - Fix section title markdown in the runtime PM documentation (Yiwei
    Lin).
 
  - Enable preferred core support in the amd-pstate cpufreq driver (Meng
    Li).
 
  - Fix min_perf assignment in amd_pstate_adjust_perf() and make the
    min/max limit perf values in amd-pstate always stay within the
    (highest perf, lowest perf) range (Tor Vic, Meng Li).
 
  - Allow intel_pstate to assign model-specific values to strings used in
    the EPP sysfs interface and make it do so on Meteor Lake (Srinivas
    Pandruvada).
 
  - Drop long-unused cpudata::prev_cummulative_iowait from the
    intel_pstate cpufreq driver (Jiri Slaby).
 
  - Prevent scaling_cur_freq from exceeding scaling_max_freq when the
    latter is an inefficient frequency (Shivnandan Kumar).
 
  - Change default transition delay in cpufreq to 2ms (Qais Yousef).
 
  - Remove references to 10ms minimum sampling rate from comments in the
    cpufreq code (Pierre Gondois).
 
  - Honour transition_latency over transition_delay_us in cpufreq (Qais
    Yousef).
 
  - Stop unregistering cpufreq cooling on CPU hot-remove (Viresh Kumar).
 
  - General enhancements / cleanups to ARM cpufreq drivers (tianyu2,
    Nícolas F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia
    Belova).
 
  - Update cpufreq-dt-platdev to block/approve devices (Richard Acayan).
 
  - Make the SCMI cpufreq driver get a transition delay value from
    firmware (Pierre Gondois).
 
  - Prevent the haltpoll cpuidle governor from shrinking guest
    poll_limit_ns below grow_start (Parshuram Sangle).
 
  - Avoid potential overflow in integer multiplication when computing
    cpuidle state parameters (C Cheng).
 
  - Adjust MWAIT hint target C-state computation in the ACPI cpuidle
    driver and in intel_idle to return a correct value for C0 (He
    Rongguang).
 
  - Address multiple issues in the TPMI RAPL driver and add support for
    new platforms (Lunar Lake-M, Arrow Lake) to Intel RAPL (Zhang Rui).
 
  - Fix freq_qos_add_request() return value check in dtpm_cpu (Daniel
    Lezcano).
 
  - Fix kernel-doc for dtpm_create_hierarchy() (Yang Li).
 
  - Fix file leak in get_pkg_num() in x86_energy_perf_policy (Samasth
    Norway Ananda).
 
  - Fix cpupower-frequency-info.1 man page typo (Jan Kratochvil).
 
  - Fix a couple of warnings in the OPP core code related to W=1
    builds (Viresh Kumar).
 
  - Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h (Viresh
    Kumar).
 
  - Extend dev_pm_opp_data with turbo support (Sibi Sankar).
 
  - dt-bindings: drop maxItems from inner items (David Heidelberg).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmXvI/ISHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRx24sP/jxg6fOGme8raHQvpTXG3/H56wlGzQ4P
 YUvvKUXnfD3yf1zNISsUl7VQebZqDt8rygkwSdymXlUVZX1eubN0RpCFc0F8GZuc
 THG/YQhYQr/9zro3FpKhfDj5evk21PCQzjf+dGvfQF9qVMxNPG1JzEFK6PnolT5X
 2BvkonY1XFWZjCMbZ83B/jt35lTDb0cmeNbCpfD5UJgcnxmMOtZYpORdyfPWTJpG
 GVCwmAFVVXxXlust/AIpt3mmOpKzSA9GnrtJkhtQe5GN+Y4OjnJiFJmTC7EfCctj
 JlWgVUA716mtFMUrjXgjfI54firF2oQpqaSa2HG/V/A96JWQqjarGz5dAV1IrPEt
 ZmYpvMe4E90S411wF1OWyrEqjXUuDnH1OWUvUdWSt4E7DhFw3esDi/jLW2tyVKAT
 hIy+/O4wzbDSTX/h9Cgt1Qjhew6lKUIwvhEXclB3fuJ+JoviWNkC9lnK93e2H0A3
 VYfkd/lpUD74035l0FrCJ/49MjX9kqrsn+TipHsIlSXAi8ZRdKbVvxOTD8RYudcI
 GvCiDDrkMgNwGlyedgbtTBUepCvSg93b+vVmRj7YMPtBhioOUo3qCn6wpqhxfnth
 9BCnPW7JxqUw/NJdlk9hKumaUZq+MK8G+kdYcIDg6xmAkWSUVP2QKlWavfMCxqRP
 +dN6T2iHsKFe
 =UePT
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "From the functional perspective, the most significant change here is
  the addition of support for Energy Models that can be updated
  dynamically at run time.

  There is also the addition of LZ4 compression support for hibernation,
  the new preferred core support in amd-pstate, new platforms support in
  the Intel RAPL driver, new model-specific EPP handling in intel_pstate
  and more.

  Apart from that, the cpufreq default transition delay is reduced from
  10 ms to 2 ms (along with some related adjustments), the system
  suspend statistics code undergoes a significant rework and there is a
  usual bunch of fixes and code cleanups all over.

  Specifics:

   - Allow the Energy Model to be updated dynamically (Lukasz Luba)

   - Add support for LZ4 compression algorithm to the hibernation image
     creation and loading code (Nikhil V)

   - Fix and clean up system suspend statistics collection (Rafael
     Wysocki)

   - Simplify device suspend and resume handling in the power management
     core code (Rafael Wysocki)

   - Fix PCI hibernation support description (Yiwei Lin)

   - Make hibernation take set_memory_ro() return values into account as
     appropriate (Christophe Leroy)

   - Set mem_sleep_current during kernel command line setup to avoid an
     ordering issue with handling it (Maulik Shah)

   - Fix wake IRQs handling when pm_runtime_force_suspend() is used as a
     driver's system suspend callback (Qingliang Li)

   - Simplify pm_runtime_get_if_active() usage and add a replacement for
     pm_runtime_put_autosuspend() (Sakari Ailus)

   - Add a tracepoint for runtime_status changes tracking (Vilas Bhat)

   - Fix section title markdown in the runtime PM documentation (Yiwei
     Lin)

   - Enable preferred core support in the amd-pstate cpufreq driver
     (Meng Li)

   - Fix min_perf assignment in amd_pstate_adjust_perf() and make the
     min/max limit perf values in amd-pstate always stay within the
     (highest perf, lowest perf) range (Tor Vic, Meng Li)

   - Allow intel_pstate to assign model-specific values to strings used
     in the EPP sysfs interface and make it do so on Meteor Lake
     (Srinivas Pandruvada)

   - Drop long-unused cpudata::prev_cummulative_iowait from the
     intel_pstate cpufreq driver (Jiri Slaby)

   - Prevent scaling_cur_freq from exceeding scaling_max_freq when the
     latter is an inefficient frequency (Shivnandan Kumar)

   - Change default transition delay in cpufreq to 2ms (Qais Yousef)

   - Remove references to 10ms minimum sampling rate from comments in
     the cpufreq code (Pierre Gondois)

   - Honour transition_latency over transition_delay_us in cpufreq (Qais
     Yousef)

   - Stop unregistering cpufreq cooling on CPU hot-remove (Viresh Kumar)

   - General enhancements / cleanups to ARM cpufreq drivers (tianyu2,
     Nícolas F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia
     Belova)

   - Update cpufreq-dt-platdev to block/approve devices (Richard Acayan)

   - Make the SCMI cpufreq driver get a transition delay value from
     firmware (Pierre Gondois)

   - Prevent the haltpoll cpuidle governor from shrinking guest
     poll_limit_ns below grow_start (Parshuram Sangle)

   - Avoid potential overflow in integer multiplication when computing
     cpuidle state parameters (C Cheng)

   - Adjust MWAIT hint target C-state computation in the ACPI cpuidle
     driver and in intel_idle to return a correct value for C0 (He
     Rongguang)

   - Address multiple issues in the TPMI RAPL driver and add support for
     new platforms (Lunar Lake-M, Arrow Lake) to Intel RAPL (Zhang Rui)

   - Fix freq_qos_add_request() return value check in dtpm_cpu (Daniel
     Lezcano)

   - Fix kernel-doc for dtpm_create_hierarchy() (Yang Li)

   - Fix file leak in get_pkg_num() in x86_energy_perf_policy (Samasth
     Norway Ananda)

   - Fix cpupower-frequency-info.1 man page typo (Jan Kratochvil)

   - Fix a couple of warnings in the OPP core code related to W=1 builds
     (Viresh Kumar)

   - Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h (Viresh
     Kumar)

   - Extend dev_pm_opp_data with turbo support (Sibi Sankar)

   - dt-bindings: drop maxItems from inner items (David Heidelberg)"

* tag 'pm-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (95 commits)
  dt-bindings: opp: drop maxItems from inner items
  OPP: debugfs: Fix warning around icc_get_name()
  OPP: debugfs: Fix warning with W=1 builds
  cpufreq: Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h
  OPP: Extend dev_pm_opp_data with turbo support
  Fix cpupower-frequency-info.1 man page typo
  cpufreq: scmi: Set transition_delay_us
  firmware: arm_scmi: Populate fast channel rate_limit
  firmware: arm_scmi: Populate perf commands rate_limit
  cpuidle: ACPI/intel: fix MWAIT hint target C-state computation
  PM: sleep: wakeirq: fix wake irq warning in system suspend
  powercap: dtpm: Fix kernel-doc for dtpm_create_hierarchy() function
  cpufreq: Don't unregister cpufreq cooling on CPU hotplug
  PM: suspend: Set mem_sleep_current during kernel command line setup
  cpufreq: Honour transition_latency over transition_delay_us
  cpufreq: Limit resolving a frequency to policy min/max
  Documentation: PM: Fix runtime_pm.rst markdown syntax
  cpufreq: amd-pstate: adjust min/max limit perf
  cpufreq: Remove references to 10ms min sampling rate
  cpufreq: intel_pstate: Update default EPPs for Meteor Lake
  ...
2024-03-13 11:40:06 -07:00
Wei Yang
9b67ce2c12 x86/vmlinux.lds.S: Take __START_KERNEL out conditional definition
If CONFIG_X86_32=y, the section start address is defined to be
"LOAD_OFFSET + LOAD_PHYSICAL_ADDR", which is the same as
__START_KERNEL_map.

Unify it with the 64-bit definition to simplify the code.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240313075839.8321-5-richard.weiyang@gmail.com
2024-03-13 11:29:11 +01:00
Wei Yang
a5cffd056e x86/vmlinux.lds.S: Remove conditional definition of LOAD_OFFSET
In vmlinux.lds.S, we define LOAD_OFFSET conditionally to __PAGE_OFFSET
or __START_KERNEL_map. While __START_KERNEL_map is already defined to
the same value with the same condition.

So it is fine to define LOAD_OFFSET to __START_KERNEL_map directly.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240313075839.8321-4-richard.weiyang@gmail.com
2024-03-13 11:29:10 +01:00
Linus Torvalds
b29f377119 x86/boot changes for v6.9:
- Continuing work by Ard Biesheuvel to improve the x86 early startup code,
    with the long-term goal to make it position independent:
 
       - Get rid of early accesses to global objects, either by moving them
         to the stack, deferring the access until later, or dropping the
         globals entirely.
 
       - Move all code that runs early via the 1:1 mapping into .head.text,
         and move code that does not out of it, so that build time checks can
         be added later to ensure that no inadvertent absolute references were
         emitted into code that does not tolerate them.
 
       - Remove fixup_pointer() and occurrences of __pa_symbol(), which rely
         on the compiler emitting absolute references, which is not guaranteed.
 
  - Improve the early console code.
 
  - Add early console message about ignored NMIs, so that users are at least
    warned about their existence - even if we cannot do anything about them.
 
  - Improve the kexec code's kernel load address handling.
 
  - Enable more X86S (simplified x86) bits.
 
  - Simplify early boot GDT handling
 
  - Micro-optimize the boot code a bit
 
  - Misc cleanups.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmXwIg8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jVHg//bzqXyzhoppEP4QMPVEHQdhy3UN33djwF
 HsjNgw/V1P5O5CPvQehCOgrJOcQ8LLPSA68ugG7FY9mzBjvnGnINXzWzukaaQGTh
 EXIwz/uw2++m3JMDt2PAzfeNZ8LlHb8V2xgexfkBFE7O3BX6ThIg9BKaFH1n7XOY
 AQXRRxlB5YThS3Rcqqeo/jN9bQZn7crqeWVS5Dk0bL1f53Y8SJjKIA4mHUb4xjbo
 LX0Z61G9Qz5e26U1U89tloW82zmiD/pvvuIQUnVVtPVMhSoFKhrxYI9MTPLjj0vt
 p+5UwMutFdJyjbTIsito7YSE6OG6RA2d1uoQjTQCx0sr6NtABbDE5QrciQTfHRGa
 1TyScbineiCf3GtQMuDRAKTbaUzWlUzmk9SrpUxK8UR+R6xVvA4GElUUvGe0/dKh
 QnYD+i6wr71S80t3gHqbBGcs4xjUS5rmpTXJ86VPp9hHB+l/2tvBnNro1JNxM/Ei
 wchQLHbaeWwztnceaGOWlsfAln0prtIYvVOUeTbn6rUFTjgSE2kS2h6GD3h3ZVnM
 az5G+bhjWm6eDL6QoBN6XsZ1UF0O7hcjOa2UpS8N1ek0b4E/LVwtMnmpexM09ehE
 FoBAsxYy5SuGCYab636rMmAmHwRjDozwNNJG+6RrrAYwqoQDqKiSnIismJwcOEKD
 6UzK/KBwxuI=
 =zvw3
 -----END PGP SIGNATURE-----

Merge tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 boot updates from Ingo Molnar:

 - Continuing work by Ard Biesheuvel to improve the x86 early startup
   code, with the long-term goal to make it position independent:

      - Get rid of early accesses to global objects, either by moving
        them to the stack, deferring the access until later, or dropping
        the globals entirely

      - Move all code that runs early via the 1:1 mapping into
        .head.text, and move code that does not out of it, so that build
        time checks can be added later to ensure that no inadvertent
        absolute references were emitted into code that does not
        tolerate them

      - Remove fixup_pointer() and occurrences of __pa_symbol(), which
        rely on the compiler emitting absolute references, which is not
        guaranteed

 - Improve the early console code

 - Add early console message about ignored NMIs, so that users are at
   least warned about their existence - even if we cannot do anything
   about them

 - Improve the kexec code's kernel load address handling

 - Enable more X86S (simplified x86) bits

 - Simplify early boot GDT handling

 - Micro-optimize the boot code a bit

 - Misc cleanups

* tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  x86/sev: Move early startup code into .head.text section
  x86/sme: Move early SME kernel encryption handling into .head.text
  x86/boot: Move mem_encrypt= parsing to the decompressor
  efi/libstub: Add generic support for parsing mem_encrypt=
  x86/startup_64: Simplify virtual switch on primary boot
  x86/startup_64: Simplify calculation of initial page table address
  x86/startup_64: Defer assignment of 5-level paging global variables
  x86/startup_64: Simplify CR4 handling in startup code
  x86/boot: Use 32-bit XOR to clear registers
  efi/x86: Set the PE/COFF header's NX compat flag unconditionally
  x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[]
  x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]
  x86/boot/64: Use RIP_REL_REF() to access early page tables
  x86/boot/64: Use RIP_REL_REF() to access '__supported_pte_mask'
  x86/boot/64: Use RIP_REL_REF() to access early_dynamic_pgts[]
  x86/boot/64: Use RIP_REL_REF() to assign 'phys_base'
  x86/boot/64: Simplify global variable accesses in GDT/IDT programming
  x86/trampoline: Bypass compat mode in trampoline_start64() if not needed
  kexec: Allocate kernel above bzImage's pref_address
  x86/boot: Add a message about ignored early NMIs
  ...
2024-03-12 09:58:57 -07:00
Linus Torvalds
0e33cf955f * Mitigate RFDS vulnerability
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmXvZgoACgkQaDWVMHDJ
 krC2Eg//aZKBp97/DSzRqXKDwJzVUr0sGJ9cii0gVT1sI+1U6ZZCh/roVH4xOT5/
 HqtOOnQ+X0mwUx2VG3Yv2VPI7VW68sJ3/y9D8R4tnMEsyQ4CmDw96Pre3NyKr/Av
 jmW7SK94fOkpNFJOMk3zpk7GtRUlCsVkS1P61dOmMYduguhel/V20rWlx83BgnAY
 Rf/c3rBjqe8Ri3rzBP5icY/d6OgwoafuhME31DD/j6oKOh+EoQBvA4urj46yMTMX
 /mrK7hCm/wqwuOOvgGbo7sfZNBLCYy3SZ3EyF4beDERhPF1DaSvCwOULpGVJroqu
 SelFsKXAtEbYrDgsan+MYlx3bQv43q7PbHska1gjkH91plO4nAsssPr5VsusUKmT
 sq8jyBaauZb40oLOSgooL4RqAHrfs8q5695Ouwh/DB/XovMezUI1N/BkpGFmqpJI
 o2xH9P5q520pkB8pFhN9TbRuFSGe/dbWC24QTq1DUajo3M3RwcwX6ua9hoAKLtDF
 pCV5DNcVcXHD3Cxp0M5dQ5JEAiCnW+ZpUWgxPQamGDNW5PEvjDmFwql2uWw/qOuW
 lkheOIffq8ejUBQFbN8VXfIzzeeKQNFiIcViaqGITjIwhqdHAzVi28OuIGwtdh3g
 ywLzSC8yvyzgKrNBgtFMr3ucKN0FoPxpBro253xt2H7w8srXW64=
 =5V9t
 -----END PGP SIGNATURE-----

Merge tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 RFDS mitigation from Dave Hansen:
 "RFDS is a CPU vulnerability that may allow a malicious userspace to
  infer stale register values from kernel space. Kernel registers can
  have all kinds of secrets in them so the mitigation is basically to
  wait until the kernel is about to return to userspace and has user
  values in the registers. At that point there is little chance of
  kernel secrets ending up in the registers and the microarchitectural
  state can be cleared.

  This leverages some recent robustness fixes for the existing MDS
  vulnerability. Both MDS and RFDS use the VERW instruction for
  mitigation"

* tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests
  x86/rfds: Mitigate Register File Data Sampling (RFDS)
  Documentation/hw-vuln: Add documentation for RFDS
  x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set
2024-03-12 09:31:39 -07:00
Ingo Molnar
2e2bc42c83 Merge branch 'linus' into x86/boot, to resolve conflict
There's a new conflict with Linus's upstream tree, because
in the following merge conflict resolution in <asm/coco.h>:

  38b334fc76 Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Linus has resolved the conflicting placement of 'cc_mask' better
than the original commit:

  1c811d403a x86/sev: Fix position dependent variable references in startup code

... which was also done by an internal merge resolution:

  2e5fc4786b Merge branch 'x86/sev' into x86/boot, to resolve conflicts and to pick up dependent tree

But Linus is right in 38b334fc76, the 'cc_mask' declaration is sufficient
within the #ifdef CONFIG_ARCH_HAS_CC_PLATFORM block.

So instead of forcing Linus to do the same resolution again, merge in Linus's
tree and follow his conflict resolution.

 Conflicts:
	arch/x86/include/asm/coco.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-03-12 09:55:57 +01:00
Nuno Das Neves
410779d8d8 mshyperv: Introduce hv_get_hypervisor_version function
Introduce x86_64 and arm64 functions to get the hypervisor version
information and store it in a structure for simpler parsing.

Use the new function to get and parse the version at boot time. While at
it, move the printing code to hv_common_init() so it is not duplicated.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1709852618-29110-1-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1709852618-29110-1-git-send-email-nunodasneves@linux.microsoft.com>
2024-03-12 05:17:50 +00:00
Linus Torvalds
685d982112 Core x86 changes for v6.9:
- The biggest change is the rework of the percpu code,
   to support the 'Named Address Spaces' GCC feature,
   by Uros Bizjak:
 
    - This allows C code to access GS and FS segment relative
      memory via variables declared with such attributes,
      which allows the compiler to better optimize those accesses
      than the previous inline assembly code.
 
    - The series also includes a number of micro-optimizations
      for various percpu access methods, plus a number of
      cleanups of %gs accesses in assembly code.
 
    - These changes have been exposed to linux-next testing for
      the last ~5 months, with no known regressions in this area.
 
 - Fix/clean up __switch_to()'s broken but accidentally
   working handling of FPU switching - which also generates
   better code.
 
 - Propagate more RIP-relative addressing in assembly code,
   to generate slightly better code.
 
 - Rework the CPU mitigations Kconfig space to be less idiosyncratic,
   to make it easier for distros to follow & maintain these options.
 
 - Rework the x86 idle code to cure RCU violations and
   to clean up the logic.
 
 - Clean up the vDSO Makefile logic.
 
 - Misc cleanups and fixes.
 
 [ Please note that there's a higher number of merge commits in
   this branch (three) than is usual in x86 topic trees. This happened
   due to the long testing lifecycle of the percpu changes that
   involved 3 merge windows, which generated a longer history
   and various interactions with other core x86 changes that we
   felt better about to carry in a single branch. ]
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmXvB0gRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jUqRAAqnEQPiabF5acQlHrwviX+cjSobDlqtH5
 9q2AQy9qaEHapzD0XMOxvFye6XIvehGOGxSPvk6CoviSxBND8rb56lvnsEZuLeBV
 Bo5QSIL2x42Zrvo11iPHwgXZfTIusU90sBuKDRFkYBAxY3HK2naMDZe8MAsYCUE9
 nwgHF8DDc/NYiSOXV8kosWoWpNIkoK/STyH5bvTQZMqZcwyZ49AIeP1jGZb/prbC
 e/rbnlrq5Eu6brpM7xo9kELO0Vhd34urV14KrrIpdkmUKytW2KIsyvW8D6fqgDBj
 NSaQLLcz0pCXbhF+8Nqvdh/1coR4L7Ymt08P1rfEjCsQgb/2WnSAGUQuC5JoGzaj
 ngkbFcZllIbD9gNzMQ1n4Aw5TiO+l9zxCqPC/r58Uuvstr+K9QKlwnp2+B3Q73Ft
 rojIJ04NJL6lCHdDgwAjTTks+TD2PT/eBWsDfJ/1pnUWttmv9IjMpnXD5sbHxoiU
 2RGGKnYbxXczYdq/ALYDWM6JXpfnJZcXL3jJi0IDcCSsb92xRvTANYFHnTfyzGfw
 EHkhbF4e4Vy9f6QOkSP3CvW5H26BmZS9DKG0J9Il5R3u2lKdfbb5vmtUmVTqHmAD
 Ulo5cWZjEznlWCAYSI/aIidmBsp9OAEvYd+X7Z5SBIgTfSqV7VWHGt0BfA1heiVv
 F/mednG0gGc=
 =3v4F
 -----END PGP SIGNATURE-----

Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core x86 updates from Ingo Molnar:

 - The biggest change is the rework of the percpu code, to support the
   'Named Address Spaces' GCC feature, by Uros Bizjak:

      - This allows C code to access GS and FS segment relative memory
        via variables declared with such attributes, which allows the
        compiler to better optimize those accesses than the previous
        inline assembly code.

      - The series also includes a number of micro-optimizations for
        various percpu access methods, plus a number of cleanups of %gs
        accesses in assembly code.

      - These changes have been exposed to linux-next testing for the
        last ~5 months, with no known regressions in this area.

 - Fix/clean up __switch_to()'s broken but accidentally working handling
   of FPU switching - which also generates better code

 - Propagate more RIP-relative addressing in assembly code, to generate
   slightly better code

 - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to
   make it easier for distros to follow & maintain these options

 - Rework the x86 idle code to cure RCU violations and to clean up the
   logic

 - Clean up the vDSO Makefile logic

 - Misc cleanups and fixes

* tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  x86/idle: Select idle routine only once
  x86/idle: Let prefer_mwait_c1_over_halt() return bool
  x86/idle: Cleanup idle_setup()
  x86/idle: Clean up idle selection
  x86/idle: Sanitize X86_BUG_AMD_E400 handling
  sched/idle: Conditionally handle tick broadcast in default_idle_call()
  x86: Increase brk randomness entropy for 64-bit systems
  x86/vdso: Move vDSO to mmap region
  x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
  x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
  x86/retpoline: Ensure default return thunk isn't used at runtime
  x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
  x86/vdso: Use $(addprefix ) instead of $(foreach )
  x86/vdso: Simplify obj-y addition
  x86/vdso: Consolidate targets and clean-files
  x86/bugs: Rename CONFIG_RETHUNK              => CONFIG_MITIGATION_RETHUNK
  x86/bugs: Rename CONFIG_CPU_SRSO             => CONFIG_MITIGATION_SRSO
  x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY       => CONFIG_MITIGATION_IBRS_ENTRY
  x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY      => CONFIG_MITIGATION_UNRET_ENTRY
  x86/bugs: Rename CONFIG_SLS                  => CONFIG_MITIGATION_SLS
  ...
2024-03-11 19:53:15 -07:00
Linus Torvalds
fcc196579a Misc cleanups, including a large series from Thomas Gleixner to
cure Sparse warnings.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmXvAFQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hkDRAAwASVCQ88kiGqNQtHibXlK54mAFGsc0xv
 T8OPds15DUzoLg/y8lw0X0DHly6MdGXVmygybejNIw2BN4lhLjQ7f4Ria7rv7LDy
 FcI1jfvysEMyYRFHGRefb/GBFzuEfKoROwf+QylGmKz0ZK674gNMngsI9pwOBdbe
 wElq3IkHoNuTUfH9QA4BvqGam1n122nvVTop3g0PMHWzx9ky8hd/BEUjXFZhfINL
 zZk3fwUbER2QYbhHt+BN2GRbdf2BrKvqTkXpKxyXTdnpiqAo0CzBGKerZ62H82qG
 n737Nib1lrsfM5yDHySnau02aamRXaGvCJUd6gpac1ZmNpZMWhEOT/0Tr/Nj5ztF
 lUAvKqMZn/CwwQky1/XxD0LHegnve0G+syqQt/7x7o1ELdiwTzOWMCx016UeodzB
 yyHf3Xx9J8nt3snlrlZBaGEfegg9ePLu5Vir7iXjg3vrloUW8A+GZM62NVxF4HVV
 QWF80BfWf8zbLQ/OS1382t1shaioIe5pEXzIjcnyVIZCiiP2/5kP2O6P4XVbwVlo
 Ca5eEt8U1rtsLUZaCzI2ZRTQf/8SLMQWyaV+ZmkVwcVdFoARC31EgdE5wYYoZOf6
 7Vl+rXd+rZCuTWk0ZgznCZEm75aaqukaQCBa2V8hIVociLFVzhg/Tjedv7s0CspA
 hNfxdN1LDZc=
 =0eJ7
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Ingo Molnar:
 "Misc cleanups, including a large series from Thomas Gleixner to cure
  sparse warnings"

* tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/nmi: Drop unused declaration of proc_nmi_enabled()
  x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables
  x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation
  x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current
  x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address()
  x86/percpu: Cure per CPU madness on UP
  smp: Consolidate smp_prepare_boot_cpu()
  x86/msr: Add missing __percpu annotations
  x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h>
  perf/x86/amd/uncore: Fix __percpu annotation
  x86/nmi: Remove an unnecessary IS_ENABLED(CONFIG_SMP)
  x86/apm_32: Remove dead function apm_get_battery_status()
  x86/insn-eval: Fix function param name in get_eff_addr_sib()
2024-03-11 19:37:56 -07:00
Linus Torvalds
d69ad12c78 x86/build changes for v6.9:
- Reduce <asm/bootparam.h> dependencies
 - Simplify <asm/efi.h>
 - Unify *_setup_data definitions into <asm/setup_data.h>
 - Reduce the size of <asm/bootparam.h>
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmXu+VERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jQCxAAiESAaRnUY3IzENu502LHWdUUihbgCUdp
 zNE5GDX4+FCt4w7DXUGbkoRchsrZEISR4LeEmuQ29wkvclPOhr9LlI3uNpM4l/E+
 e52B8/ig6Yd+D3g7FL7ck+OnTjEQ+V/SifR/5YGKr5TownLoCJXBlitaZsShvVcT
 70+NN/BiJC/n3D8/CYzFUYB6uj3YjZYidFb0dTyJOCVEJxe5m0NCQAtk3bMovwpl
 xmvqVs++VFCEYdcTxK40XBlbcP6KF5DZFVvGw9/vKdU6TKsXwCkrh7GCiFXOJ8bj
 vEHuFAx9tspAaAAnVCQCp42RLbjldvSqGCmif/iswN8JLwAd1FwWf0VXQJaf1qtZ
 XDB+KBRDIrM+arD9qrZb6ghYkenovq8yyEwXETHq79h7ICpCAqm9XE2PQKP/IJZ6
 7A1zdXnHaa/VJEKUZg7Jg9E9c1BsqXCGrOUpLIuEnks//nNgU68JbsRr+9LF9UnB
 LEPQBUuAwPR8cb+JVmN7NNOJpCrjIikx2yKU+BJ5ywCZ5qKs7VA6IxbPLvtBVEv7
 eokYFHJb4Wzgauxxisy6KaaLJc+hIz680bMfjMBFnZ95cgh7ZYTMxO0G0eozAVNX
 BzOQTfPocLBWJ4qiyMnItvWKE1ioUjcWneq46Y+njD5Ow66H/Y/uOmPa3dBj9AxD
 aGkMg3ceTy0=
 =leh5
 -----END PGP SIGNATURE-----

Merge tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 build updates from Ingo Molnar:

 - Reduce <asm/bootparam.h> dependencies

 - Simplify <asm/efi.h>

 - Unify *_setup_data definitions into <asm/setup_data.h>

 - Reduce the size of <asm/bootparam.h>

* tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Do not include <asm/bootparam.h> in several files
  x86/efi: Implement arch_ima_efi_boot_mode() in source file
  x86/setup: Move internal setup_data structures into setup_data.h
  x86/setup: Move UAPI setup structures into setup_data.h
2024-03-11 19:23:16 -07:00