mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-09-04 20:19:47 +08:00
0d60fd5032
743 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
9132d720eb |
x86/apic: Wrap APIC ID validation into an inline
Prepare for removing the callback and making this as simple comparison to an upper limit, which is the obvious solution to do for limit checks... Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest) |
||
![]() |
79c9a17c16 |
x86/apic/32: Decrapify the def_bigsmp mechanism
If the system has more than 8 CPUs then XAPIC and the bigsmp APIC driver is required. This is ensured via: 1) Enumerating all possible CPUs up to NR_CPUS 2) Checking at boot CPU APIC setup time whether the system has more than 8 CPUs and has an XAPIC. If that's the case then it's attempted to install the bigsmp APIC driver and a magic variable 'def_to_bigsmp' is set to one. 3) If that magic variable is set and CONFIG_X86_BIGSMP=n and the system has more than 8 CPUs smp_sanity_check() removes all CPUs >= #8 from the present and possible mask in the most convoluted way. This logic is completely broken for the case where the bigsmp driver is enabled, but not selected due to a command line option specifying the default APIC. In that case the system boots with default APIC in logical destination mode and fails to reduce the number of CPUs. That aside the above which is sprinkled over 3 different places is yet another piece of art. It would have been too obvious to check the requirements upfront and limit nr_cpu_ids _before_ enumerating tons of CPUs and then removing them again. Implement exactly this. Check the bigsmp requirement when the boot APIC is registered which happens _before_ ACPI/MPTABLE parsing and limit the number of CPUs to 8 if it can't be used. Switch it over when the boot CPU apic is set up if necessary. [ dhansen: fix nr_cpu_ids off-by-one in default_setup_apic_routing() ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest) |
||
![]() |
55cc40d3df |
x86/apic: Nuke another processor check
The boot CPUs local APIC is now always registered, so there is no point to have another unreadable validatation for it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest) |
||
![]() |
e8122513ff |
x86/apic: Sanitize num_processors handling
num_processors is 0 by default and only gets incremented when local APICs are registered. Make init_apic_mappings(), which tries to enable the local APIC in the case that no SMP configuration was found set num_processors to 1. This allows to remove yet another check for the local APIC and yet another place which registers the boot CPUs local APIC ID. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest) |
||
![]() |
249ada2c82 |
x86/apic: Remove the pointless APIC version check
This historical leftover is really uninteresting today. Whatever MPTABLE or MADT delivers we only trust the hardware anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest) |
||
![]() |
1d90c9f731 |
x86/apic: Nuke unused apic::inquire_remote_apic()
Put it to the other historical leftovers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest) |
||
![]() |
a6625b473b |
x86/apic: Get rid of hard_smp_processor_id()
No point in having a wrapper around read_apic_id(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest) |
||
![]() |
d7114f83ee |
x86/smpboot: Change smp_store_boot_cpu_info() to static
The function is only used locally. Convert it to a static one. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230727180533.3119660-4-sohil.mehta@intel.com |
||
![]() |
52defa4a5e |
x86/smpboot: Remove a stray comment about CPU hotplug
This old comment is irrelavant to the logic of disabling interrupts and could be misleading. Remove it. Now, hlt_play_dead() resembles the code that the comment was initially added for, but, it doesn't make sense anymore because an offlined cpu could also be put into other states such as mwait. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230727180533.3119660-2-sohil.mehta@intel.com |
||
![]() |
91b4a7dbfe |
cpu/SMT: Remove topology_smt_supported()
Since the maximum number of threads is now passed to cpu_smt_set_num_threads(), checking that value is enough to know whether SMT is supported. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20230705145143.40545-6-ldufour@linux.ibm.com |
||
![]() |
17953249bf |
x86/sched: Enable cluster scheduling on Hybrid
With the SMT vs non-SMT balancing issues sorted, also enable the cluster domain for Hybrid machines. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> |
||
![]() |
e3da8db055 |
A single fix for the mechanism to park CPUs with an INIT IPI.
On shutdown or kexec, the kernel tries to park the non-boot CPUs with an INIT IPI. But the same code path is also used by the crash utility. If the CPU which panics is not the boot CPU then it sends an INIT IPI to the boot CPU which resets the machine. Prevent this by validating that the CPU which runs the stop mechanism is the boot CPU. If not, leave the other CPUs in HLT. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSqoEcTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoYNSEACwo5zgibek27qeMvJGfNztm0qRa4mw wN0qV31yaNcEfhqL8bMU8n3wvEA+pZBqhaU5fyalY+yxc29jI/j9eda5zR+Fi9e5 kVyFT2M0rVSDLFraoQeD+T/tSSK2MJtswF12ytY5mHzHMCb6Uy9fNCpUiQlB+i81 AcnlKQk9ifZXFdMJPj5E+E6l776T8NZPoYEdFgJloxaYOGTdFJDWDlryx4LD7Urz Fx/ec8Ug/FYSPl2XzXHugvHjNefxKoomcZ3v3CSZonBcav7Gz6F06HAR5vVRWSHx 4Dlh6zdy+60YKBmkvpb+RJIBMo8aXclwT+tntaoJvGHZ+PNASO6JVz9PvmoNgfWK Oy2n1K687qIOY6d+yxUZgbZpwXX5bG6kc0xbicUNigGagrYTfd83G5RAfwxNkqsY 23Qw4Ue8uxve4M8iM/FfxKIShuDBiLCIDDIrWDjEkvIAnr1pd+NPUv7kqOTI7Kz/ srNgcOwalypzuS93lgaN1yjRv1mmaPXhdhjy0DwGbC54bKgNzfq+7z75Ibn0dSFF JUPFVjztB+ymnM6PJ1dR77SvPi+xOi60nw7L+Qu9US4yKkW0NeGiIWVsggNorbU6 UPFSE5gxwFD0w1EZ9W+IDeOZUNhjJUINZsn8txm+tb+oEqTIGRPHPOo0C1dBmLW9 AmDIeHljj0iWIw== =DOCF -----END PGP SIGNATURE----- Merge tag 'x86-core-2023-07-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Thomas Gleixner: "A single fix for the mechanism to park CPUs with an INIT IPI. On shutdown or kexec, the kernel tries to park the non-boot CPUs with an INIT IPI. But the same code path is also used by the crash utility. If the CPU which panics is not the boot CPU then it sends an INIT IPI to the boot CPU which resets the machine. Prevent this by validating that the CPU which runs the stop mechanism is the boot CPU. If not, leave the other CPUs in HLT" * tag 'x86-core-2023-07-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/smp: Don't send INIT to boot CPU |
||
![]() |
b1472a60a5 |
x86/smp: Don't send INIT to boot CPU
Parking CPUs in INIT works well, except for the crash case when the CPU
which invokes smp_park_other_cpus_in_init() is not the boot CPU. Sending
INIT to the boot CPU resets the whole machine.
Prevent this by validating that this runs on the boot CPU. If not fall back
and let CPUs hang in HLT.
Fixes:
|
||
![]() |
ed3b7923a8 |
Scheduler changes for v6.5:
- Scheduler SMP load-balancer improvements: - Avoid unnecessary migrations within SMT domains on hybrid systems. Problem: On hybrid CPU systems, (processors with a mixture of higher-frequency SMT cores and lower-frequency non-SMT cores), under the old code lower-priority CPUs pulled tasks from the higher-priority cores if more than one SMT sibling was busy - resulting in many unnecessary task migrations. Solution: The new code improves the load balancer to recognize SMT cores with more than one busy sibling and allows lower-priority CPUs to pull tasks, which avoids superfluous migrations and lets lower-priority cores inspect all SMT siblings for the busiest queue. - Implement the 'runnable boosting' feature in the EAS balancer: consider CPU contention in frequency, EAS max util & load-balance busiest CPU selection. This improves CPU utilization for certain workloads, while leaves other key workloads unchanged. - Scheduler infrastructure improvements: - Rewrite the scheduler topology setup code by consolidating it into the build_sched_topology() helper function and building it dynamically on the fly. - Resolve the local_clock() vs. noinstr complications by rewriting the code: provide separate sched_clock_noinstr() and local_clock_noinstr() functions to be used in instrumentation code, and make sure it is all instrumentation-safe. - Fixes: - Fix a kthread_park() race with wait_woken() - Fix misc wait_task_inactive() bugs unearthed by the -rt merge: - Fix UP PREEMPT bug by unifying the SMP and UP implementations. - Fix task_struct::saved_state handling. - Fix various rq clock update bugs, unearthed by turning on the rq clock debugging code. - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger by creating enough cgroups, by removing the warnign and restricting window size triggers to PSI file write-permission or CAP_SYS_RESOURCE. - Propagate SMT flags in the topology when removing degenerate domain - Fix grub_reclaim() calculation bug in the deadline scheduler code - Avoid resetting the min update period when it is unnecessary, in psi_trigger_destroy(). - Don't balance a task to its current running CPU in load_balance(), which was possible on certain NUMA topologies with overlapping groups. - Fix the sched-debug printing of rq->nr_uninterruptible - Cleanups: - Address various -Wmissing-prototype warnings, as a preparation to (maybe) enable this warning in the future. - Remove unused code - Mark more functions __init - Fix shadow-variable warnings Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSatWQRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1j62xAAuGOx1LcDfRGC6WGQzp1zOdlsVQtnDvlS qL58zYSHgizprpVQ3j87SBaG4CHCdvd2Bo36yW0lNZS4nd203qdq7fkrMb3hPP/w egUQUzMegf5fF6BWldKeMjuHSt+twFQz/ZAKK8iSbAir6CHNAqbNst1oL0i/+Tyk o33hBs1hT5tnbFb1NSVZkX4k+qT3LzTW4K2QgjjGtkScr6yHh2BdEVefyigWOjdo 9s02d00ll9a2r+F5txlN7Dnw6TN7rmTXGMOJU5bZvBE90/anNiAorMXHJdEKCyUR u9+JtBdJWiCplGa/tSRcxT16ZW1VdtTnd9q66TDhXREd2UNDFqBEyg5Wl77K4Tlf vKFajmj/to+cTbuv6m6TVR+zyXpdEpdL6F04P44U3qiJvDobBqeDNKHHIqpmbHXl AXUXcPWTVAzXX1Ce5M+BeAgTBQ1T7C5tELILrTNQHJvO1s9VVBRFZ/l65Ps4vu7T wIZ781IFuopk0zWqHovNvgKrJ7oFmOQQZFttQEe8n6nafkjI7u+IZ8FayiGaUMRr 4GawFGUCEdYh8z9qyslGKe8Q/Rphfk6hxMFRYUJpDmubQ0PkMeDjDGq77jDGl1PF VqwSDEyOaBJs7Gqf/mem00JtzBmXhkhm1SEjggHMI2IQbr/eeBXoLQOn3CDapO/N PiDbtX760ic= =EWQA -----END PGP SIGNATURE----- Merge tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Scheduler SMP load-balancer improvements: - Avoid unnecessary migrations within SMT domains on hybrid systems. Problem: On hybrid CPU systems, (processors with a mixture of higher-frequency SMT cores and lower-frequency non-SMT cores), under the old code lower-priority CPUs pulled tasks from the higher-priority cores if more than one SMT sibling was busy - resulting in many unnecessary task migrations. Solution: The new code improves the load balancer to recognize SMT cores with more than one busy sibling and allows lower-priority CPUs to pull tasks, which avoids superfluous migrations and lets lower-priority cores inspect all SMT siblings for the busiest queue. - Implement the 'runnable boosting' feature in the EAS balancer: consider CPU contention in frequency, EAS max util & load-balance busiest CPU selection. This improves CPU utilization for certain workloads, while leaves other key workloads unchanged. Scheduler infrastructure improvements: - Rewrite the scheduler topology setup code by consolidating it into the build_sched_topology() helper function and building it dynamically on the fly. - Resolve the local_clock() vs. noinstr complications by rewriting the code: provide separate sched_clock_noinstr() and local_clock_noinstr() functions to be used in instrumentation code, and make sure it is all instrumentation-safe. Fixes: - Fix a kthread_park() race with wait_woken() - Fix misc wait_task_inactive() bugs unearthed by the -rt merge: - Fix UP PREEMPT bug by unifying the SMP and UP implementations - Fix task_struct::saved_state handling - Fix various rq clock update bugs, unearthed by turning on the rq clock debugging code. - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger by creating enough cgroups, by removing the warnign and restricting window size triggers to PSI file write-permission or CAP_SYS_RESOURCE. - Propagate SMT flags in the topology when removing degenerate domain - Fix grub_reclaim() calculation bug in the deadline scheduler code - Avoid resetting the min update period when it is unnecessary, in psi_trigger_destroy(). - Don't balance a task to its current running CPU in load_balance(), which was possible on certain NUMA topologies with overlapping groups. - Fix the sched-debug printing of rq->nr_uninterruptible Cleanups: - Address various -Wmissing-prototype warnings, as a preparation to (maybe) enable this warning in the future. - Remove unused code - Mark more functions __init - Fix shadow-variable warnings" * tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle() sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop() sched/core: Fixed missing rq clock update before calling set_rq_offline() sched/deadline: Update GRUB description in the documentation sched/deadline: Fix bandwidth reclaim equation in GRUB sched/wait: Fix a kthread_park race with wait_woken() sched/topology: Mark set_sched_topology() __init sched/fair: Rename variable cpu_util eff_util arm64/arch_timer: Fix MMIO byteswap sched/fair, cpufreq: Introduce 'runnable boosting' sched/fair: Refactor CPU utilization functions cpuidle: Use local_clock_noinstr() sched/clock: Provide local_clock_noinstr() x86/tsc: Provide sched_clock_noinstr() clocksource: hyper-v: Provide noinstr sched_clock() clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX x86/vdso: Fix gettimeofday masking math64: Always inline u128 version of mul_u64_u64_shr() s390/time: Provide sched_clock_noinstr() loongarch: Provide noinstr sched_clock_read() ... |
||
![]() |
88afbb21d4 |
A set of fixes for kexec(), reboot and shutdown issues
- Ensure that the WBINVD in stop_this_cpu() has been completed before the control CPU proceedes. stop_this_cpu() is used for kexec(), reboot and shutdown to park the APs in a HLT loop. The control CPU sends an IPI to the APs and waits for their CPU online bits to be cleared. Once they all are marked "offline" it proceeds. But stop_this_cpu() clears the CPU online bit before issuing WBINVD, which means there is no guarantee that the AP has reached the HLT loop. This was reported to cause intermittent reboot/shutdown failures due to some dubious interaction with the firmware. This is not only a problem of WBINVD. The code to actually "stop" the CPU which runs between clearing the online bit and reaching the HLT loop can cause large enough delays on its own (think virtualization). That's especially dangerous for kexec() as kexec() expects that all APs are in a safe state and not executing code while the boot CPU jumps to the new kernel. There are more issues vs. kexec() which are addressed separately. Cure this by implementing an explicit synchronization point right before the AP reaches HLT. This guarantees that the AP has completed the full stop proceedure. - Fix the condition for WBINVD in stop_this_cpu(). The WBINVD in stop_this_cpu() is required for ensuring that when switching to or from memory encryption no dirty data is left in the cache lines which might cause a write back in the wrong more later. This checks CPUID directly because the feature bit might have been cleared due to a command line option. But that CPUID check accesses leaf 0x8000001f::EAX unconditionally. Intel CPUs return the content of the highest supported leaf when a non-existing leaf is read, while AMD CPUs return all zeros for unsupported leafs. So the result of the test on Intel CPUs is lottery and on AMD its just correct by chance. While harmless it's incorrect and causes the conditional wbinvd() to be issued where not required, which caused the above issue to be unearthed. - Make kexec() robust against AP code execution Ashok observed triple faults when doing kexec() on a system which had been booted with "nosmt". It turned out that the SMT siblings which had been brought up partially are parked in mwait_play_dead() to enable power savings. mwait_play_dead() is monitoring the thread flags of the AP's idle task, which has been chosen as it's unlikely to be written to. But kexec() can overwrite the previous kernel text and data including page tables etc. When it overwrites the cache lines monitored by an AP that AP resumes execution after the MWAIT on eventually overwritten text, stack and page tables, which obviously might end up in a triple fault easily. Make this more robust in several steps: 1) Use an explicit per CPU cache line for monitoring. 2) Write a command to these cache lines to kick APs out of MWAIT before proceeding with kexec(), shutdown or reboot. The APs confirm the wakeup by writing status back and then enter a HLT loop. 3) If the system uses INIT/INIT/STARTUP for AP bringup, park the APs in INIT state. HLT is not a guarantee that an AP won't wake up and resume execution. HLT is woken up by NMI and SMI. SMI puts the CPU back into HLT (+/- firmware bugs), but NMI is delivered to the CPU which executes the NMI handler. Same issue as the MWAIT scenario described above. Sending an INIT/INIT sequence to the APs puts them into wait for STARTUP state, which is safe against NMI. There is still an issue remaining which can't be fixed: #MCE If the AP sits in HLT and receives a broadcast #MCE it will try to handle it with the obvious consequences. INIT/INIT clears CR4.MCE in the AP which will cause a broadcast #MCE to shut down the machine. So there is a choice between fire (HLT) and frying pan (INIT). Frying pan has been chosen as it's at least preventing the NMI issue. On systems which are not using INIT/INIT/STARTUP there is not much which can be done right now, but at least the obvious and easy to trigger MWAIT issue has been addressed. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZfpQTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoeZpD/9gSJN2qtGqoOgE8bWAenEeqppmBGFE EAhuhsvN1qG9JosUFo4KzxsGD/aWt2P6XglBDrGti8mFNol67jutmwWklntL3/ZR m8D6D+Pl7/CaDgACDTDbrnVC3lOGyMhD301yJrnBigS/SEoHeHI9UtadbHukuLQj TlKt5KtAnap15bE6QL846cDIptB9SjYLLPULo3i4azXEis/l6eAkffwAR6dmKlBh 2RbhLK1xPPG9nqWYjqZXnex09acKwD9xY9xHj4+GampV4UqHJRWfW0YtFs5ENi01 r3FVCdKEcvMkUw0zh0IAviBRs2vCI/R3YSfEc7P0264yn5WzMhAT+OGCovNjByiW sB4Iqa+Yf6aoBWwux6W4d22xu7uYhmFk/jiLyRZJPW/gvGZCZATT/x/T2hRoaYA8 3S0Rs7n/gbfvynQETgniifuM0bXRW0lEJAmn840GwyVQwlpDEPBJSwW4El49kbkc +dHxnmpMCfnBxfVLS1YDd4WOmkWBeECNcW330FShlQQ8mM3UG31+Q8Jc55Ze9SW0 w1h+IgIOHlA0DpQUUM8DJTSuxFx2piQsZxjOtzd70+BiKZpCsHqVLIp4qfnf+/GO gyP0cCQLbafpABbV9uVy8A/qgUGi0Qii0GJfCTy0OdmU+JX3C2C/gsM3uN0g3qAj vUhkuCXEGL5k1w== =KgZ0 -----END PGP SIGNATURE----- Merge tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 core updates from Thomas Gleixner: "A set of fixes for kexec(), reboot and shutdown issues: - Ensure that the WBINVD in stop_this_cpu() has been completed before the control CPU proceedes. stop_this_cpu() is used for kexec(), reboot and shutdown to park the APs in a HLT loop. The control CPU sends an IPI to the APs and waits for their CPU online bits to be cleared. Once they all are marked "offline" it proceeds. But stop_this_cpu() clears the CPU online bit before issuing WBINVD, which means there is no guarantee that the AP has reached the HLT loop. This was reported to cause intermittent reboot/shutdown failures due to some dubious interaction with the firmware. This is not only a problem of WBINVD. The code to actually "stop" the CPU which runs between clearing the online bit and reaching the HLT loop can cause large enough delays on its own (think virtualization). That's especially dangerous for kexec() as kexec() expects that all APs are in a safe state and not executing code while the boot CPU jumps to the new kernel. There are more issues vs kexec() which are addressed separately. Cure this by implementing an explicit synchronization point right before the AP reaches HLT. This guarantees that the AP has completed the full stop proceedure. - Fix the condition for WBINVD in stop_this_cpu(). The WBINVD in stop_this_cpu() is required for ensuring that when switching to or from memory encryption no dirty data is left in the cache lines which might cause a write back in the wrong more later. This checks CPUID directly because the feature bit might have been cleared due to a command line option. But that CPUID check accesses leaf 0x8000001f::EAX unconditionally. Intel CPUs return the content of the highest supported leaf when a non-existing leaf is read, while AMD CPUs return all zeros for unsupported leafs. So the result of the test on Intel CPUs is lottery and on AMD its just correct by chance. While harmless it's incorrect and causes the conditional wbinvd() to be issued where not required, which caused the above issue to be unearthed. - Make kexec() robust against AP code execution Ashok observed triple faults when doing kexec() on a system which had been booted with "nosmt". It turned out that the SMT siblings which had been brought up partially are parked in mwait_play_dead() to enable power savings. mwait_play_dead() is monitoring the thread flags of the AP's idle task, which has been chosen as it's unlikely to be written to. But kexec() can overwrite the previous kernel text and data including page tables etc. When it overwrites the cache lines monitored by an AP that AP resumes execution after the MWAIT on eventually overwritten text, stack and page tables, which obviously might end up in a triple fault easily. Make this more robust in several steps: 1) Use an explicit per CPU cache line for monitoring. 2) Write a command to these cache lines to kick APs out of MWAIT before proceeding with kexec(), shutdown or reboot. The APs confirm the wakeup by writing status back and then enter a HLT loop. 3) If the system uses INIT/INIT/STARTUP for AP bringup, park the APs in INIT state. HLT is not a guarantee that an AP won't wake up and resume execution. HLT is woken up by NMI and SMI. SMI puts the CPU back into HLT (+/- firmware bugs), but NMI is delivered to the CPU which executes the NMI handler. Same issue as the MWAIT scenario described above. Sending an INIT/INIT sequence to the APs puts them into wait for STARTUP state, which is safe against NMI. There is still an issue remaining which can't be fixed: #MCE If the AP sits in HLT and receives a broadcast #MCE it will try to handle it with the obvious consequences. INIT/INIT clears CR4.MCE in the AP which will cause a broadcast #MCE to shut down the machine. So there is a choice between fire (HLT) and frying pan (INIT). Frying pan has been chosen as it's at least preventing the NMI issue. On systems which are not using INIT/INIT/STARTUP there is not much which can be done right now, but at least the obvious and easy to trigger MWAIT issue has been addressed" * tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/smp: Put CPUs into INIT on shutdown if possible x86/smp: Split sending INIT IPI out into a helper function x86/smp: Cure kexec() vs. mwait_play_dead() breakage x86/smp: Use dedicated cache-line for mwait_play_dead() x86/smp: Remove pointless wmb()s from native_stop_other_cpus() x86/smp: Dont access non-existing CPUID leaf x86/smp: Make stop_other_cpus() more robust |
||
![]() |
9244724fbf |
A large update for SMP management:
- Parallel CPU bringup The reason why people are interested in parallel bringup is to shorten the (kexec) reboot time of cloud servers to reduce the downtime of the VM tenants. The current fully serialized bringup does the following per AP: 1) Prepare callbacks (allocate, intialize, create threads) 2) Kick the AP alive (e.g. INIT/SIPI on x86) 3) Wait for the AP to report alive state 4) Let the AP continue through the atomic bringup 5) Let the AP run the threaded bringup to full online state There are two significant delays: #3 The time for an AP to report alive state in start_secondary() on x86 has been measured in the range between 350us and 3.5ms depending on vendor and CPU type, BIOS microcode size etc. #4 The atomic bringup does the microcode update. This has been measured to take up to ~8ms on the primary threads depending on the microcode patch size to apply. On a two socket SKL server with 56 cores (112 threads) the boot CPU spends on current mainline about 800ms busy waiting for the APs to come up and apply microcode. That's more than 80% of the actual onlining procedure. This can be reduced significantly by splitting the bringup mechanism into two parts: 1) Run the prepare callbacks and kick the AP alive for each AP which needs to be brought up. The APs wake up, do their firmware initialization and run the low level kernel startup code including microcode loading in parallel up to the first synchronization point. (#1 and #2 above) 2) Run the rest of the bringup code strictly serialized per CPU (#3 - #5 above) as it's done today. Parallelizing that stage of the CPU bringup might be possible in theory, but it's questionable whether required surgery would be justified for a pretty small gain. If the system is large enough the first AP is already waiting at the first synchronization point when the boot CPU finished the wake-up of the last AP. That reduces the AP bringup time on that SKL from ~800ms to ~80ms, i.e. by a factor ~10x. The actual gain varies wildly depending on the system, CPU, microcode patch size and other factors. There are some opportunities to reduce the overhead further, but that needs some deep surgery in the x86 CPU bringup code. For now this is only enabled on x86, but the core functionality obviously works for all SMP capable architectures. - Enhancements for SMP function call tracing so it is possible to locate the scheduling and the actual execution points. That allows to measure IPI delivery time precisely. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZb/YTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoRoOD/9vAiGI3IhGyZcX/RjXxauSHf8Pmqll 05jUubFi5Vi3tKI1ubMOsnMmJTw2yy5xDyS/iGj7AcbRLq9uQd3iMtsXXHNBzo/X FNxnuWTXYUj0vcOYJ+j4puBumFzzpRCprqccMInH0kUnSWzbnaQCeelicZORAf+w zUYrswK4HpBXHDOnvPw6Z7MYQe+zyDQSwjSftstLyROzu+lCEw/9KUaysY2epShJ wHClxS2XqMnpY4rJ/CmJAlRhD0Plb89zXyo6k9YZYVDWoAcmBZy6vaTO4qoR171L 37ApqrgsksMkjFycCMnmrFIlkeb7bkrYDQ5y+xqC3JPTlYDKOYmITV5fZ83HD77o K7FAhl/CgkPq2Ec+d82GFLVBKR1rijbwHf7a0nhfUy0yMeaJCxGp4uQ45uQ09asi a/VG2T38EgxVdseC92HRhcdd3pipwCb5wqjCH/XdhdlQrk9NfeIeP+TxF4QhADhg dApp3ifhHSnuEul7+HNUkC6U+Zc8UeDPdu5lvxSTp2ooQ0JwaGgC5PJq3nI9RUi2 Vv826NHOknEjFInOQcwvp6SJPfcuSTF75Yx6xKz8EZ3HHxpvlolxZLq+3ohSfOKn 2efOuZO5bEu4S/G2tRDYcy+CBvNVSrtZmCVqSOS039c8quBWQV7cj0334cjzf+5T TRiSzvssbYYmaw== =Y8if -----END PGP SIGNATURE----- Merge tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull SMP updates from Thomas Gleixner: "A large update for SMP management: - Parallel CPU bringup The reason why people are interested in parallel bringup is to shorten the (kexec) reboot time of cloud servers to reduce the downtime of the VM tenants. The current fully serialized bringup does the following per AP: 1) Prepare callbacks (allocate, intialize, create threads) 2) Kick the AP alive (e.g. INIT/SIPI on x86) 3) Wait for the AP to report alive state 4) Let the AP continue through the atomic bringup 5) Let the AP run the threaded bringup to full online state There are two significant delays: #3 The time for an AP to report alive state in start_secondary() on x86 has been measured in the range between 350us and 3.5ms depending on vendor and CPU type, BIOS microcode size etc. #4 The atomic bringup does the microcode update. This has been measured to take up to ~8ms on the primary threads depending on the microcode patch size to apply. On a two socket SKL server with 56 cores (112 threads) the boot CPU spends on current mainline about 800ms busy waiting for the APs to come up and apply microcode. That's more than 80% of the actual onlining procedure. This can be reduced significantly by splitting the bringup mechanism into two parts: 1) Run the prepare callbacks and kick the AP alive for each AP which needs to be brought up. The APs wake up, do their firmware initialization and run the low level kernel startup code including microcode loading in parallel up to the first synchronization point. (#1 and #2 above) 2) Run the rest of the bringup code strictly serialized per CPU (#3 - #5 above) as it's done today. Parallelizing that stage of the CPU bringup might be possible in theory, but it's questionable whether required surgery would be justified for a pretty small gain. If the system is large enough the first AP is already waiting at the first synchronization point when the boot CPU finished the wake-up of the last AP. That reduces the AP bringup time on that SKL from ~800ms to ~80ms, i.e. by a factor ~10x. The actual gain varies wildly depending on the system, CPU, microcode patch size and other factors. There are some opportunities to reduce the overhead further, but that needs some deep surgery in the x86 CPU bringup code. For now this is only enabled on x86, but the core functionality obviously works for all SMP capable architectures. - Enhancements for SMP function call tracing so it is possible to locate the scheduling and the actual execution points. That allows to measure IPI delivery time precisely" * tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits) trace,smp: Add tracepoints for scheduling remotelly called functions trace,smp: Add tracepoints around remotelly called functions MAINTAINERS: Add CPU HOTPLUG entry x86/smpboot: Fix the parallel bringup decision x86/realmode: Make stack lock work in trampoline_compat() x86/smp: Initialize cpu_primary_thread_mask late cpu/hotplug: Fix off by one in cpuhp_bringup_mask() x86/apic: Fix use of X{,2}APIC_ENABLE in asm with older binutils x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it x86/smpboot: Support parallel startup of secondary CPUs x86/smpboot: Implement a bit spinlock to protect the realmode stack x86/apic: Save the APIC virtual base address cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE x86/apic: Provide cpu_primary_thread mask x86/smpboot: Enable split CPU startup cpu/hotplug: Provide a split up CPUHP_BRINGUP mechanism cpu/hotplug: Reset task stack state in _cpu_up() cpu/hotplug: Remove unused state functions riscv: Switch to hotplug core state synchronization parisc: Switch to hotplug core state synchronization ... |
||
![]() |
45e34c8af5 |
x86/smp: Put CPUs into INIT on shutdown if possible
Parking CPUs in a HLT loop is not completely safe vs. kexec() as HLT can resume execution due to NMI, SMI and MCE, which has the same issue as the MWAIT loop. Kicking the secondary CPUs into INIT makes this safe against NMI and SMI. A broadcast MCE will take the machine down, but a broadcast MCE which makes HLT resume and execute overwritten text, pagetables or data will end up in a disaster too. So chose the lesser of two evils and kick the secondary CPUs into INIT unless the system has installed special wakeup mechanisms which are not using INIT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ashok Raj <ashok.raj@intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230615193330.608657211@linutronix.de |
||
![]() |
6087dd5e86 |
x86/smp: Split sending INIT IPI out into a helper function
Putting CPUs into INIT is a safer place during kexec() to park CPUs. Split the INIT assert/deassert sequence out so it can be reused. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ashok Raj <ashok.raj@intel.com> Link: https://lore.kernel.org/r/20230615193330.551157083@linutronix.de |
||
![]() |
d7893093a7 |
x86/smp: Cure kexec() vs. mwait_play_dead() breakage
TLDR: It's a mess.
When kexec() is executed on a system with offline CPUs, which are parked in
mwait_play_dead() it can end up in a triple fault during the bootup of the
kexec kernel or cause hard to diagnose data corruption.
The reason is that kexec() eventually overwrites the previous kernel's text,
page tables, data and stack. If it writes to the cache line which is
monitored by a previously offlined CPU, MWAIT resumes execution and ends
up executing the wrong text, dereferencing overwritten page tables or
corrupting the kexec kernels data.
Cure this by bringing the offlined CPUs out of MWAIT into HLT.
Write to the monitored cache line of each offline CPU, which makes MWAIT
resume execution. The written control word tells the offlined CPUs to issue
HLT, which does not have the MWAIT problem.
That does not help, if a stray NMI, MCE or SMI hits the offlined CPUs as
those make it come out of HLT.
A follow up change will put them into INIT, which protects at least against
NMI and SMI.
Fixes:
|
||
![]() |
f9c9987bf5 |
x86/smp: Use dedicated cache-line for mwait_play_dead()
Monitoring idletask::thread_info::flags in mwait_play_dead() has been an obvious choice as all what is needed is a cache line which is not written by other CPUs. But there is a use case where a "dead" CPU needs to be brought out of MWAIT: kexec(). This is required as kexec() can overwrite text, pagetables, stacks and the monitored cacheline of the original kernel. The latter causes MWAIT to resume execution which obviously causes havoc on the kexec kernel which results usually in triple faults. Use a dedicated per CPU storage to prepare for that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ashok Raj <ashok.raj@intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230615193330.434553750@linutronix.de |
||
![]() |
8f2d6c41e5 |
x86/sched: Rewrite topology setup
Instead of having a number of fixed topologies to pick from; build one on the fly. This is both simpler now and simpler to extend in the future. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230601153522.GB559993%40hirez.programming.kicks-ass.net |
||
![]() |
ff3cfcb0d4 |
x86/smpboot: Fix the parallel bringup decision
The decision to allow parallel bringup of secondary CPUs checks
CC_ATTR_GUEST_STATE_ENCRYPT to detect encrypted guests. Those cannot use
parallel bootup because accessing the local APIC is intercepted and raises
a #VC or #VE, which cannot be handled at that point.
The check works correctly, but only for AMD encrypted guests. TDX does not
set that flag.
As there is no real connection between CC attributes and the inability to
support parallel bringup, replace this with a generic control flag in
x86_cpuinit and let SEV-ES and TDX init code disable it.
Fixes:
|
||
![]() |
0c7ffa32db |
x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it
Implement the validation function which tells the core code whether parallel bringup is possible. The only condition for now is that the kernel does not run in an encrypted guest as these will trap the RDMSR via #VC, which cannot be handled at that point in early startup. There was an earlier variant for AMD-SEV which used the GHBC protocol for retrieving the APIC ID via CPUID, but there is no guarantee that the initial APIC ID in CPUID is the same as the real APIC ID. There is no enforcement from the secure firmware and the hypervisor can assign APIC IDs as it sees fit as long as the ACPI/MADT table is consistent with that assignment. Unfortunately there is no RDMSR GHCB protocol at the moment, so enabling AMD-SEV guests for parallel startup needs some more thought. Intel-TDX provides a secure RDMSR hypercall, but supporting that is outside the scope of this change. Fixup announce_cpu() as e.g. on Hyper-V CPU1 is the secondary sibling of CPU0, which makes the @cpu == 1 logic in announce_cpu() fall apart. [ mikelley: Reported the announce_cpu() fallout Originally-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.467571745@linutronix.de |
||
![]() |
7e75178a09 |
x86/smpboot: Support parallel startup of secondary CPUs
In parallel startup mode the APs are kicked alive by the control CPU quickly after each other and run through the early startup code in parallel. The real-mode startup code is already serialized with a bit-spinlock to protect the real-mode stack. In parallel startup mode the smpboot_control variable obviously cannot contain the Linux CPU number so the APs have to determine their Linux CPU number on their own. This is required to find the CPUs per CPU offset in order to find the idle task stack and other per CPU data. To achieve this, export the cpuid_to_apicid[] array so that each AP can find its own CPU number by searching therein based on its APIC ID. Introduce a flag in the top bits of smpboot_control which indicates that the AP should find its CPU number by reading the APIC ID from the APIC. This is required because CPUID based APIC ID retrieval can only provide the initial APIC ID, which might have been overruled by the firmware. Some AMD APUs come up with APIC ID = initial APIC ID + 0x10, so the APIC ID to CPU number lookup would fail miserably if based on CPUID. Also virtualization can make its own APIC ID assignements. The only requirement is that the APIC IDs are consistent with the APCI/MADT table. For the boot CPU or in case parallel bringup is disabled the control bits are empty and the CPU number is directly available in bit 0-23 of smpboot_control. [ tglx: Initial proof of concept patch with bitlock and APIC ID lookup ] [ dwmw2: Rework and testing, commit message, CPUID 0x1 and CPU0 support ] [ seanc: Fix stray override of initial_gs in common_cpu_up() ] [ Oleksandr Natalenko: reported suspend/resume issue fixed in x86_acpi_suspend_lowlevel ] [ tglx: Make it read the APIC ID from the APIC instead of using CPUID, split the bitlock part out ] Co-developed-by: Thomas Gleixner <tglx@linutronix.de> Co-developed-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.411554373@linutronix.de |
||
![]() |
f54d4434c2 |
x86/apic: Provide cpu_primary_thread mask
Make the primary thread tracking CPU mask based in preparation for simpler handling of parallel bootup. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.186599880@linutronix.de |
||
![]() |
8b5a0f957c |
x86/smpboot: Enable split CPU startup
The x86 CPU bringup state currently does AP wake-up, wait for AP to respond and then release it for full bringup. It is safe to be split into a wake-up and and a separate wait+release state. Provide the required functions and enable the split CPU bringup, which prepares for parallel bringup, where the bringup of the non-boot CPUs takes two iterations: One to prepare and wake all APs and the second to wait and release them. Depending on timing this can eliminate the wait time completely. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.133453992@linutronix.de |
||
![]() |
2711b8e2b7 |
x86/smpboot: Switch to hotplug core state synchronization
The new AP state tracking and synchronization mechanism in the CPU hotplug core code allows to remove quite some x86 specific code: 1) The AP alive synchronization based on cpumasks 2) The decision whether an AP can be brought up again Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.529657366@linutronix.de |
||
![]() |
e464640cf7 |
x86/smpboot: Remove wait for cpu_online()
Now that the core code drops sparse_irq_lock after the idle thread synchronized, it's pointless to wait for the AP to mark itself online. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.316417181@linutronix.de |
||
![]() |
c8b7fb09d1 |
x86/smpboot: Remove cpu_callin_mask
Now that TSC synchronization is SMP function call based there is no reason to wait for the AP to be set in smp_callin_mask. The control CPU waits for the AP to set itself in the online mask anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.206394064@linutronix.de |
||
![]() |
9d349d47f0 |
x86/smpboot: Make TSC synchronization function call based
Spin-waiting on the control CPU until the AP reaches the TSC synchronization is just a waste especially in the case that there is no synchronization required. As the synchronization has to run with interrupts disabled the control CPU part can just be done from a SMP function call. The upcoming AP issues that call async only in the case that synchronization is required. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.148255496@linutronix.de |
||
![]() |
d4f28f07c2 |
x86/smpboot: Move synchronization masks to SMP boot code
The usage is in smpboot.c and not in the CPU initialization code. The XEN_PV usage of cpu_callout_mask is obsolete as cpu_init() not longer waits and cacheinfo has its own CPU mask now, so cpu_callout_mask can be made static too. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.091511483@linutronix.de |
||
![]() |
e94cd1503b |
x86/smpboot: Get rid of cpu_init_secondary()
The synchronization of the AP with the control CPU is a SMP boot problem and has nothing to do with cpu_init(). Open code cpu_init_secondary() in start_secondary() and move wait_for_master_cpu() into the SMP boot code. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.981999763@linutronix.de |
||
![]() |
2b3be65d2e |
x86/smpboot: Split up native_cpu_up() into separate phases and document them
There are four logical parts to what native_cpu_up() does on the BSP (or on the controlling CPU for a later hotplug): 1) Wake the AP by sending the INIT/SIPI/SIPI sequence. 2) Wait for the AP to make it as far as wait_for_master_cpu() which sets that CPU's bit in cpu_initialized_mask, then sets the bit in cpu_callout_mask to let the AP proceed through cpu_init(). 3) Wait for the AP to finish cpu_init() and get as far as the smp_callin() call, which sets that CPU's bit in cpu_callin_mask. 4) Perform the TSC synchronization and wait for the AP to actually mark itself online in cpu_online_mask. In preparation to allow these phases to operate in parallel on multiple APs, split them out into separate functions and document the interactions a little more clearly in both the BP and AP code paths. No functional change intended. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.928917242@linutronix.de |
||
![]() |
c7f15dd3f0 |
x86/smpboot: Remove unnecessary barrier()
Peter stumbled over the barrier() after the invocation of smp_callin() in start_secondary(): "...this barrier() and it's comment seem weird vs smp_callin(). That function ends with an atomic bitop (it has to, at the very least it must not be weaker than store-release) but also has an explicit wmb() to order setup vs CPU_STARTING. There is no way the smp_processor_id() referred to in this comment can land before cpu_init() even without the barrier()." The barrier() along with the comment was added in 2003 with commit d8f19f2cac70 ("[PATCH] x86-64 merge") in the history tree. One of those well documented combo patches of that time which changes world and some more. The context back then was: /* * Dont put anything before smp_callin(), SMP * booting is too fragile that we want to limit the * things done here to the most necessary things. */ cpu_init(); smp_callin(); + /* otherwise gcc will move up smp_processor_id before the cpu_init */ + barrier(); Dprintk("cpu %d: waiting for commence\n", smp_processor_id()); Even back in 2003 the compiler was not allowed to reorder that smp_processor_id() invocation before the cpu_init() function call. Especially not as smp_processor_id() resolved to: asm volatile("movl %%gs:%c1,%0":"=r" (ret__):"i"(pda_offset(field)):"memory"); There is no trace of this change in any mailing list archive including the back then official x86_64 list discuss@x86-64.org, which would explain the problem this change solved. The debug prints are gone by now and the the only smp_processor_id() invocation today is farther down in start_secondary() after locking vector_lock which itself prevents reordering. Even if the compiler would be allowed to reorder this, the code would still be correct as GSBASE is set up early in the assembly code and is valid when the CPU reaches start_secondary(), while the code at the time when this barrier was added did the GSBASE setup in cpu_init(). As the barrier has zero value, remove it. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.875713771@linutronix.de |
||
![]() |
5475abbde7 |
x86/smpboot: Remove the CPU0 hotplug kludge
This was introduced with commit
|
||
![]() |
134a12827b |
x86/smpboot: Avoid pointless delay calibration if TSC is synchronized
When TSC is synchronized across sockets then there is no reason to calibrate the delay for the first CPU which comes up on a socket. Just reuse the existing calibration value. This removes 100ms pointlessly wasted time from CPU hotplug per socket. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.608773568@linutronix.de |
||
![]() |
ba831b7b1a |
cpu/hotplug: Mark arch_disable_smp_support() and bringup_nonboot_cpus() __init
No point in keeping them around. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.551974164@linutronix.de |
||
![]() |
5107e3ebb8 |
x86/smpboot: Cleanup topology_phys_to_logical_pkg()/die()
Make topology_phys_to_logical_pkg_die() static as it's only used in smpboot.c and fixup the kernel-doc warnings for both functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.493750666@linutronix.de |
||
![]() |
044f0e27de |
x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid processors
Intel Meteor Lake hybrid processors have cores in two separate dies. The cores in one of the dies have higher maximum frequency. Use the SD_ASYM_ PACKING flag to give higher priority to the die with CPUs of higher maximum frequency. Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230406203148.19182-13-ricardo.neri-calderon@linux.intel.com |
||
![]() |
995998ebde |
x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags
There is no difference between any of the SMT siblings of a physical core. Do not do asym_packing load balancing at this level. Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20230406203148.19182-11-ricardo.neri-calderon@linux.intel.com |
||
![]() |
2aff7c706c |
Objtool changes for v6.4:
- Mark arch_cpu_idle_dead() __noreturn, make all architectures & drivers that did this inconsistently follow this new, common convention, and fix all the fallout that objtool can now detect statically. - Fix/improve the ORC unwinder becoming unreliable due to UNWIND_HINT_EMPTY ambiguity, split it into UNWIND_HINT_END_OF_STACK and UNWIND_HINT_UNDEFINED to resolve it. - Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code. - Generate ORC data for __pfx code - Add more __noreturn annotations to various kernel startup/shutdown/panic functions. - Misc improvements & fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRK1x0RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1ghxQ/+IkCynMYtdF5OG9YwbcGJqsPSfOPMEcEM pUSFYg+gGPBDT/fJfcVSqvUtdnWbLC2kXt9yiswXz3X3J2nmNkBk5YKQftsNDcul TmKeqIIAK51XTncpegKH0EGnOX63oZ9Vxa8CTPdDlb+YF23Km2FoudGRI9F5qbUd LoraXqGYeiaeySkGyWmZVl6Uc8dIxnMkTN3H/oI9aB6TOrsi059hAtFcSaFfyemP c4LqXXCH7k2baiQt+qaLZ8cuZVG/+K5r2N2cmjO5kmJc6ynIaFnfMe4XxZLjp5LT /PulYI15bXkvSARKx5CRh/CDHMOx5Blw+ASO0RhWbdy0WH4ZhhcaVF5AeIpPW86a 1LBcz97rMp72WmvKgrJeVO1r9+ll4SI6/YKGJRsxsCMdP3hgFpqntXyVjTFNdTM1 0gH6H5v55x06vJHvhtTk8SR3PfMTEM2fRU5jXEOrGowoGifx+wNUwORiwj6LE3KQ SKUdT19RNzoW3VkFxhgk65ThK1S7YsJUKRoac3YdhttpqqqtFV//erenrZoR4k/p vzvKy68EQ7RCNyD5wNWNFe0YjeJl5G8gQ8bUm4Xmab7djjgz+pn4WpQB8yYKJLAo x9dqQ+6eUbw3Hcgk6qQ9E+r/svbulnAL0AeALAWK/91DwnZ2mCzKroFkLN7napKi fRho4CqzrtM= =NwEV -----END PGP SIGNATURE----- Merge tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: - Mark arch_cpu_idle_dead() __noreturn, make all architectures & drivers that did this inconsistently follow this new, common convention, and fix all the fallout that objtool can now detect statically - Fix/improve the ORC unwinder becoming unreliable due to UNWIND_HINT_EMPTY ambiguity, split it into UNWIND_HINT_END_OF_STACK and UNWIND_HINT_UNDEFINED to resolve it - Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code - Generate ORC data for __pfx code - Add more __noreturn annotations to various kernel startup/shutdown and panic functions - Misc improvements & fixes * tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/hyperv: Mark hv_ghcb_terminate() as noreturn scsi: message: fusion: Mark mpt_halt_firmware() __noreturn x86/cpu: Mark {hlt,resume}_play_dead() __noreturn btrfs: Mark btrfs_assertfail() __noreturn objtool: Include weak functions in global_noreturns check cpu: Mark nmi_panic_self_stop() __noreturn cpu: Mark panic_smp_self_stop() __noreturn arm64/cpu: Mark cpu_park_loop() and friends __noreturn x86/head: Mark *_start_kernel() __noreturn init: Mark start_kernel() __noreturn init: Mark [arch_call_]rest_init() __noreturn objtool: Generate ORC data for __pfx code x86/linkage: Fix padding for typed functions objtool: Separate prefix code from stack validation code objtool: Remove superfluous dead_end_function() check objtool: Add symbol iteration helpers objtool: Add WARN_INSN() scripts/objdump-func: Support multiple functions context_tracking: Fix KCSAN noinstr violation objtool: Add stackleak instrumentation to uaccess safe list ... |
||
![]() |
52668badd3 |
x86/cpu: Mark {hlt,resume}_play_dead() __noreturn
Fixes the following warning: vmlinux.o: warning: objtool: resume_play_dead+0x21: unreachable instruction Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/ce1407c4bf88b1334fe40413126343792a77ca50.1681342859.git.jpoimboe@kernel.org |
||
![]() |
805ae9dc3b |
x86/smpboot: Reference count on smpboot_setup_warm_reset_vector()
When bringing up a secondary CPU from do_boot_cpu(), the warm reset flag is set in CMOS and the starting IP for the trampoline written inside the BDA at 0x467. Once the CPU is running, the CMOS flag is unset and the value in the BDA cleared. To allow for parallel bringup of CPUs, add a reference count to track the number of CPUs currently bring brought up, and clear the state only when the count reaches zero. Since the RTC spinlock is required to write to the CMOS, it can be used for mutual exclusion on the refcount too. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Usama Arif <usama.arif@bytedance.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Kim Phillips <kim.phillips@amd.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Link: https://lore.kernel.org/r/20230316222109.1940300-5-usama.arif@bytedance.com |
||
![]() |
8f6be6d870 |
x86/smpboot: Remove initial_gs
Given its CPU#, each CPU can find its own per-cpu offset, and directly set GSBASE accordingly. The global variable can be eliminated. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Usama Arif <usama.arif@bytedance.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Usama Arif <usama.arif@bytedance.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20230316222109.1940300-9-usama.arif@bytedance.com |
||
![]() |
c253b64020 |
x86/smpboot: Remove early_gdt_descr on 64-bit
Build the GDT descriptor on the stack instead. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Usama Arif <usama.arif@bytedance.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Usama Arif <usama.arif@bytedance.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20230316222109.1940300-8-usama.arif@bytedance.com |
||
![]() |
3adee777ad |
x86/smpboot: Remove initial_stack on 64-bit
In order to facilitate parallel startup, start to eliminate some of the global variables passing information to CPUs in the startup path. However, start by introducing one more: smpboot_control. For now this merely holds the CPU# of the CPU which is coming up. Each CPU can then find its own per-cpu data, and everything else it needs can be found from there, allowing the other global variables to be removed. First to be removed is initial_stack. Each CPU can load %rsp from its current_task->thread.sp instead. That is already set up with the correct idle thread for APs. Set up the .sp field in INIT_THREAD on x86 so that the BSP also finds a suitable stack pointer in the static per-cpu data when coming up on first boot. On resume from S3, the CPU needs a temporary stack because its idle task is already active. Instead of setting initial_stack, the sleep code can simply set its own current->thread.sp to point to the temporary stack. Nobody else cares about ->thread.sp for a thread which is currently on a CPU, because the true value is actually in the %rsp register. Which is restored with the rest of the CPU context in do_suspend_lowlevel(). Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Usama Arif <usama.arif@bytedance.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Usama Arif <usama.arif@bytedance.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20230316222109.1940300-7-usama.arif@bytedance.com |
||
![]() |
fcb3a81d22 |
x86/hotplug: Remove incorrect comment about mwait_play_dead()
The comment that says mwait_play_dead() returns only on failure is a bit misleading because mwait_play_dead() could actually return for valid reasons (such as mwait not being supported by the platform) that do not indicate a failure of the CPU offline operation. So, remove the comment. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230128003751.141317-1-srivatsa@csail.mit.edu |
||
![]() |
94a855111e |
- Add the call depth tracking mitigation for Retbleed which has
been long in the making. It is a lighterweight software-only fix for Skylake-based cores where enabling IBRS is a big hammer and causes a significant performance impact. What it basically does is, it aligns all kernel functions to 16 bytes boundary and adds a 16-byte padding before the function, objtool collects all functions' locations and when the mitigation gets applied, it patches a call accounting thunk which is used to track the call depth of the stack at any time. When that call depth reaches a magical, microarchitecture-specific value for the Return Stack Buffer, the code stuffs that RSB and avoids its underflow which could otherwise lead to the Intel variant of Retbleed. This software-only solution brings a lot of the lost performance back, as benchmarks suggest: https://lore.kernel.org/all/20220915111039.092790446@infradead.org/ That page above also contains a lot more detailed explanation of the whole mechanism - Implement a new control flow integrity scheme called FineIBT which is based on the software kCFI implementation and uses hardware IBT support where present to annotate and track indirect branches using a hash to validate them - Other misc fixes and cleanups -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOZp5EACgkQEsHwGGHe VUrZFxAAvi/+8L0IYSK4mKJvixGbTFjxN/Swo2JVOfs34LqGUT6JaBc+VUMwZxdb VMTFIZ3ttkKEodjhxGI7oGev6V8UfhI37SmO2lYKXpQVjXXnMlv/M+Vw3teE38CN gopi+xtGnT1IeWQ3tc/Tv18pleJ0mh5HKWiW+9KoqgXj0wgF9x4eRYDz1TDCDA/A iaBzs56j8m/FSykZHnrWZ/MvjKNPdGlfJASUCPeTM2dcrXQGJ93+X2hJctzDte0y Nuiw6Y0htfFBE7xoJn+sqm5Okr+McoUM18/CCprbgSKYk18iMYm3ZtAi6FUQZS1A ua4wQCf49loGp15PO61AS5d3OBf5D3q/WihQRbCaJvTVgPp9sWYnWwtcVUuhMllh ZQtBU9REcVJ/22bH09Q9CjBW0VpKpXHveqQdqRDViLJ6v/iI6EFGmD24SW/VxyRd 73k9MBGrL/dOf1SbEzdsnvcSB3LGzp0Om8o/KzJWOomrVKjBCJy16bwTEsCZEJmP i406m92GPXeaN1GhTko7vmF0GnkEdJs1GVCZPluCAxxbhHukyxHnrjlQjI4vC80n Ylc0B3Kvitw7LGJsPqu+/jfNHADC/zhx1qz/30wb5cFmFbN1aRdp3pm8JYUkn+l/ zri2Y6+O89gvE/9/xUhMohzHsWUO7xITiBavewKeTP9GSWybWUs= =cRy1 -----END PGP SIGNATURE----- Merge tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 core updates from Borislav Petkov: - Add the call depth tracking mitigation for Retbleed which has been long in the making. It is a lighterweight software-only fix for Skylake-based cores where enabling IBRS is a big hammer and causes a significant performance impact. What it basically does is, it aligns all kernel functions to 16 bytes boundary and adds a 16-byte padding before the function, objtool collects all functions' locations and when the mitigation gets applied, it patches a call accounting thunk which is used to track the call depth of the stack at any time. When that call depth reaches a magical, microarchitecture-specific value for the Return Stack Buffer, the code stuffs that RSB and avoids its underflow which could otherwise lead to the Intel variant of Retbleed. This software-only solution brings a lot of the lost performance back, as benchmarks suggest: https://lore.kernel.org/all/20220915111039.092790446@infradead.org/ That page above also contains a lot more detailed explanation of the whole mechanism - Implement a new control flow integrity scheme called FineIBT which is based on the software kCFI implementation and uses hardware IBT support where present to annotate and track indirect branches using a hash to validate them - Other misc fixes and cleanups * tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits) x86/paravirt: Use common macro for creating simple asm paravirt functions x86/paravirt: Remove clobber bitmask from .parainstructions x86/debug: Include percpu.h in debugreg.h to get DECLARE_PER_CPU() et al x86/cpufeatures: Move X86_FEATURE_CALL_DEPTH from bit 18 to bit 19 of word 11, to leave space for WIP X86_FEATURE_SGX_EDECCSSA bit x86/Kconfig: Enable kernel IBT by default x86,pm: Force out-of-line memcpy() objtool: Fix weak hole vs prefix symbol objtool: Optimize elf_dirty_reloc_sym() x86/cfi: Add boot time hash randomization x86/cfi: Boot time selection of CFI scheme x86/ibt: Implement FineIBT objtool: Add --cfi to generate the .cfi_sites section x86: Add prefix symbols for function padding objtool: Add option to generate prefix symbols objtool: Avoid O(bloody terrible) behaviour -- an ode to libelf objtool: Slice up elf_create_section_symbol() kallsyms: Revert "Take callthunks into account" x86: Unconfuse CONFIG_ and X86_FEATURE_ namespaces x86/retpoline: Fix crash printing warning x86/paravirt: Fix a !PARAVIRT build warning ... |
||
![]() |
3ef3ace4e2 |
- Split MTRR and PAT init code to accomodate at least Xen PV and TDX
guests which do not get MTRRs exposed but only PAT. (TDX guests do not support the cache disabling dance when setting up MTRRs so they fall under the same category.) This is a cleanup work to remove all the ugly workarounds for such guests and init things separately (Juergen Gross) - Add two new Intel CPUs to the list of CPUs with "normal" Energy Performance Bias, leading to power savings - Do not do bus master arbitration in C3 (ARB_DISABLE) on modern Centaur CPUs -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOYhIMACgkQEsHwGGHe VUpxug//ZKw3hYFroKhsULJi/e0j2nGARiSlJrJcFHl2vgh9yGvDsnYUyM/rgjgt cM3uCLbEG7nA6uhB3nupzaXZ8lBM1nU9kiEl/kjQ5oYf9nmJ48fLttvWGfxYN4s3 kj5fYVhlOZpntQXIWrwxnPqghUysumMnZmBJeKYiYNNfkj62l3xU2Ni4Gnjnp02I 9MmUhl7pj1aEyOQfM8rovy+wtYCg5WTOmXVlyVN+b9MwfYeK+stojvCZHxtJs9BD fezpJjjG+78xKUC7vVZXCh1p1N5Qvj014XJkVl9Hg0n7qizKFZRtqi8I769G2ptd exP8c2nDXKCqYzE8vK6ukWgDANQPs3d6Z7EqUKuXOCBF81PnMPSUMyNtQFGNM6Wp S5YSvFfCgUjp50IunOpvkDABgpM+PB8qeWUq72UFQJSOymzRJg/KXtE2X+qaMwtC 0i6VLXfMddGcmqNKDppfGtCjq2W5VrNIIJedtAQQGyl+pl3XzZeNomhJpm/0mVfJ 8UrlXZeXl/EUQ7qk40gC/Ash27pU9ZDx4CMNMy1jDIQqgufBjEoRIDSFqQlghmZq An5/BqMLhOMxUYNA7bRUnyeyxCBypetMdQt5ikBmVXebvBDmArXcuSNAdiy1uBFX KD8P3Y1AnsHIklxkLNyZRUy7fb4mgMFenUbgc0vmbYHbFl0C0pQ= =Zmgh -----END PGP SIGNATURE----- Merge tag 'x86_cpu_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu updates from Borislav Petkov: - Split MTRR and PAT init code to accomodate at least Xen PV and TDX guests which do not get MTRRs exposed but only PAT. (TDX guests do not support the cache disabling dance when setting up MTRRs so they fall under the same category) This is a cleanup work to remove all the ugly workarounds for such guests and init things separately (Juergen Gross) - Add two new Intel CPUs to the list of CPUs with "normal" Energy Performance Bias, leading to power savings - Do not do bus master arbitration in C3 (ARB_DISABLE) on modern Centaur CPUs * tag 'x86_cpu_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits) x86/mtrr: Make message for disabled MTRRs more descriptive x86/pat: Handle TDX guest PAT initialization x86/cpuid: Carve out all CPUID functionality x86/cpu: Switch to cpu_feature_enabled() for X86_FEATURE_XENPV x86/cpu: Remove X86_FEATURE_XENPV usage in setup_cpu_entry_area() x86/cpu: Drop 32-bit Xen PV guest code in update_task_stack() x86/cpu: Remove unneeded 64-bit dependency in arch_enter_from_user_mode() x86/cpufeatures: Add X86_FEATURE_XENPV to disabled-features.h x86/acpi/cstate: Optimize ARB_DISABLE on Centaur CPUs x86/mtrr: Simplify mtrr_ops initialization x86/cacheinfo: Switch cache_ap_init() to hotplug callback x86: Decouple PAT and MTRR handling x86/mtrr: Add a stop_machine() handler calling only cache_cpu_init() x86/mtrr: Let cache_aps_delayed_init replace mtrr_aps_delayed_init x86/mtrr: Get rid of __mtrr_enabled bool x86/mtrr: Simplify mtrr_bp_init() x86/mtrr: Remove set_all callback from struct mtrr_ops x86/mtrr: Disentangle MTRR init from PAT init x86/mtrr: Move cache control code to cacheinfo.c x86/mtrr: Split MTRR-specific handling from cache dis/enabling ... |
||
![]() |
b3883a9a1f |
stackprotector: move get_random_canary() into stackprotector.h
This has nothing to do with random.c and everything to do with stack protectors. Yes, it uses randomness. But many things use randomness. random.h and random.c are concerned with the generation of randomness, not with each and every use. So move this function into the more specific stackprotector.h file where it belongs. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> |
||
![]() |
30f89e524b |
x86/cacheinfo: Switch cache_ap_init() to hotplug callback
Instead of explicitly calling cache_ap_init() in identify_secondary_cpu() use a CPU hotplug callback instead. By registering the callback only after having started the non-boot CPUs and initializing cache_aps_delayed_init with "true", calling set_cache_aps_delayed_init() at boot time can be dropped. It should be noted that this change results in cache_ap_init() being called a little bit later when hotplugging CPUs. By using a new hotplug slot right at the start of the low level bringup this is not problematic, as no operations requiring a specific caching mode are performed that early in CPU initialization. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20221102074713.21493-15-jgross@suse.com Signed-off-by: Borislav Petkov <bp@suse.de> |
||
![]() |
0b9a6a8bed |
x86/mtrr: Add a stop_machine() handler calling only cache_cpu_init()
Instead of having a stop_machine() handler for either a specific MTRR register or all state at once, add a handler just for calling cache_cpu_init() if appropriate. Add functions for calling stop_machine() with this handler as well. Add a generic replacement for mtrr_bp_restore() and a wrapper for mtrr_bp_init(). Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20221102074713.21493-13-jgross@suse.com Signed-off-by: Borislav Petkov <bp@suse.de> |
||
![]() |
955d0e0805 |
x86/mtrr: Let cache_aps_delayed_init replace mtrr_aps_delayed_init
In order to prepare decoupling MTRR and PAT replace the MTRR-specific mtrr_aps_delayed_init flag with a more generic cache_aps_delayed_init one. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20221102074713.21493-12-jgross@suse.com Signed-off-by: Borislav Petkov <bp@suse.de> |
||
![]() |
c063a217bc |
x86/percpu: Move current_top_of_stack next to current_task
Extend the struct pcpu_hot cacheline with current_top_of_stack; another very frequently used value. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111145.493038635@infradead.org |
||
![]() |
e57ef2ed97 |
x86: Put hot per CPU variables into a struct
The layout of per-cpu variables is at the mercy of the compiler. This can lead to random performance fluctuations from build to build. Create a structure to hold some of the hottest per-cpu variables, starting with current_task. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111145.179707194@infradead.org |
||
![]() |
1f19e2d50b |
x86/cpu: Get rid of redundant switch_to_new_gdt() invocations
The only place where switch_to_new_gdt() is required is early boot to switch from the early GDT to the direct GDT. Any other invocation is completely redundant because it does not change anything. Secondary CPUs come out of the ASM code with GDT and GSBASE correctly set up. The same is true for XEN_PV. Remove all the voodoo invocations which are left overs from the ancient past, rename the function to switch_gdt_and_percpu_base() and mark it init. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111143.198076128@infradead.org |
||
![]() |
38bef8e57f |
smp: add set_nr_cpu_ids()
In preparation to support compile-time nr_cpu_ids, add a setter for the variable. This is a no-op for all arches. Signed-off-by: Yury Norov <yury.norov@gmail.com> |
||
![]() |
adbcaef840 |
x86/cacheinfo: move shared cache map definitions
Patch series "cpumask: Fix invalid uniprocessor assumptions", v4. On uniprocessor builds, it is currently assumed that any cpumask will contain the single CPU: cpu0. This assumption is used to provide optimised implementations. The current assumption also appears to be wrong, by ignoring the fact that users can provide empty cpumasks. This can result in bugs as explained in [1] - for_each_cpu() will run one iteration of the loop even when passed an empty cpumask. This series introduces some basic tests, and updates the optimisations for uniprocessor builds. The x86 patch was written after the kernel test robot [2] ran into a failed build. I have tried to list the files potentially affected by the changes to cpumask.h, in an attempt to find any other cases that fail on !SMP. I've gone through some of the files manually, and ran a few cross builds, but nothing else popped up. I (build) checked about half of the potientally affected files, but I do not have the resources to do them all. I hope we can fix other issues if/when they pop up later. [1] https://lore.kernel.org/all/20220530082552.46113-1-sander@svanheule.net/ [2] https://lore.kernel.org/all/202206060858.wA0FOzRy-lkp@intel.com/ This patch (of 5): The maps to keep track of shared caches between CPUs on SMP systems are declared in asm/smp.h, among them specifically cpu_llc_shared_map. These maps are externally defined in cpu/smpboot.c. The latter is only compiled on CONFIG_SMP=y, which means the declared extern symbols from asm/smp.h do not have a corresponding definition on uniprocessor builds. The inline cpu_llc_shared_mask() function from asm/smp.h refers to the map declaration mentioned above. This function is referenced in cacheinfo.c inside for_each_cpu() loop macros, to provide cpumask for the loop. On uniprocessor builds, the symbol for the cpu_llc_shared_map does not exist. However, the current implementation of for_each_cpu() also (wrongly) ignores the provided mask. By sheer luck, the compiler thus optimises out this unused reference to cpu_llc_shared_map, and the linker therefore does not require the cpu_llc_shared_mask to actually exist on uniprocessor builds. Only on SMP bulids does smpboot.o exist to provide the required symbols. To no longer rely on compiler optimisations for successful uniprocessor builds, move the definitions of cpu_llc_shared_map and cpu_l2c_shared_map from smpboot.c to cacheinfo.c. Link: https://lkml.kernel.org/r/cover.1656777646.git.sander@svanheule.net Link: https://lkml.kernel.org/r/e8167ddb570f56744a3dc12c2149a660a324d969.1656777646.git.sander@svanheule.net Signed-off-by: Sander Vanheule <sander@svanheule.net> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Marco Elver <elver@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Yury Norov <yury.norov@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
a13dc4d409 |
- Serious sanitization and cleanup of the whole APERF/MPERF and
frequency invariance code along with removing the need for unnecessary IPIs - Finally remove a.out support - The usual trivial cleanups and fixes all over x86 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLn48ACgkQEsHwGGHe VUpbkg/+PELrc0y/qxLM/+dyftKYY16Rhk6ZVAXfwqlh5ldyVQcLMUgKwDqYyTn2 XmgdI3cTcFlH2K7j6ANWLu0I9NPaviimUcEdMVcXt7aY5mGWk/q4hIyCYM8d41sV qKx4OjNSdyoofG6MtwFLJDuoeVg99Bqgvm4nP9BuxL0dZJ2hfcUZ7MTxYCx9ZYjK /3trx0NV287Yg/wm91EU0nLQzy9xbGS7WCmMnse6uxiUdm2vXbBt8oNFF4f747Dj 0cArfNrMgYq4Cv5bgt/Ki0NU/n4EOGDpJUSyQwlnjDKeN81ESPy7IWtTQ6cE/rJK BZeUIPiGiYHwtqXv0UTAPGLG8cAqKeab8u0xAOyrFVDkTc0+WlPJRsUAOmRRGIGE M8ZjoxrLeuFgxw6vKpVjaA+mDRj3qEpSH+IrTcekS98PN7gmVzvq03GobgGbT7YB xmtbThJa+514FfUVckkyC0+A56BknUIgVxwFPqrthE2atzYTbH67hW4U0yVWXXr7 2VI7ttozBrYVgHCWhD9eoT0uhyD74Vl6pqHnqzY9ShIfKVUGvMgKHHg04nLLtF7W hm87xV3Q5UEmXhTmDzT1rUZ99mBUxGbWxk227I9raMugIh7pp9wIr57+7O0LRYfX TdnE2+tL8RMi7+XzRH5iLhnwkrvahBESeHSQ7GVI1Y2zMmmFN+0= =Dks/ -----END PGP SIGNATURE----- Merge tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Borislav Petkov: - Serious sanitization and cleanup of the whole APERF/MPERF and frequency invariance code along with removing the need for unnecessary IPIs - Finally remove a.out support - The usual trivial cleanups and fixes all over x86 * tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) x86: Remove empty files x86/speculation: Add missing srbds=off to the mitigations= help text x86/prctl: Remove pointless task argument x86/aperfperf: Make it correct on 32bit and UP kernels x86/aperfmperf: Integrate the fallback code from show_cpuinfo() x86/aperfmperf: Replace arch_freq_get_on_cpu() x86/aperfmperf: Replace aperfmperf_get_khz() x86/aperfmperf: Store aperf/mperf data for cpu frequency reads x86/aperfmperf: Make parts of the frequency invariance code unconditional x86/aperfmperf: Restructure arch_scale_freq_tick() x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct x86/aperfmperf: Untangle Intel and AMD frequency invariance init x86/aperfmperf: Separate AP/BP frequency invariance init x86/smp: Move APERF/MPERF code where it belongs x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu() x86/process: Fix kernel-doc warning due to a changed function name x86: Remove a.out support x86/mm: Replace nodes_weight() with nodes_empty() where appropriate x86: Replace cpumask_weight() with cpumask_empty() where appropriate x86/pkeys: Remove __arch_set_user_pkey_access() declaration ... |
||
![]() |
3a755ebcc2 |
Intel Trust Domain Extensions
This is the Intel version of a confidential computing solution called Trust Domain Extensions (TDX). This series adds support to run the kernel as part of a TDX guest. It provides similar guest protections to AMD's SEV-SNP like guest memory and register state encryption, memory integrity protection and a lot more. Design-wise, it differs from AMD's solution considerably: it uses a software module which runs in a special CPU mode called (Secure Arbitration Mode) SEAM. As the name suggests, this module serves as sort of an arbiter which the confidential guest calls for services it needs during its lifetime. Just like AMD's SNP set, this series reworks and streamlines certain parts of x86 arch code so that this feature can be properly accomodated. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLbisACgkQEsHwGGHe VUqZLg/7B55iygCwzz0W/KLcXL2cISatUpzGbFs1XTbE9DMz06BPkOsEjF2k8ckv kfZjgqhSx3GvUI80gK0Tn2M2DfIj3nKuNSXd1pfextP7AxEf68FFJsQz1Ju7bHpT pZaG+g8IK4+mnEHEKTCO9ANg/Zw8yqJLdtsCaCNE9SUGUfQ6m/ujTEfsambXDHNm khyCAgpIGSOt51/4apoR9ebyrNCaeVbDawpIPjTy+iyFRc/WyaLFV9CQ8klw4gbw r/90x2JYxvAf0/z/ifT9Wa+TnYiQ0d4VjFbfr0iJ4GcPn5L3EIoIKPE8vPGMpoSX fLSzoNmAOT3ja57ytUUQ3o0edoRUIPEdixOebf9qWvE/aj7W37YRzrlJ8Ej/x9Jy HcI4WZF6Dr1bh6FnI/xX2eVZRzLOL4j9gNyPCwIbvgr1NjDqQnxU7nhxVMmQhJrs IdiEcP5WYerLKfka/uF//QfWUg5mDBgFa1/3xK57Z3j0iKWmgjaPpR0SWlOKjj8G tr0gGN9ejikZTqXKGsHn8fv/R3bjXvbVD8z0IEcx+MIrRmZPnX2QBlg7UA1AXV5n HoVwPFdH1QAtjZq1MRcL4hTOjz3FkS68rg7ZH0f2GWJAzWmEGytBIhECRnN/PFFq VwRB4dCCt0bzqRxkiH5lzdgR+xqRe61juQQsMzg+Flv/trpXDqM= =ac9K -----END PGP SIGNATURE----- Merge tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull Intel TDX support from Borislav Petkov: "Intel Trust Domain Extensions (TDX) support. This is the Intel version of a confidential computing solution called Trust Domain Extensions (TDX). This series adds support to run the kernel as part of a TDX guest. It provides similar guest protections to AMD's SEV-SNP like guest memory and register state encryption, memory integrity protection and a lot more. Design-wise, it differs from AMD's solution considerably: it uses a software module which runs in a special CPU mode called (Secure Arbitration Mode) SEAM. As the name suggests, this module serves as sort of an arbiter which the confidential guest calls for services it needs during its lifetime. Just like AMD's SNP set, this series reworks and streamlines certain parts of x86 arch code so that this feature can be properly accomodated" * tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits) x86/tdx: Fix RETs in TDX asm x86/tdx: Annotate a noreturn function x86/mm: Fix spacing within memory encryption features message x86/kaslr: Fix build warning in KASLR code in boot stub Documentation/x86: Document TDX kernel architecture ACPICA: Avoid cache flush inside virtual machines x86/tdx/ioapic: Add shared bit for IOAPIC base address x86/mm: Make DMA memory shared for TD guest x86/mm/cpa: Add support for TDX shared memory x86/tdx: Make pages shared in ioremap() x86/topology: Disable CPU online/offline control for TDX guests x86/boot: Avoid #VE during boot for TDX platforms x86/boot: Set CR0.NE early and keep it set during the boot x86/acpi/x86/boot: Add multiprocessor wake-up support x86/boot: Add a trampoline for booting APs via firmware handoff x86/tdx: Wire up KVM hypercalls x86/tdx: Port I/O: Add early boot support x86/tdx: Port I/O: Add runtime hypercalls x86/boot: Port I/O: Add decompression-time support for TDX x86/boot: Port I/O: Allow to hook up alternative helpers ... |
||
![]() |
bb6e89df90 |
x86/aperfmperf: Make parts of the frequency invariance code unconditional
The frequency invariance support is currently limited to x86/64 and SMP, which is the vast majority of machines. arch_scale_freq_tick() is called every tick on all CPUs and reads the APERF and MPERF MSRs. The CPU frequency getters function do the same via dedicated IPIs. While it could be argued that on systems where frequency invariance support is disabled (32bit, !SMP) the per tick read of the APERF and MPERF MSRs can be avoided, it does not make sense to keep the extra code and the resulting runtime issues of mass IPIs around. As a first step split out the non frequency invariance specific initialization code and the read MSR portion of arch_scale_freq_tick(). The rest of the code is still conditional and guarded with a static key. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20220415161206.761988704@linutronix.de |
||
![]() |
0dfaf3f6ec |
x86/aperfmperf: Untangle Intel and AMD frequency invariance init
AMD boot CPU initialization happens late via ACPI/CPPC which prevents the Intel parts from being marked __init. Split out the common code and provide a dedicated interface for the AMD initialization and mark the Intel specific code and data __init. The remaining text size is almost cut in half: text: 2614 -> 1350 init.text: 0 -> 786 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20220415161206.592465719@linutronix.de |
||
![]() |
138a7f9c6b |
x86/aperfmperf: Separate AP/BP frequency invariance init
This code is convoluted and because it can be invoked post init via the ACPI/CPPC code, all of the initialization functionality is built in instead of being part of init text and init data. As a first step create separate calls for the boot and the application processors. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20220415161206.536733494@linutronix.de |
||
![]() |
55cb0b7074 |
x86/smp: Move APERF/MPERF code where it belongs
as this can share code with the preexisting APERF/MPERF code. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20220415161206.478362457@linutronix.de |
||
![]() |
ff2e64684f |
x86/boot: Add a trampoline for booting APs via firmware handoff
Historically, x86 platforms have booted secondary processors (APs) using INIT followed by the start up IPI (SIPI) messages. In regular VMs, this boot sequence is supported by the VMM emulation. But such a wakeup model is fatal for secure VMs like TDX in which VMM is an untrusted entity. To address this issue, a new wakeup model was added in ACPI v6.4, in which firmware (like TDX virtual BIOS) will help boot the APs. More details about this wakeup model can be found in ACPI specification v6.4, the section titled "Multiprocessor Wakeup Structure". Since the existing trampoline code requires processors to boot in real mode with 16-bit addressing, it will not work for this wakeup model (because it boots the AP in 64-bit mode). To handle it, extend the trampoline code to support 64-bit mode firmware handoff. Also, extend IDT and GDT pointers to support 64-bit mode hand off. There is no TDX-specific detection for this new boot method. The kernel will rely on it as the sole boot method whenever the new ACPI structure is present. The ACPI table parser for the MADT multiprocessor wake up structure and the wakeup method that uses this structure will be added by the following patch in this series. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-21-kirill.shutemov@linux.intel.com |
||
![]() |
0afb6b660a |
x86/sev: Use SEV-SNP AP creation to start secondary CPUs
To provide a more secure way to start APs under SEV-SNP, use the SEV-SNP AP Creation NAE event. This allows for guest control over the AP register state rather than trusting the hypervisor with the SEV-ES Jump Table address. During native_smp_prepare_cpus(), invoke an SEV-SNP function that, if SEV-SNP is active, will set/override apic->wakeup_secondary_cpu. This will allow the SEV-SNP AP Creation NAE event method to be used to boot the APs. As a result of installing the override when SEV-SNP is active, this method of starting the APs becomes the required method. The override function will fail to start the AP if the hypervisor does not have support for AP creation. [ bp: Work in forgotten review comments. ] Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220307213356.2797205-23-brijesh.singh@amd.com |
||
![]() |
eb5616d4ad |
x86/ACPI: CPPC: Move init_freq_invariance_cppc() into x86 CPPC
The init_freq_invariance_cppc code actually doesn't need the SMP functionality. So setting the CONFIG_SMP as the check condition for init_freq_invariance_cppc may cause the confusion to misunderstand the CPPC. And the x86 CPPC file is better space to store the CPPC related functions, while the init_freq_invariance_cppc is out of smpboot, that means, the CONFIG_SMP won't be mandatory condition any more. And It's more clear than before. Signed-off-by: Huang Rui <ray.huang@amd.com> [ rjw: Subject adjustment ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
![]() |
666f6ecf35 |
x86: Expose init_freq_invariance() to topology header
The function init_freq_invariance will be used on x86 CPPC, so expose it in the topology header. Signed-off-by: Huang Rui <ray.huang@amd.com> [ rjw: Subject adjustment ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
![]() |
82d8936914 |
x86/ACPI: CPPC: Move AMD maximum frequency ratio setting function into x86 CPPC
The AMD maximum frequency ratio setting function depends on CPPC, so the x86 CPPC implementation file is better space for this function. Signed-off-by: Huang Rui <ray.huang@amd.com> [ rjw: Subject adjustment ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
![]() |
cabdc3a847 |
sched,x86: Don't use cluster topology for x86 hybrid CPUs
For x86 hybrid CPUs like Alder Lake, the order of CPU selection should
be based strictly on CPU priority. Don't include cluster topology for
hybrid CPUs to avoid interference with such CPU selection order.
On Alder Lake, the Atom CPU cluster has more capacity (4 Atom CPUs) vs
Big core cluster (2 hyperthread CPUs). This could potentially bias CPU
selection towards Atom over Big Core, when Big core CPU has higher
priority.
Fixes:
|
||
![]() |
ce2612b670 |
x86/smp: Factor out parts of native_smp_prepare_cpus()
Commit |
||
![]() |
18398bb825 |
The usual round of random minor fixes and cleanups all over the place.
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/ucwACgkQEsHwGGHe VUpLMQ//d4xim4zD4hQVleYkGWqA2nB050QtutIto1nvsiZdjrUSMjJGZnos2nLd 9tY3NZtgrFfAyjUkal098L+2zed+U6UemIV6kT1F3TnWg4dYByxYABNutOsQGUgw o4sTwjG7ELC273yPt/WY9TMwfMCiX7t80QkjoSeWbkApdfB0aZoxB0CvdLBKwCl/ bxdfX1uvqW7sc6fatcI634hC1HDw8GJThym4/lrMHq2Pr8n/U6pEWoBFsdlprnLk pqb3IGX3kNnpjTmCpZxvd4ZQV8xUlMcJkdEFjKDf7BLtWjwZxPIdGcfnxrpf2EJQ yVZklcabaBNz/zNkoQeyD6Ix1ZCFSxcHRhg0BJpvvhzQ91My2pGZgLuzUYz3Fk7G GjWZje8WZcL3ViL9oGbOYMLSw76wov95+8WMiyKqPaNuzZbS3py5C/ThgqpCdg5b WyQe0GhUvthzLsVz9Gu7OFrbZl6VBz8q7/bxuo+vpFhgC1EiOj2yPSZNUJBRKdcd cFSfybcjk3Qyf7YXmZ/NcD9TQARQO1ediRY6ZNeZr7JYPzyebY+wTfHqDvdX65S5 i/zgeAX4XAuX4pl28nJvDe8x1P7t5T8L6Qno9Lnd1xMG7jWift9RSEOo29rUp0sw gA9xV/BsmApvyM8pgD/lAqxAFzGkYfSy8bB6uav8HccHprVfJE0= =4BVs -----END PGP SIGNATURE----- Merge tag 'x86_cleanups_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Borislav Petkov: "The usual round of random minor fixes and cleanups all over the place" * tag 'x86_cleanups_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/Makefile: Remove unneeded whitespaces before tabs x86/of: Kill unused early_init_dt_scan_chosen_arch() x86: Fix misspelled Kconfig symbols x86/Kconfig: Remove references to obsolete Kconfig symbols x86/smp: Remove unnecessary assignment to local var freq_scale |
||
![]() |
8cb1ae19bf |
x86/fpu updates:
- Cleanup of extable fixup handling to be more robust, which in turn allows to make the FPU exception fixups more robust as well. - Change the return code for signal frame related failures from explicit error codes to a boolean fail/success as that's all what the calling code evaluates. - A large refactoring of the FPU code to prepare for adding AMX support: - Distangle the public header maze and remove especially the misnomed kitchen sink internal.h which is despite it's name included all over the place. - Add a proper abstraction for the register buffer storage (struct fpstate) which allows to dynamically size the buffer at runtime by flipping the pointer to the buffer container from the default container which is embedded in task_struct::tread::fpu to a dynamically allocated container with a larger register buffer. - Convert the code over to the new fpstate mechanism. - Consolidate the KVM FPU handling by moving the FPU related code into the FPU core which removes the number of exports and avoids adding even more export when AMX has to be supported in KVM. This also removes duplicated code which was of course unnecessary different and incomplete in the KVM copy. - Simplify the KVM FPU buffer handling by utilizing the new fpstate container and just switching the buffer pointer from the user space buffer to the KVM guest buffer when entering vcpu_run() and flipping it back when leaving the function. This cuts the memory requirements of a vCPU for FPU buffers in half and avoids pointless memory copy operations. This also solves the so far unresolved problem of adding AMX support because the current FPU buffer handling of KVM inflicted a circular dependency between adding AMX support to the core and to KVM. With the new scheme of switching fpstate AMX support can be added to the core code without affecting KVM. - Replace various variables with proper data structures so the extra information required for adding dynamically enabled FPU features (AMX) can be added in one place - Add AMX (Advanved Matrix eXtensions) support (finally): AMX is a large XSTATE component which is going to be available with Saphire Rapids XEON CPUs. The feature comes with an extra MSR (MSR_XFD) which allows to trap the (first) use of an AMX related instruction, which has two benefits: 1) It allows the kernel to control access to the feature 2) It allows the kernel to dynamically allocate the large register state buffer instead of burdening every task with the the extra 8K or larger state storage. It would have been great to gain this kind of control already with AVX512. The support comes with the following infrastructure components: 1) arch_prctl() to - read the supported features (equivalent to XGETBV(0)) - read the permitted features for a task - request permission for a dynamically enabled feature Permission is granted per process, inherited on fork() and cleared on exec(). The permission policy of the kernel is restricted to sigaltstack size validation, but the syscall obviously allows further restrictions via seccomp etc. 2) A stronger sigaltstack size validation for sys_sigaltstack(2) which takes granted permissions and the potentially resulting larger signal frame into account. This mechanism can also be used to enforce factual sigaltstack validation independent of dynamic features to help with finding potential victims of the 2K sigaltstack size constant which is broken since AVX512 support was added. 3) Exception handling for #NM traps to catch first use of a extended feature via a new cause MSR. If the exception was caused by the use of such a feature, the handler checks permission for that feature. If permission has not been granted, the handler sends a SIGILL like the #UD handler would do if the feature would have been disabled in XCR0. If permission has been granted, then a new fpstate which fits the larger buffer requirement is allocated. In the unlikely case that this allocation fails, the handler sends SIGSEGV to the task. That's not elegant, but unavoidable as the other discussed options of preallocation or full per task permissions come with their own set of horrors for kernel and/or userspace. So this is the lesser of the evils and SIGSEGV caused by unexpected memory allocation failures is not a fundamentally new concept either. When allocation succeeds, the fpstate properties are filled in to reflect the extended feature set and the resulting sizes, the fpu::fpstate pointer is updated accordingly and the trap is disarmed for this task permanently. 4) Enumeration and size calculations 5) Trap switching via MSR_XFD The XFD (eXtended Feature Disable) MSR is context switched with the same life time rules as the FPU register state itself. The mechanism is keyed off with a static key which is default disabled so !AMX equipped CPUs have zero overhead. On AMX enabled CPUs the overhead is limited by comparing the tasks XFD value with a per CPU shadow variable to avoid redundant MSR writes. In case of switching from a AMX using task to a non AMX using task or vice versa, the extra MSR write is obviously inevitable. All other places which need to be aware of the variable feature sets and resulting variable sizes are not affected at all because they retrieve the information (feature set, sizes) unconditonally from the fpstate properties. 6) Enable the new AMX states Note, this is relatively new code despite the fact that AMX support is in the works for more than a year now. The big refactoring of the FPU code, which allowed to do a proper integration has been started exactly 3 weeks ago. Refactoring of the existing FPU code and of the original AMX patches took a week and has been subject to extensive review and testing. The only fallout which has not been caught in review and testing right away was restricted to AMX enabled systems, which is completely irrelevant for anyone outside Intel and their early access program. There might be dragons lurking as usual, but so far the fine grained refactoring has held up and eventual yet undetected fallout is bisectable and should be easily addressable before the 5.16 release. Famous last words... Many thanks to Chang Bae and Dave Hansen for working hard on this and also to the various test teams at Intel who reserved extra capacity to follow the rapid development of this closely which provides the confidence level required to offer this rather large update for inclusion into 5.16-rc1. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/NkITHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYodDkEADH4+/nN/QoSUHIuuha5Zptj3g2b16a /3TxT9fhwPen/kzMGsUk70s3iWJMA+I5dCfkSZexJ2hfhcRe9cBzZIa1HCawKwf3 YCISTsO/M+LpeORuZ+TpfFLJKnxNr1SEOl+EYffGhq0AkCjifb9Cnr0JZuoMUzGU jpfJZ2bj28ri5lG812DtzSMBM9E3SAwgJv+GNjmZbxZKb9mAfhbAMdBUXHirX7Ej jmx6koQjYOKwYIW8w1BrdC270lUKQUyJTbQgdRkN9Mh/HnKyFixQ18JqGlgaV2cT EtYePUfTEdaHdAhUINLIlEug1MfOslHU+HyGsdywnoChNB4GHPQuePC5Tz60VeFN RbQ9aKcBUu8r95rjlnKtAtBijNMA4bjGwllVxNwJ/ZoA9RPv1SbDZ07RX3qTaLVY YhVQl8+shD33/W24jUTJv1kMMexpHXIlv0gyfMryzpwI7uzzmGHRPAokJdbYKctC dyMPfdE90rxTiMUdL/1IQGhnh3awjbyfArzUhHyQ++HyUyzCFh0slsO0CD18vUy8 FofhCugGBhjuKw3XwLNQ+KsWURz5qHctSzBc3qMOSyqFHbAJCVRANkhsFvWJo2qL 75+Z7OTRebtsyOUZIdq26r4roSxHrps3dupWTtN70HWx2NhQG1nLEw986QYiQu1T hcKvDmehQLrUvg== =x3WL -----END PGP SIGNATURE----- Merge tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu updates from Thomas Gleixner: - Cleanup of extable fixup handling to be more robust, which in turn allows to make the FPU exception fixups more robust as well. - Change the return code for signal frame related failures from explicit error codes to a boolean fail/success as that's all what the calling code evaluates. - A large refactoring of the FPU code to prepare for adding AMX support: - Distangle the public header maze and remove especially the misnomed kitchen sink internal.h which is despite it's name included all over the place. - Add a proper abstraction for the register buffer storage (struct fpstate) which allows to dynamically size the buffer at runtime by flipping the pointer to the buffer container from the default container which is embedded in task_struct::tread::fpu to a dynamically allocated container with a larger register buffer. - Convert the code over to the new fpstate mechanism. - Consolidate the KVM FPU handling by moving the FPU related code into the FPU core which removes the number of exports and avoids adding even more export when AMX has to be supported in KVM. This also removes duplicated code which was of course unnecessary different and incomplete in the KVM copy. - Simplify the KVM FPU buffer handling by utilizing the new fpstate container and just switching the buffer pointer from the user space buffer to the KVM guest buffer when entering vcpu_run() and flipping it back when leaving the function. This cuts the memory requirements of a vCPU for FPU buffers in half and avoids pointless memory copy operations. This also solves the so far unresolved problem of adding AMX support because the current FPU buffer handling of KVM inflicted a circular dependency between adding AMX support to the core and to KVM. With the new scheme of switching fpstate AMX support can be added to the core code without affecting KVM. - Replace various variables with proper data structures so the extra information required for adding dynamically enabled FPU features (AMX) can be added in one place - Add AMX (Advanced Matrix eXtensions) support (finally): AMX is a large XSTATE component which is going to be available with Saphire Rapids XEON CPUs. The feature comes with an extra MSR (MSR_XFD) which allows to trap the (first) use of an AMX related instruction, which has two benefits: 1) It allows the kernel to control access to the feature 2) It allows the kernel to dynamically allocate the large register state buffer instead of burdening every task with the the extra 8K or larger state storage. It would have been great to gain this kind of control already with AVX512. The support comes with the following infrastructure components: 1) arch_prctl() to - read the supported features (equivalent to XGETBV(0)) - read the permitted features for a task - request permission for a dynamically enabled feature Permission is granted per process, inherited on fork() and cleared on exec(). The permission policy of the kernel is restricted to sigaltstack size validation, but the syscall obviously allows further restrictions via seccomp etc. 2) A stronger sigaltstack size validation for sys_sigaltstack(2) which takes granted permissions and the potentially resulting larger signal frame into account. This mechanism can also be used to enforce factual sigaltstack validation independent of dynamic features to help with finding potential victims of the 2K sigaltstack size constant which is broken since AVX512 support was added. 3) Exception handling for #NM traps to catch first use of a extended feature via a new cause MSR. If the exception was caused by the use of such a feature, the handler checks permission for that feature. If permission has not been granted, the handler sends a SIGILL like the #UD handler would do if the feature would have been disabled in XCR0. If permission has been granted, then a new fpstate which fits the larger buffer requirement is allocated. In the unlikely case that this allocation fails, the handler sends SIGSEGV to the task. That's not elegant, but unavoidable as the other discussed options of preallocation or full per task permissions come with their own set of horrors for kernel and/or userspace. So this is the lesser of the evils and SIGSEGV caused by unexpected memory allocation failures is not a fundamentally new concept either. When allocation succeeds, the fpstate properties are filled in to reflect the extended feature set and the resulting sizes, the fpu::fpstate pointer is updated accordingly and the trap is disarmed for this task permanently. 4) Enumeration and size calculations 5) Trap switching via MSR_XFD The XFD (eXtended Feature Disable) MSR is context switched with the same life time rules as the FPU register state itself. The mechanism is keyed off with a static key which is default disabled so !AMX equipped CPUs have zero overhead. On AMX enabled CPUs the overhead is limited by comparing the tasks XFD value with a per CPU shadow variable to avoid redundant MSR writes. In case of switching from a AMX using task to a non AMX using task or vice versa, the extra MSR write is obviously inevitable. All other places which need to be aware of the variable feature sets and resulting variable sizes are not affected at all because they retrieve the information (feature set, sizes) unconditonally from the fpstate properties. 6) Enable the new AMX states Note, this is relatively new code despite the fact that AMX support is in the works for more than a year now. The big refactoring of the FPU code, which allowed to do a proper integration has been started exactly 3 weeks ago. Refactoring of the existing FPU code and of the original AMX patches took a week and has been subject to extensive review and testing. The only fallout which has not been caught in review and testing right away was restricted to AMX enabled systems, which is completely irrelevant for anyone outside Intel and their early access program. There might be dragons lurking as usual, but so far the fine grained refactoring has held up and eventual yet undetected fallout is bisectable and should be easily addressable before the 5.16 release. Famous last words... Many thanks to Chang Bae and Dave Hansen for working hard on this and also to the various test teams at Intel who reserved extra capacity to follow the rapid development of this closely which provides the confidence level required to offer this rather large update for inclusion into 5.16-rc1 * tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits) Documentation/x86: Add documentation for using dynamic XSTATE features x86/fpu: Include vmalloc.h for vzalloc() selftests/x86/amx: Add context switch test selftests/x86/amx: Add test cases for AMX state management x86/fpu/amx: Enable the AMX feature in 64-bit mode x86/fpu: Add XFD handling for dynamic states x86/fpu: Calculate the default sizes independently x86/fpu/amx: Define AMX state components and have it used for boot-time checks x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers x86/fpu/xstate: Add fpstate_realloc()/free() x86/fpu/xstate: Add XFD #NM handler x86/fpu: Update XFD state where required x86/fpu: Add sanity checks for XFD x86/fpu: Add XFD state to fpstate x86/msr-index: Add MSRs for XFD x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit x86/fpu: Reset permission and fpstate on exec() x86/fpu: Prepare fpu_clone() for dynamically enabled features x86/fpu/signal: Prepare for variable sigframe length x86/signal: Use fpu::__state_user_size for sigalt stack validation ... |
||
![]() |
55409ac5c3 |
sched,x86: Fix L2 cache mask
Currently AMD/Hygon do not populate l2c_id, this means that for SMT enabled systems they report an L2 per thread. This is ofcourse not true but was harmless so far. However, since commit: |
||
![]() |
b56d2795b2 |
x86/fpu: Replace the includes of fpu/internal.h
Now that the file is empty, fixup all references with the proper includes and delete the former kitchen sink. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211015011540.001197214@linutronix.de |
||
![]() |
66558b730f |
sched: Add cluster scheduler level for x86
There are x86 CPU architectures (e.g. Jacobsville) where L2 cahce is shared among a cluster of cores instead of being exclusive to one single core. To prevent oversubscription of L2 cache, load should be balanced between such L2 clusters, especially for tasks with no shared data. On benchmark such as SPECrate mcf test, this change provides a boost to performance especially on medium load system on Jacobsville. on a Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4 cores each, the benchmark number is as follow: Improvement over baseline kernel for mcf_r copies run time base rate 1 -0.1% -0.2% 6 25.1% 25.1% 12 18.8% 19.0% 24 0.3% 0.3% So this looks pretty good. In terms of the system's task distribution, some pretty bad clumping can be seen for the vanilla kernel without the L2 cluster domain for the 6 and 12 copies case. With the extra domain for cluster, the load does get evened out between the clusters. Note this patch isn't an universal win as spreading isn't necessarily a win, particually for those workload who can benefit from packing. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210924085104.44806-4-21cnbao@gmail.com |
||
![]() |
85784470ef |
x86/smp: Remove unnecessary assignment to local var freq_scale
Coverity warns of an unused value in arch_scale_freq_tick():
CID 100778 (#1 of 1): Unused value (UNUSED_VALUE)
assigned_value: Assigning value 1024ULL to freq_scale here, but that stored
value is overwritten before it can be used.
It was introduced by commit:
|
||
![]() |
c52787b590 |
x86/smp: Add a per-cpu view of SMT state
A new field smt_active in cpuinfo_x86 identifies if the current core/cpu is in SMT mode or not. This is helpful when the system has some of its cores with threads offlined and can be used for cases where action is taken based on the state of SMT. The upcoming support for paranoid L1D flush will make use of this information. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Balbir Singh <sblbir@amazon.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210108121056.21940-2-sblbir@amazon.com |
||
![]() |
e5a0fc4e20 |
CPU setup code changes:
- Clean up & simplify AP exception handling setup. - Consolidate the disjoint IDT setup code living in idt_setup_traps() and idt_setup_ist_traps() into a single idt_setup_traps() initialization function and call it before cpu_init(). Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZdu0RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gWgA//ecu2FqCFT2gpZP7ABdJqAhtYA8f7rg/a BMKNNfTha/L9Dot2PSqMq8oVi8NA++EbKjuWeFujAIoU/2vso+YrmHc84O35nOOF u2Zgps7UK+ffKo11Yu1Vpb911ctJClAwL0CemsC30QDbpGVHPecLuxOIgY6E6BwC qhNqNLp0K4bFRq0ya27O8RPiz9LjCzUILHHvWSAl5m5tWqovED8aXdjrDJcFXqwY u9nuuRpUpQWqCldZP9X7+pdo4Z2HZjvIBjqHD/wl3VMjV6q+k+su6AjV9p1D8hoz otY96i8MQjD/sgIa1H+tUc2ZusGzDls+EpYiGaPmqeXMitKEwOFpVDAaT8SelUms bR4VQ9IYB1NG7Qbco3NQHMV1sWuvUJcLG6ILYFWXgH0hP1EDHFn/TvOn0rfJysbE AmCpwmUo0b8Bj6nbKkVcXxoX1FdeqiM5+cPxHxGVgxVoR0Umz13EX4y4cBzSIRht eYwT6H1CxR9a4TIr8cMBsN14QsnV3f6lv/RNfVdmZEJVVr0boRI90L2xMLBB9RkP z03g7VvfMuSWnKyOFheP4ae9ul2qxAT380+g1oHQH0XIFtj9yIhzJHpoUCzhgCra Ui2Z71Dhq0R1UNpPsPfc1XkQI9chiahn8gc1u2zvN4SzZa6DZH22VvGNK0ghoIxq 5WFho50hNIk= =BPbv -----END PGP SIGNATURE----- Merge tag 'x86-apic-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 exception handling updates from Ingo Molnar: - Clean up & simplify AP exception handling setup. - Consolidate the disjoint IDT setup code living in idt_setup_traps() and idt_setup_ist_traps() into a single idt_setup_traps() initialization function and call it before cpu_init(). * tag 'x86-apic-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/idt: Rework IDT setup for boot CPU x86/cpu: Init AP exception handling from cpu_init_secondary() |
||
![]() |
a9e906b71f |
Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
b1efd0ff4b |
x86/cpu: Init AP exception handling from cpu_init_secondary()
SEV-ES guests require properly setup task register with which the TSS descriptor in the GDT can be located so that the IST-type #VC exception handler which they need to function properly, can be executed. This setup needs to happen before attempting to load microcode in ucode_cpu_init() on secondary CPUs which can cause such #VC exceptions. Simplify the machinery by running that exception setup from a new function cpu_init_secondary() and explicitly call cpu_init_exception_handling() for the boot CPU before cpu_init(). The latter prepares for fixing and simplifying the exception/IST setup on the boot CPU. There should be no functional changes resulting from this patch. [ tglx: Reworked it so cpu_init_exception_handling() stays seperate ] Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lai Jiangshan <laijs@linux.alibaba.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/87k0o6gtvu.ffs@nanos.tec.linutronix.de |
||
![]() |
3743d55b28 |
x86, sched: Fix the AMD CPPC maximum performance value on certain AMD Ryzen generations
Some AMD Ryzen generations has different calculation method on maximum performance. 255 is not for all ASICs, some specific generations should use 166 as the maximum performance. Otherwise, it will report incorrect frequency value like below: ~ → lscpu | grep MHz CPU MHz: 3400.000 CPU max MHz: 7228.3198 CPU min MHz: 2200.0000 [ mingo: Tidied up whitespace use. ] [ Alexander Monakov <amonakov@ispras.ru>: fix 225 -> 255 typo. ] Fixes: |
||
![]() |
f1a0a376ca |
sched/core: Initialize the idle task with preemption disabled
As pointed out by commit
|
||
![]() |
3cf4524ce4 |
x86/smpboot: Remove duplicate includes
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210427063835.9039-1-wanjiabing@vivo.com |
||
![]() |
c6536676c7 |
- turn the stack canary into a normal __percpu variable on 32-bit which
gets rid of the LAZY_GS stuff and a lot of code. - Add an insn_decode() API which all users of the instruction decoder should preferrably use. Its goal is to keep the details of the instruction decoder away from its users and simplify and streamline how one decodes insns in the kernel. Convert its users to it. - kprobes improvements and fixes - Set the maximum DIE per package variable on Hygon - Rip out the dynamic NOP selection and simplify all the machinery around selecting NOPs. Use the simplified NOPs in objtool now too. - Add Xeon Sapphire Rapids to list of CPUs that support PPIN - Simplify the retpolines by folding the entire thing into an alternative now that objtool can handle alternatives with stack ops. Then, have objtool rewrite the call to the retpoline with the alternative which then will get patched at boot time. - Document Intel uarch per models in intel-family.h - Make Sub-NUMA Clustering topology the default and Cluster-on-Die the exception on Intel. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCHyJQACgkQEsHwGGHe VUpjiRAAwPZdwwp08ypZuMHR4EhLNru6gYhbAoALGgtYnQjLtn5onQhIeieK+R4L cmZpxHT9OFp5dXHk4kwygaQBsD4pPOiIpm60kye1dN3cSbOORRdkwEoQMpKMZ+5Y kvVsmn7lrwRbp600KdE4G6L5+N6gEgr0r6fMFWWGK3mgVAyCzPexVHgydcp131ch iYMo6/pPDcNkcV/hboVKgx7GISdQ7L356L1MAIW/Sxtw6uD/X4qGYW+kV2OQg9+t nQDaAo7a8Jqlop5W5TQUdMLKQZ1xK8SFOSX/nTS15DZIOBQOGgXR7Xjywn1chBH/ PHLwM5s4XF6NT5VlIA8tXNZjWIZTiBdldr1kJAmdDYacrtZVs2LWSOC0ilXsd08Z EWtvcpHfHEqcuYJlcdALuXY8xDWqf6Q2F7BeadEBAxwnnBg+pAEoLXI/1UwWcmsj wpaZTCorhJpYo2pxXckVdHz2z0LldDCNOXOjjaWU8tyaOBKEK6MgAaYU7e0yyENv mVc9n5+WuvXuivC6EdZ94Pcr/KQsd09ezpJYcVfMDGv58YZrb6XIEELAJIBTu2/B Ua8QApgRgetx+1FKb8X6eGjPl0p40qjD381TADb4rgETPb1AgKaQflmrSTIik+7p O+Eo/4x/GdIi9jFk3K+j4mIznRbUX0cheTJgXoiI4zXML9Jv94w= =bm4S -----END PGP SIGNATURE----- Merge tag 'x86_core_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 updates from Borislav Petkov: - Turn the stack canary into a normal __percpu variable on 32-bit which gets rid of the LAZY_GS stuff and a lot of code. - Add an insn_decode() API which all users of the instruction decoder should preferrably use. Its goal is to keep the details of the instruction decoder away from its users and simplify and streamline how one decodes insns in the kernel. Convert its users to it. - kprobes improvements and fixes - Set the maximum DIE per package variable on Hygon - Rip out the dynamic NOP selection and simplify all the machinery around selecting NOPs. Use the simplified NOPs in objtool now too. - Add Xeon Sapphire Rapids to list of CPUs that support PPIN - Simplify the retpolines by folding the entire thing into an alternative now that objtool can handle alternatives with stack ops. Then, have objtool rewrite the call to the retpoline with the alternative which then will get patched at boot time. - Document Intel uarch per models in intel-family.h - Make Sub-NUMA Clustering topology the default and Cluster-on-Die the exception on Intel. * tag 'x86_core_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits) x86, sched: Treat Intel SNC topology as default, COD as exception x86/cpu: Comment Skylake server stepping too x86/cpu: Resort and comment Intel models objtool/x86: Rewrite retpoline thunk calls objtool: Skip magical retpoline .altinstr_replacement objtool: Cache instruction relocs objtool: Keep track of retpoline call sites objtool: Add elf_create_undef_symbol() objtool: Extract elf_symbol_add() objtool: Extract elf_strtab_concat() objtool: Create reloc sections implicitly objtool: Add elf_create_reloc() helper objtool: Rework the elf_rebuild_reloc_section() logic objtool: Fix static_call list generation objtool: Handle per arch retpoline naming objtool: Correctly handle retpoline thunk calls x86/retpoline: Simplify retpolines x86/alternatives: Optimize optimize_nops() x86: Add insn_decode_kernel() x86/kprobes: Move 'inline' to the beginning of the kprobe_is_ss() declaration ... |
||
![]() |
ea5bc7b977 |
Trivial cleanups and fixes all over the place.
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCGmYIACgkQEsHwGGHe VUr45w/8CSXr7MXaFBj4To0hTWJXSZyF6YGqlZOSJXFcFh4cWTNwfVOoFaV47aDo +HsCNTkGENcKhLrDUWDRiG/Uo46jxtOtl1vhq7U4pGemSYH871XWOKfb5k5XNMwn /uhaHMI4aEfd6bUFnF518NeyRIsD0BdqFj4tB7RbAiyFwdETDX9Tkj/uBKnQ4zon 4tEDoXgThuK5YKK9zVQg5pa7aFp2zg1CAdX/WzBkS8BHVBPXSV0CF97AJYQOM/V+ lUHv+BN3wp97GYHPQMPsbkNr8IuFoe2mIvikwjxg8iOFpzEU1G1u09XV9R+PXByX LclFTRqK/2uU5hJlcsBiKfUuidyErYMRYImbMAOREt2w0ogWVu2zQ7HkjVve25h1 sQPwPudbAt6STbqRxvpmB3yoV4TCYwnF91FcWgEy+rcEK2BDsHCnScA45TsK5I1C kGR1K17pHXprgMZFPveH+LgxewB6smDv+HllxQdSG67LhMJXcs2Epz0TsN8VsXw8 dlD3lGReK+5qy9FTgO7mY0xhiXGz1IbEdAPU4eRBgih13puu03+jqgMaMabvBWKD wax+BWJUrPtetwD5fBPhlS/XdJDnd8Mkv2xsf//+wT0s4p+g++l1APYxeB8QEehm Pd7Mvxm4GvQkfE13QEVIPYQRIXCMH/e9qixtY5SHUZDBVkUyFM0= =bO1i -----END PGP SIGNATURE----- Merge tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 cleanups from Borislav Petkov: "Trivial cleanups and fixes all over the place" * tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS: Remove me from IDE/ATAPI section x86/pat: Do not compile stubbed functions when X86_PAT is off x86/asm: Ensure asm/proto.h can be included stand-alone x86/platform/intel/quark: Fix incorrect kernel-doc comment syntax in files x86/msr: Make locally used functions static x86/cacheinfo: Remove unneeded dead-store initialization x86/process/64: Move cpu_current_top_of_stack out of TSS tools/turbostat: Unmark non-kernel-doc comment x86/syscalls: Fix -Wmissing-prototypes warnings from COND_SYSCALL() x86/fpu/math-emu: Fix function cast warning x86/msr: Fix wr/rdmsr_safe_regs_on_cpu() prototypes x86: Fix various typos in comments, take #2 x86: Remove unusual Unicode characters from comments x86/kaslr: Return boolean values from a function returning bool x86: Fix various typos in comments x86/setup: Remove unused RESERVE_BRK_ARRAY() stacktrace: Move documentation for arch_stack_walk_reliable() to header x86: Remove duplicate TSC DEADLINE MSR definitions |
||
![]() |
2c88d45edb |
x86, sched: Treat Intel SNC topology as default, COD as exception
Commit
|
||
![]() |
fa26d0c778 |
ACPI: processor: Fix build when CONFIG_ACPI_PROCESSOR=m
Commit |
||
![]() |
8cdddd182b |
ACPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()
Commit |
||
![]() |
d9f6e12fb0 |
x86: Fix various typos in comments
Fix ~144 single-word typos in arch/x86/ code comments. Doing this in a single commit should reduce the churn. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-kernel@vger.kernel.org |
||
![]() |
d11a1d08a0 |
cpufreq: ACPI: Update arch scale-invariance max perf ratio if CPPC is not there
If the maximum performance level taken for computing the arch_max_freq_ratio value used in the x86 scale-invariance code is higher than the one corresponding to the cpuinfo.max_freq value coming from the acpi_cpufreq driver, the scale-invariant utilization falls below 100% even if the CPU runs at cpuinfo.max_freq or slightly faster, which causes the schedutil governor to select a frequency below cpuinfo.max_freq. That frequency corresponds to a frequency table entry below the maximum performance level necessary to get to the "boost" range of CPU frequencies which prevents "boost" frequencies from being used in some workloads. While this issue is related to scale-invariance, it may be amplified by commit |
||
![]() |
9c7d9017a4 |
x86: PM: Register syscore_ops for scale invariance
On x86 scale invariace tends to be disabled during resume from
suspend-to-RAM, because the MPERF or APERF MSR values are not as
expected then due to updates taking place after the platform
firmware has been invoked to complete the suspend transition.
That, of course, is not desirable, especially if the schedutil
scaling governor is in use, because the lack of scale invariance
causes it to be less reliable.
To counter that effect, modify init_freq_invariance() to register
a syscore_ops object for scale invariance with the ->resume callback
pointing to init_counter_refs() which will run on the CPU starting
the resume transition (the other CPUs will be taken care of the
"online" operations taking place later).
Fixes:
|
||
![]() |
148842c98a |
Yet another large set of x86 interrupt management updates:
- Simplification and distangling of the MSI related functionality - Let IO/APIC construct the RTE entries from an MSI message instead of having IO/APIC specific code in the interrupt remapping drivers - Make the retrieval of the parent interrupt domain (vector or remap unit) less hardcoded and use the relevant irqdomain callbacks for selection. - Allow the handling of more than 255 CPUs without a virtualized IOMMU when the hypervisor supports it. This has made been possible by the above modifications and also simplifies the existing workaround in the HyperV specific virtual IOMMU. - Cleanup of the historical timer_works() irq flags related inconsistencies. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/Xxd8THHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoYpOD/9C5TppNlPMUyx2SflH6bxt37pJEpln +hYTKsk+jSThntr5mfj+GifGvgmHOVBTGnlDUnUnrpN7TQmLFBzwTOtnBLW53AO2 16/u0+Xci4LNCtEkaymf0Rq4MfsfriXHPJr0A/CnZ0tpHSf5QKHAiitSiGujdMlb gbq43+zXd+jNkH7vkOLPX/7dZVI1hNASQEevJu2tRR4xYTuXFdBxvLgYkHtYKKrK R1sbs6nI6yIzye2u4m4xGu29SxgUft+zdUf+UehJKM3yFmf51d9qpkX+kLaTWuaL VPsMItbn0kdvxwXQWO6DYnIAAnVKCklyHQJTZCoNq9Fe91OoByak1CEVspSOa1av JmycNSch4IYWasR4vVCB1gbb+V9SejcKu5SV3CDrEDqwkOIpfiqpriUXSCJTLlFd QOEDOLuuk/79Qs//J/tb/nJ4IuKv8WPudDfIlMro8wUsAr67DjD4mnXprZ+svwWx Ct/0/Memk+BSa0cw6pvg24BUZGN6zrufkBu2HKT9GOXRUdNkdLkiPhT8mK4T/O0l f90QCLjPSOJ/K/pLEWdUHEPmgC5Q9RsXOmwVGqX+RbjfP7mYTJXlmWnBb+cFNch0 xFIH3SxVGylxxT06NX3SkvinrHj10CoAlmneefBlLtx6dF+2P84DAMZSF0OFToVI c2KMg5zoesI4bg== =8Gfs -----END PGP SIGNATURE----- Merge tag 'x86-apic-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 apic updates from Thomas Gleixner: "Yet another large set of x86 interrupt management updates: - Simplification and distangling of the MSI related functionality - Let IO/APIC construct the RTE entries from an MSI message instead of having IO/APIC specific code in the interrupt remapping drivers - Make the retrieval of the parent interrupt domain (vector or remap unit) less hardcoded and use the relevant irqdomain callbacks for selection. - Allow the handling of more than 255 CPUs without a virtualized IOMMU when the hypervisor supports it. This has made been possible by the above modifications and also simplifies the existing workaround in the HyperV specific virtual IOMMU. - Cleanup of the historical timer_works() irq flags related inconsistencies" * tag 'x86-apic-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits) x86/ioapic: Cleanup the timer_works() irqflags mess iommu/hyper-v: Remove I/O-APIC ID check from hyperv_irq_remapping_select() iommu/amd: Fix IOMMU interrupt generation in X2APIC mode iommu/amd: Don't register interrupt remapping irqdomain when IR is disabled iommu/amd: Fix union of bitfields in intcapxt support x86/ioapic: Correct the PCI/ISA trigger type selection x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it x86/kvm: Enable 15-bit extension when KVM_FEATURE_MSI_EXT_DEST_ID detected iommu/hyper-v: Disable IRQ pseudo-remapping if 15 bit APIC IDs are available x86/apic: Support 15 bits of APIC ID in MSI where available x86/ioapic: Handle Extended Destination ID field in RTE iommu/vt-d: Simplify intel_irq_remapping_select() x86: Kill all traces of irq_remapping_get_irq_domain() x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain x86/hpet: Use irq_find_matching_fwspec() to find remapping irqdomain iommu/hyper-v: Implement select() method on remapping irqdomain iommu/vt-d: Implement select() method on remapping irqdomain iommu/amd: Implement select() method on remapping irqdomain x86/apic: Add select() method on vector irqdomain ... |
||
![]() |
adb35e8dc9 |
Scheduler updates:
- migrate_disable/enable() support which originates from the RT tree and is now a prerequisite for the new preemptible kmap_local() API which aims to replace kmap_atomic(). - A fair amount of topology and NUMA related improvements - Improvements for the frequency invariant calculations - Enhanced robustness for the global CPU priority tracking and decision making - The usual small fixes and enhancements all over the place -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XwK4THHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoX28D/9cVrvziSQGfBfuQWnUiw8iOIq1QBa2 Me+Tvenhfrlt7xU6rbP9ciFu7eTN+fS06m5uQPGI+t22WuJmHzbmw1bJVXfkvYfI /QoU+Hg7DkDAn1p7ZKXh0dRkV0nI9ixxSHl0E+Zf1ATBxCUMV2SO85flg6z/4qJq 3VWUye0dmR7/bhtkIjv5rwce9v2JB2g1AbgYXYTW9lHVoUdGoMSdiZAF4tGyHLnx sJ6DMqQ+k+dmPyYO0z5MTzjW/fXit4n9w2e3z9TvRH/uBu58WSW1RBmQYX6aHBAg dhT9F4lvTs6lJY23x5RSFWDOv6xAvKF5a0xfb8UZcyH5EoLYrPRvm42a0BbjdeRa u0z7LbwIlKA+RFdZzFZWz8UvvO0ljyMjmiuqZnZ5dY9Cd80LSBuxrWeQYG0qg6lR Y2povhhCepEG+q8AXIe2YjHKWKKC1s/l/VY3CNnCzcd21JPQjQ4Z5eWGmHif5IED CntaeFFhZadR3w02tkX35zFmY3w4soKKrbI4EKWrQwd+cIEQlOSY7dEPI/b5BbYj MWAb3P4EG9N77AWTNmbhK4nN0brEYb+rBbCA+5dtNBVhHTxAC7OTWElJOC2O66FI e06dREjvwYtOkRUkUguWwErbIai2gJ2MH0VILV3hHoh64oRk7jjM8PZYnjQkdptQ Gsq0rJW5iiu/OQ== =Oz1V -----END PGP SIGNATURE----- Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Thomas Gleixner: - migrate_disable/enable() support which originates from the RT tree and is now a prerequisite for the new preemptible kmap_local() API which aims to replace kmap_atomic(). - A fair amount of topology and NUMA related improvements - Improvements for the frequency invariant calculations - Enhanced robustness for the global CPU priority tracking and decision making - The usual small fixes and enhancements all over the place * tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits) sched/fair: Trivial correction of the newidle_balance() comment sched/fair: Clear SMT siblings after determining the core is not idle sched: Fix kernel-doc markup x86: Print ratio freq_max/freq_base used in frequency invariance calculations x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC x86, sched: Calculate frequency invariance for AMD systems irq_work: Optimize irq_work_single() smp: Cleanup smp_call_function*() irq_work: Cleanup sched: Limit the amount of NUMA imbalance that can exist at fork time sched/numa: Allow a floating imbalance between NUMA nodes sched: Avoid unnecessary calculation of load imbalance at clone time sched/numa: Rename nr_running and break out the magic number sched: Make migrate_disable/enable() independent of RT sched/topology: Condition EAS enablement on FIE support arm64: Rebuild sched domains on invariance status changes sched/topology,schedutil: Wrap sched domains rebuild sched/uclamp: Allow to reset a task uclamp constraint value sched/core: Fix typos in comments Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug ... |
||
![]() |
3149cd5530 |
x86: Print ratio freq_max/freq_base used in frequency invariance calculations
The value freq_max/freq_base is a fundamental component of frequency invariance calculations. It may come from a variety of sources such as MSRs or ACPI data, tracking it down when troubleshooting a system could be non-trivial. It is worth saving it in the kernel logs. # dmesg | grep 'Estimated ratio of average max' [ 14.024036] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1289 Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20201112182614.10700-4-ggherdovich@suse.cz |
||
![]() |
976df7e573 |
x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
Frequency invariant accounting calculations need the ratio freq_curr/freq_max, but freq_max is unknown as it depends on dynamic power allocation between cores: AMD EPYC CPUs implement "Core Performance Boost". Three candidates are considered to estimate this value: - maximum non-boost frequency - maximum boost frequency - the mid point between the above two Experimental data on an AMD EPYC Zen2 machine slightly favors the third option, which is applied with this patch. The analysis uses the ondemand cpufreq governor as baseline, and compares it with schedutil in a number of configurations. Using the freq_max value described above offers a moderate advantage in performance and efficiency: sugov-max (freq_max=max_boost) performs the worst on tbench: less throughput and reduced efficiency than the other invariant-schedutil options (see "Data Overview" below). Consider that tbench is generally a problematic case as no schedutil version currently is better than ondemand. sugov-P0 (freq_max=max_P) is the worst on dbench, while the other sugov's can surpass ondemand with less filesystem latency and slightly increased efficiency. 1. DATA OVERVIEW 2. DETAILED PERFORMANCE TABLES 3. POWER CONSUMPTION TABLE 1. DATA OVERVIEW ================ sugov-noinv : non-invariant schedutil governor sugov-max : invariant schedutil, freq_max=max_boost sugov-mid : invariant schedutil, freq_max=midpoint sugov-P0 : invariant schedutil, freq_max=max_P perfgov : performance governor driver : acpi_cpufreq machine : AMD EPYC 7742 (Zen2, aka "Rome"), dual socket, 128 cores / 256 threads, SATA SSD storage, 250G of memory, XFS filesystem Benchmarks are described in the next section. Tilde (~) means the value is the same as baseline. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ondemand perfgov sugov-noinv sugov-max sugov-mid sugov-P0 better if - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - PERFORMANCE RATIOS tbench 1.00 1.44 0.90 0.87 0.93 0.93 higher dbench 1.00 0.91 0.95 0.94 0.94 1.06 lower kernbench 1.00 0.93 ~ ~ ~ 0.97 lower gitsource 1.00 0.66 0.97 0.96 ~ 0.95 lower - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - PERFORMANCE-PER-WATT RATIOS tbench 1.00 1.16 0.84 0.84 0.88 0.85 higher dbench 1.00 1.03 1.02 1.02 1.02 0.93 higher kernbench 1.00 1.05 ~ ~ ~ ~ higher gitsource 1.00 1.46 1.04 1.04 ~ 1.05 higher 2. DETAILED PERFORMANCE TABLES ============================== Benchmark : tbench4 (i.e. dbench4 over the network, actually loopback) Varying parameter : number of clients Unit : MB/sec (higher is better) 5.9.0-ondemand (BASELINE) 5.9.0-perfgov 5.9.0-sugov-noinv - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 427.19 +- 0.16% ( ) 778.35 +- 0.10% ( 82.20%) 346.92 +- 0.14% ( -18.79%) Hmean 2 853.82 +- 0.09% ( ) 1536.23 +- 0.03% ( 79.93%) 694.36 +- 0.05% ( -18.68%) Hmean 4 1657.54 +- 0.12% ( ) 2938.18 +- 0.12% ( 77.26%) 1362.81 +- 0.11% ( -17.78%) Hmean 8 3301.87 +- 0.06% ( ) 5679.10 +- 0.04% ( 72.00%) 2693.35 +- 0.04% ( -18.43%) Hmean 16 6139.65 +- 0.05% ( ) 9498.81 +- 0.04% ( 54.71%) 4889.97 +- 0.17% ( -20.35%) Hmean 32 11170.28 +- 0.09% ( ) 17393.25 +- 0.08% ( 55.71%) 9104.55 +- 0.09% ( -18.49%) Hmean 64 19322.97 +- 0.17% ( ) 31573.91 +- 0.08% ( 63.40%) 18552.52 +- 0.40% ( -3.99%) Hmean 128 30383.71 +- 0.11% ( ) 37416.91 +- 0.15% ( 23.15%) 25938.70 +- 0.41% ( -14.63%) Hmean 256 31143.96 +- 0.41% ( ) 30908.76 +- 0.88% ( -0.76%) 29754.32 +- 0.24% ( -4.46%) Hmean 512 30858.49 +- 0.26% ( ) 38524.60 +- 1.19% ( 24.84%) 42080.39 +- 0.56% ( 36.37%) Hmean 1024 39187.37 +- 0.19% ( ) 36213.86 +- 0.26% ( -7.59%) 39555.98 +- 0.12% ( 0.94%) 5.9.0-sugov-max 5.9.0-sugov-mid 5.9.0-sugov-P0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 352.59 +- 1.03% ( -17.46%) 352.08 +- 0.75% ( -17.58%) 352.31 +- 1.48% ( -17.53%) Hmean 2 697.32 +- 0.08% ( -18.33%) 700.16 +- 0.20% ( -18.00%) 696.79 +- 0.06% ( -18.39%) Hmean 4 1369.88 +- 0.04% ( -17.35%) 1369.72 +- 0.07% ( -17.36%) 1365.91 +- 0.05% ( -17.59%) Hmean 8 2696.79 +- 0.04% ( -18.33%) 2711.06 +- 0.04% ( -17.89%) 2715.10 +- 0.61% ( -17.77%) Hmean 16 4725.03 +- 0.03% ( -23.04%) 4875.65 +- 0.02% ( -20.59%) 4953.05 +- 0.28% ( -19.33%) Hmean 32 9231.65 +- 0.10% ( -17.36%) 8704.89 +- 0.27% ( -22.07%) 10562.02 +- 0.36% ( -5.45%) Hmean 64 15364.27 +- 0.19% ( -20.49%) 17786.64 +- 0.15% ( -7.95%) 19665.40 +- 0.22% ( 1.77%) Hmean 128 42100.58 +- 0.13% ( 38.56%) 34946.28 +- 0.13% ( 15.02%) 38635.79 +- 0.06% ( 27.16%) Hmean 256 30660.23 +- 1.08% ( -1.55%) 32307.67 +- 0.54% ( 3.74%) 31153.27 +- 0.12% ( 0.03%) Hmean 512 24604.32 +- 0.14% ( -20.27%) 40408.50 +- 1.10% ( 30.95%) 38800.29 +- 1.23% ( 25.74%) Hmean 1024 35535.47 +- 0.28% ( -9.32%) 41070.38 +- 2.56% ( 4.81%) 31308.29 +- 2.52% ( -20.11%) Benchmark : dbench (filesystem stressor) Varying parameter : number of clients Unit : seconds (lower is better) NOTE-1: This dbench version measures the average latency of a set of filesystem operations, as we found the traditional dbench metric (throughput) to be misleading. NOTE-2: Due to high variability, we partition the original dataset and apply statistical bootrapping (a resampling method). Accuracy is reported in the form of 95% confidence intervals. 5.9.0-ondemand (BASELINE) 5.9.0-perfgov 5.9.0-sugov-noinv - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - SubAmean 1 98.79 +- 0.92 ( ) 83.36 +- 0.82 ( 15.62%) 84.82 +- 0.92 ( 14.14%) SubAmean 2 116.00 +- 0.89 ( ) 102.12 +- 0.77 ( 11.96%) 109.63 +- 0.89 ( 5.49%) SubAmean 4 149.90 +- 1.03 ( ) 132.12 +- 0.91 ( 11.86%) 143.90 +- 1.15 ( 4.00%) SubAmean 8 182.41 +- 1.13 ( ) 159.86 +- 0.93 ( 12.36%) 165.82 +- 1.03 ( 9.10%) SubAmean 16 237.83 +- 1.23 ( ) 219.46 +- 1.14 ( 7.72%) 229.28 +- 1.19 ( 3.59%) SubAmean 32 334.34 +- 1.49 ( ) 309.94 +- 1.42 ( 7.30%) 321.19 +- 1.36 ( 3.93%) SubAmean 64 576.61 +- 2.16 ( ) 540.75 +- 2.00 ( 6.22%) 551.27 +- 1.99 ( 4.39%) SubAmean 128 1350.07 +- 4.14 ( ) 1205.47 +- 3.20 ( 10.71%) 1280.26 +- 3.75 ( 5.17%) SubAmean 256 3444.42 +- 7.97 ( ) 3698.00 +- 27.43 ( -7.36%) 3494.14 +- 7.81 ( -1.44%) SubAmean 2048 39457.89 +- 29.01 ( ) 34105.33 +- 41.85 ( 13.57%) 39688.52 +- 36.26 ( -0.58%) 5.9.0-sugov-max 5.9.0-sugov-mid 5.9.0-sugov-P0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - SubAmean 1 85.68 +- 1.04 ( 13.27%) 84.16 +- 0.84 ( 14.81%) 83.99 +- 0.90 ( 14.99%) SubAmean 2 108.42 +- 0.95 ( 6.54%) 109.91 +- 1.39 ( 5.24%) 112.06 +- 0.91 ( 3.39%) SubAmean 4 136.90 +- 1.04 ( 8.67%) 137.59 +- 0.93 ( 8.21%) 136.55 +- 0.95 ( 8.91%) SubAmean 8 163.15 +- 0.96 ( 10.56%) 166.07 +- 1.02 ( 8.96%) 165.81 +- 0.99 ( 9.10%) SubAmean 16 224.86 +- 1.12 ( 5.45%) 223.83 +- 1.06 ( 5.89%) 230.66 +- 1.19 ( 3.01%) SubAmean 32 320.51 +- 1.38 ( 4.13%) 322.85 +- 1.49 ( 3.44%) 321.96 +- 1.46 ( 3.70%) SubAmean 64 553.25 +- 1.93 ( 4.05%) 554.19 +- 2.08 ( 3.89%) 562.26 +- 2.22 ( 2.49%) SubAmean 128 1264.35 +- 3.72 ( 6.35%) 1256.99 +- 3.46 ( 6.89%) 2018.97 +- 18.79 ( -49.55%) SubAmean 256 3466.25 +- 8.25 ( -0.63%) 3450.58 +- 8.44 ( -0.18%) 5032.12 +- 38.74 ( -46.09%) SubAmean 2048 39133.10 +- 45.71 ( 0.82%) 39905.95 +- 34.33 ( -1.14%) 53811.86 +-193.04 ( -36.38%) Benchmark : kernbench (kernel compilation) Varying parameter : number of jobs Unit : seconds (lower is better) 5.9.0-ondemand (BASELINE) 5.9.0-perfgov 5.9.0-sugov-noinv - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 471.71 +- 26.61% ( ) 409.88 +- 16.99% ( 13.11%) 430.63 +- 0.18% ( 8.71%) Amean 4 211.87 +- 0.58% ( ) 194.03 +- 0.74% ( 8.42%) 215.33 +- 0.64% ( -1.63%) Amean 8 109.79 +- 1.27% ( ) 101.43 +- 1.53% ( 7.61%) 111.05 +- 1.95% ( -1.15%) Amean 16 59.50 +- 1.28% ( ) 55.61 +- 1.35% ( 6.55%) 59.65 +- 1.78% ( -0.24%) Amean 32 34.94 +- 1.22% ( ) 32.36 +- 1.95% ( 7.41%) 35.44 +- 0.63% ( -1.43%) Amean 64 22.58 +- 0.38% ( ) 20.97 +- 1.28% ( 7.11%) 22.41 +- 1.73% ( 0.74%) Amean 128 17.72 +- 0.44% ( ) 16.68 +- 0.32% ( 5.88%) 17.65 +- 0.96% ( 0.37%) Amean 256 16.44 +- 0.53% ( ) 15.76 +- 0.32% ( 4.18%) 16.76 +- 0.60% ( -1.93%) Amean 512 16.54 +- 0.21% ( ) 15.62 +- 0.41% ( 5.53%) 16.84 +- 0.85% ( -1.83%) 5.9.0-sugov-max 5.9.0-sugov-mid 5.9.0-sugov-P0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 421.30 +- 0.24% ( 10.69%) 419.26 +- 0.15% ( 11.12%) 414.38 +- 0.33% ( 12.15%) Amean 4 217.81 +- 5.53% ( -2.80%) 211.63 +- 0.99% ( 0.12%) 208.43 +- 0.47% ( 1.63%) Amean 8 108.80 +- 0.43% ( 0.90%) 108.48 +- 1.44% ( 1.19%) 108.59 +- 3.08% ( 1.09%) Amean 16 58.84 +- 0.74% ( 1.12%) 58.37 +- 0.94% ( 1.91%) 57.78 +- 0.78% ( 2.90%) Amean 32 34.04 +- 2.00% ( 2.59%) 34.28 +- 1.18% ( 1.91%) 33.98 +- 2.21% ( 2.75%) Amean 64 22.22 +- 1.69% ( 1.60%) 22.27 +- 1.60% ( 1.38%) 22.25 +- 1.41% ( 1.47%) Amean 128 17.55 +- 0.24% ( 0.97%) 17.53 +- 0.94% ( 1.04%) 17.49 +- 0.43% ( 1.30%) Amean 256 16.51 +- 0.46% ( -0.40%) 16.48 +- 0.48% ( -0.19%) 16.44 +- 1.21% ( 0.00%) Amean 512 16.50 +- 0.35% ( 0.19%) 16.35 +- 0.42% ( 1.14%) 16.37 +- 0.33% ( 0.99%) Benchmark : gitsource (time to run the git unit test suite) Varying parameter : none Unit : seconds (lower is better) 5.9.0-ondemand (BASELINE) 5.9.0-perfgov 5.9.0-sugov-noinv - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 1035.76 +- 0.30% ( ) 688.21 +- 0.04% ( 33.56%) 1003.85 +- 0.14% ( 3.08%) 5.9.0-sugov-max 5.9.0-sugov-mid 5.9.0-sugov-P0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 995.82 +- 0.08% ( 3.86%) 1011.98 +- 0.03% ( 2.30%) 986.87 +- 0.19% ( 4.72%) 3. POWER CONSUMPTION TABLE ========================== Average power consumption (watts). - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ondemand perfgov sugov-noinv sugov-max sugov-mid sugov-P0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - tbench4 227.25 281.83 244.17 236.76 241.50 247.99 dbench4 151.97 161.87 157.08 158.10 158.06 153.73 kernbench 162.78 167.22 162.90 164.19 164.65 164.72 gitsource 133.65 139.00 133.04 134.43 134.18 134.32 Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20201112182614.10700-3-ggherdovich@suse.cz |
||
![]() |
41ea667227 |
x86, sched: Calculate frequency invariance for AMD systems
This is the first pass in creating the ability to calculate the frequency invariance on AMD systems. This approach uses the CPPC highest performance and nominal performance values that range from 0 - 255 instead of a high and base frquency. This is because we do not have the ability on AMD to get a highest frequency value. On AMD systems the highest performance and nominal performance vaues do correspond to the highest and base frequencies for the system so using them should produce an appropriate ratio but some tweaking is likely necessary. Due to CPPC being initialized later in boot than when the frequency invariant calculation is currently made, I had to create a callback from the CPPC init code to do the calculation after we have CPPC data. Special thanks to "kernel test robot <lkp@intel.com>" for reporting that compilation of drivers/acpi/cppc_acpi.c is conditional to CONFIG_ACPI_CPPC_LIB, not just CONFIG_ACPI. [ ggherdovich@suse.cz: made safe under CPU hotplug, edited changelog. ] Signed-off-by: Nathan Fontenot <nathan.fontenot@amd.com> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20201112182614.10700-2-ggherdovich@suse.cz |
||
![]() |
29368e0939 |
x86/smpboot: Move rcu_cpu_starting() earlier
The call to rcu_cpu_starting() in mtrr_ap_init() is not early enough in the CPU-hotplug onlining process, which results in lockdep splats as follows: ============================= WARNING: suspicious RCU usage 5.9.0+ #268 Not tainted ----------------------------- kernel/kprobes.c:300 RCU-list traversed in non-reader section!! other info that might help us debug this: RCU used illegally from offline CPU! rcu_scheduler_active = 1, debug_locks = 1 no locks held by swapper/1/0. stack backtrace: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0+ #268 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x77/0x97 __is_insn_slot_addr+0x15d/0x170 kernel_text_address+0xba/0xe0 ? get_stack_info+0x22/0xa0 __kernel_text_address+0x9/0x30 show_trace_log_lvl+0x17d/0x380 ? dump_stack+0x77/0x97 dump_stack+0x77/0x97 __lock_acquire+0xdf7/0x1bf0 lock_acquire+0x258/0x3d0 ? vprintk_emit+0x6d/0x2c0 _raw_spin_lock+0x27/0x40 ? vprintk_emit+0x6d/0x2c0 vprintk_emit+0x6d/0x2c0 printk+0x4d/0x69 start_secondary+0x1c/0x100 secondary_startup_64_no_verify+0xb8/0xbb This is avoided by moving the call to rcu_cpu_starting up near the beginning of the start_secondary() function. Note that the raw_smp_processor_id() is required in order to avoid calling into lockdep before RCU has declared the CPU to be watched for readers. Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/ Reported-by: Qian Cai <cai@redhat.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> |
||
![]() |
8c44963b60 |
x86/apic: Cleanup destination mode
apic::irq_dest_mode is actually a boolean, but defined as u32 and named in a way which does not explain what it means. Make it a boolean and rename it to 'dest_mode_logical' Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20201024213535.443185-9-dwmw2@infradead.org |
||
![]() |
e57d04e5fa |
x86/apic: Get rid of apic:: Dest_logical
struct apic has two members which store information about the destination mode: dest_logical and irq_dest_mode. dest_logical contains a mask which was historically used to set the destination mode in IPI messages. Over time the usage was reduced and the logical/physical functions were seperated. There are only a few places which still use 'dest_logical' but they can use 'irq_dest_mode' instead. irq_dest_mode is actually a boolean where 0 means physical destination mode and 1 means logical destination mode. Of course the name does not reflect the functionality. This will be cleaned up in a subsequent change. Remove apic::dest_logical and fixup the remaining users. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20201024213535.443185-8-dwmw2@infradead.org |