2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

4080 Commits

Author SHA1 Message Date
Colton Lewis
04782e6391 perf/core: Hoist perf_instruction_pointer() and perf_misc_flags()
For clarity, rename the arch-specific definitions of these functions
to perf_arch_* to denote they are arch-specifc. Define the
generic-named functions in one place where they can call the
arch-specific ones as needed.

Signed-off-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Acked-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20241113190156.2145593-3-coltonlewis@google.com
2024-11-14 10:40:01 +01:00
Heiko Carstens
a6122f690a s390/diag: Convert to use flag output macros
Use flag output macros in inline asm to allow for better code generation if
the compiler has support for the flag output constraint.

Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-13 14:31:33 +01:00
Heiko Carstens
5a5897d65b s390/irq: Convert to use flag output macros
Use flag output macros in inline asm to allow for better code generation if
the compiler has support for the flag output constraint.

Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-13 14:31:32 +01:00
Heiko Carstens
ca6dd1faa0 s390/cpcmd: Convert to use flag output macros
Use flag output macros in inline asm to allow for better code generation if
the compiler has support for the flag output constraint.

Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-13 14:31:32 +01:00
Heiko Carstens
d4e50cfe9c s390/topology: Convert to use flag output macros
Use flag output macros in inline asm to allow for better code generation if
the compiler has support for the flag output constraint.

Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-13 14:31:31 +01:00
Heiko Carstens
eade39cc72 s390/sthyi: Convert to use flag output macros
Use flag output macros in inline asm to allow for better code generation if
the compiler has support for the flag output constraint.

Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-13 14:31:31 +01:00
Masahiro Yamada
182c02a6cd s390/syscalls: Convert filechk to if_changed
The filechk macro always executes the syscalltbl script (and discards
the output if there are no changes).

Using if_changed is more efficient because it avoids running the script
when the target is up-to-date and the command remains unchanged.

All other architectures use if_changed for generating syscall headers.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20241111134603.2063226-3-masahiroy@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-12 14:01:30 +01:00
Masahiro Yamada
e17aca2005 s390/syscalls: Remove unnecessary argument of filechk_syshdr
The filechk_syshdr macro receives $@ in both cases, making the argument
redundant.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20241111134603.2063226-2-masahiroy@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-12 14:01:30 +01:00
Masahiro Yamada
0708967e2d s390/syscalls: Avoid creation of arch/arch/ directory
Building the kernel with ARCH=s390 creates a weird arch/arch/ directory.

  $ find arch/arch
  arch/arch
  arch/arch/s390
  arch/arch/s390/include
  arch/arch/s390/include/generated
  arch/arch/s390/include/generated/asm
  arch/arch/s390/include/generated/uapi
  arch/arch/s390/include/generated/uapi/asm

The root cause is 'targets' in arch/s390/kernel/syscalls/Makefile,
where the relative path is incorrect.

Strictly speaking, 'targets' was not necessary in the first place
because this Makefile uses 'filechk' instead of 'if_changed'.

However, this commit keeps it, as it will be useful when converting
'filechk' to 'if_changed' later.

Fixes: 5c75824d91 ("s390/syscalls: add Makefile to generate system call header files")
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20241111134603.2063226-1-masahiroy@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-12 14:01:29 +01:00
Heiko Carstens
42898f74b2 s390/perf_cpum_cf: Convert to use local64_try_cmpxchg()
Convert local64_cmpxchg() usages to local64_try_cmpxchg() in
order to generate slightly better code.

Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-12 14:01:29 +01:00
Heiko Carstens
e449399ffd s390/perf_cpum_sf: Convert to use try_cmpxchg128()
Convert cmpxchg128() usages to try_cmpxchg128() in order to generate
slightly better code.

Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-12 14:01:29 +01:00
Alexander Egorenkov
01bfb451a3 s390/dump: Add firmware sysfs attribute for dump area size
Dump tools from s390-tools such as zipl need to know the correct dump area
size of the machine they run on in order to be able to create valid
standalone dumper images. Therefore, allow it to be obtained through
the new sysfs read-only attribute /sys/firmware/dump/dump_area_size.

Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Suggested-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-12 14:01:27 +01:00
Christian Göttsche
6140be90ec fs/xattr: add *at family syscalls
Add the four syscalls setxattrat(), getxattrat(), listxattrat() and
removexattrat().  Those can be used to operate on extended attributes,
especially security related ones, either relative to a pinned directory
or on a file descriptor without read access, avoiding a
/proc/<pid>/fd/<fd> detour, requiring a mounted procfs.

One use case will be setfiles(8) setting SELinux file contexts
("security.selinux") without race conditions and without a file
descriptor opened with read access requiring SELinux read permission.

Use the do_{name}at() pattern from fs/open.c.

Pass the value of the extended attribute, its length, and for
setxattrat(2) the command (XATTR_CREATE or XATTR_REPLACE) via an added
struct xattr_args to not exceed six syscall arguments and not
merging the AT_* and XATTR_* flags.

[AV: fixes by Christian Brauner folded in, the entire thing rebased on
top of {filename,file}_...xattr() primitives, treatment of empty
pathnames regularized.  As the result, AT_EMPTY_PATH+NULL handling
is cheap, so f...(2) can use it]

Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Link: https://lore.kernel.org/r/20240426162042.191916-1-cgoettsche@seltendoof.de
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
CC: x86@kernel.org
CC: linux-alpha@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-arm-kernel@lists.infradead.org
CC: linux-ia64@vger.kernel.org
CC: linux-m68k@lists.linux-m68k.org
CC: linux-mips@vger.kernel.org
CC: linux-parisc@vger.kernel.org
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-s390@vger.kernel.org
CC: linux-sh@vger.kernel.org
CC: sparclinux@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
CC: audit@vger.kernel.org
CC: linux-arch@vger.kernel.org
CC: linux-api@vger.kernel.org
CC: linux-security-module@vger.kernel.org
CC: selinux@vger.kernel.org
[brauner: slight tweaks]
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06 12:59:44 -05:00
Thomas Weißschuh
98333a84e3 s390/vdso: Drop LBASE_VDSO
This constant is always "0", providing no value and making the logic
harder to understand.
Also prepare for a consolidation of the vdso linkerscript logic by
aligning it with other architectures.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/all/20241010-vdso-generic-base-v1-3-b64f0842d512@linutronix.de
2024-11-02 12:37:32 +01:00
Thomas Richter
f55bd479d8 s390/cpum_sf: Fix and protect memory allocation of SDBs with mutex
Reservation of the PMU hardware is done at first event creation
and is protected by a pair of mutex_lock() and mutex_unlock().
After reservation of the PMU hardware the memory
required for the PMUs the event is to be installed on is
allocated by allocate_buffers() and alloc_sampling_buffer().
This done outside of the mutex protection.
Without mutex protection two or more concurrent invocations of
perf_event_init() may run in parallel.
This can lead to allocation of Sample Data Blocks (SDBs)
multiple times for the same PMU.
Prevent this and protect memory allocation of SDBs by
mutex.

Fixes: 8a6fe8f21e ("s390/cpum_sf: Use refcount_t instead of atomic_t")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-31 10:50:06 +01:00
Sven Schnelle
2d7de7a301 s390/time: Add PtP driver
Add a small PtP driver which allows user space to get
the values of the physical and tod clock. This allows
programs like chrony to use STP as clock source and
steer the kernel clock. The physical clock can be used
as a debugging aid to get the clock without any additional
offsets like STP steering or LPAR offset.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Link: https://patch.msgid.link/20241023065601.449586-3-svens@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-30 17:02:39 -07:00
Sven Schnelle
f247fd22e9 s390/time: Add clocksource id to TOD clock
To allow specifying the clock source in the upcoming PtP driver,
add a clocksource ID to the s390 TOD clock.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Link: https://patch.msgid.link/20241023065601.449586-2-svens@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-30 17:02:38 -07:00
Claudio Imbrenda
e8d8d97218 s390: Remove gmap pointer from lowcore
Remove the gmap pointer from lowcore, since it is not used anymore.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20241022120601.167009-9-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:49:19 +01:00
Claudio Imbrenda
05066cafa9 s390/mm/fault: Handle guest-related program interrupts in KVM
Any program interrupt that happens in the host during the execution of
a KVM guest will now short circuit the fault handler and return to KVM
immediately. Guest fault handling (including pfault) will happen
entirely inside KVM.

When sie64a() returns zero, current->thread.gmap_int_code will contain
the program interrupt number that caused the exit, or zero if the exit
was not caused by a host program interrupt.

KVM will now take care of handling all guest faults in vcpu_post_run().

Since gmap faults will not be visible by the rest of the kernel, remove
GMAP_FAULT, the linux fault handlers for secure execution faults, the
exception table entries for the sie instruction, the nop padding after
the sie instruction, and all other references to guest faults from the
s390 code.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Co-developed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20241022120601.167009-6-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:49:18 +01:00
Claudio Imbrenda
f96cb0d61d s390/entry: Remove __GMAP_ASCE and use _PIF_GUEST_FAULT again
Now that the guest ASCE is passed as a parameter to __sie64a(),
_PIF_GUEST_FAULT can be used again to determine whether the fault was a
guest or host fault.

Since the guest ASCE will not be taken from the gmap pointer in lowcore
anymore, __GMAP_ASCE can be removed. For the same reason the guest
ASCE needs now to be saved into the cr1 save area unconditionally.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20241022120601.167009-2-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:49:18 +01:00
Thomas Richter
f4d5e64c5c s390/cpum_sf: Rework call to sf_disable()
Setup_pmc_cpu() function body consists of one single switch
statement with two cases PMC_INIT and PMC_RELEASE.
In both of these cases sf_disable() is invoked to turn off the
CPU Measurement sampling facility.
Move sf_disable() out of the switch statement.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:17:20 +01:00
Thomas Richter
a0bd7dacbd s390/cpum_sf: Handle CPU hotplug remove during sampling
CPU hotplug remove handling triggers the following function
call sequence:

   CPUHP_AP_PERF_S390_SF_ONLINE  --> s390_pmu_sf_offline_cpu()
   ...
   CPUHP_AP_PERF_ONLINE          --> perf_event_exit_cpu()

The s390 CPUMF sampling CPU hotplug handler invokes:

 s390_pmu_sf_offline_cpu()
 +-->  cpusf_pmu_setup()
       +--> setup_pmc_cpu()
            +--> deallocate_buffers()

This function de-allocates all sampling data buffers (SDBs) allocated
for that CPU at event initialization. It also clears the
PMU_F_RESERVED bit. The CPU is gone and can not be sampled.

With the event still being active on the removed CPU, the CPU event
hotplug support in kernel performance subsystem triggers the
following function calls on the removed CPU:

  perf_event_exit_cpu()
  +--> perf_event_exit_cpu_context()
       +--> __perf_event_exit_context()
	    +--> __perf_remove_from_context()
	         +--> event_sched_out()
	              +--> cpumsf_pmu_del()
	                   +--> cpumsf_pmu_stop()
                                +--> hw_perf_event_update()

to stop and remove the event. During removal of the event, the
sampling device driver tries to read out the remaining samples from
the sample data buffers (SDBs). But they have already been freed
(and may have been re-assigned). This may lead to a use after free
situation in which case the samples are most likely invalid. In the
best case the memory has not been reassigned and still contains
valid data.

Remedy this situation and check if the CPU is still in reserved
state (bit PMU_F_RESERVED set). In this case the SDBs have not been
released an contain valid data. This is always the case when
the event is removed (and no CPU hotplug off occured).
If the PMU_F_RESERVED bit is not set, the SDB buffers are gone.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:17:18 +01:00
Thomas Richter
0c25132396 s390/cpum_sf: Fix format string in pr_err()
Fix format string in pr_err() and use the built-in
hexadecimal prefix %#x to display a number with a leading
hexadecimal indicator 0x.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:17:18 +01:00
Thomas Richter
f2e9d46ac6 s390/cpum_sf: Use sf_buffer_available()
Use sf_buffer_available() consistently throughtout the code
to test for the existence of sampling buffer.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:17:18 +01:00
Thomas Richter
de6d22ccdc s390/cpum_sf: Consistently use goto out for function exit
When the sampling buffer allocation fails in
__hw_perf_event_init(), jump to the end of the function
and return the result. This is consistent with the other
error handling and return conditions in this function.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Reviewed-By: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:17:18 +01:00
Thomas Richter
db417646fe s390/cpum_sf: Do not re-enable event after deletion
Event delete removes an event from the event list, but common
code invokes the PMU's enable function later on. This happens
in event_sched_out() and leads to the following call sequence:

  event_sched_out()
  +--> cpumsf_pmu_del()
  +--> cpumsf_pmu_enable()

In cpumsf_pmu_enable() return immediately when the event is not
active. Also remove an unneeded if clause. That if() statement
is only reached when flag PMU_F_IN_USE has been set in
cpumsf_pmu_add(). And this function also sets cpuhw->event
to a valid value.

Remove WARN_ON_ONCE() statement which never triggered.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:17:18 +01:00
Steffen Eiden
f00469a642 s390/uv: Retrieve UV secrets sysfs support
Reflect the updated content in the query information UVC to the sysfs at
/sys/firmware/query

* new UV-query sysfs entry for the maximum number of retrievable
  secrets the UV can store for one secure guest.
* new UV-query sysfs entry for the maximum number of association
  secrets the UV can store for one secure guest.
* max_secrets contains the sum of max association and max retrievable
  secrets.

Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Signed-off-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20241024062638.1465970-7-seiden@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:17:17 +01:00
Steffen Eiden
7c9137af20 s390/uv: Retrieve UV secrets support
Provide a kernel API to retrieve secrets from the UV secret store.
Add two new functions:
* `uv_get_secret_metadata` - get metadata for a given secret identifier
* `uv_retrieve_secret` - get the secret value for the secret index

With those two functions one can extract the secret for a given secret
id, if the secret is retrievable.

Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Signed-off-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20241024084107.2418186-1-seiden@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:17:16 +01:00
Steffen Eiden
28a51ee8eb s390/uv: Provide host-key hashes in sysfs
Utilize the new Query Ultravisor Keys UVC to give user space the
information which host-keys are installed on the system.

Create a new sysfs directory 'firmware/uv/keys' that contains the hash
of the host-key and the backup host-key of that system. Additionally,
the file 'all' contains the response from the UVC possibly containing
more key-hashes than currently known.

Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20241023075529.2561384-1-seiden@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:17:16 +01:00
Steffen Eiden
bb4ad73a28 s390/uv: Refactor uv-sysfs creation
Streamline the sysfs generation to make it more extensible.
Add a function to create a sysfs entry in the uv-sysfs dir.
Use this function for the query directory.

Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Signed-off-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20241015113940.3088249-2-seiden@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:17:16 +01:00
Mete Durlu
401721a54c s390/ipl: Switch over to sysfs_emit()
Per Documentation/filesystems/sysfs.rst, sysfs_emit() is preferred over
sprintf for presenting attributes to user space. Convert the left-over
uses in the s390/ipl code.

Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-25 16:03:24 +02:00
Mete Durlu
c50262498d s390/nospec: Switch over to sysfs_emit()
Per Documentation/filesystems/sysfs.rst, sysfs_emit() is preferred over
sprintf for presenting attributes to user space. Convert the left-over
uses in the s390/nospec-sysfs code.

Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-25 16:03:24 +02:00
Mete Durlu
d151f8f788 s390/perf_event: Switch over to sysfs_emit()
Per Documentation/filesystems/sysfs.rst, sysfs_emit() is preferred over
sprintf for presenting attributes to user space. Convert the left-over
uses in the s390/perf_event code.

Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-25 16:03:24 +02:00
Mete Durlu
5b2a85a24b s390/smp: Switch over to sysfs_emit()
Per Documentation/filesystems/sysfs.rst, sysfs_emit() is preferred over
sprintf for presenting attributes to user space. Convert the left-over
uses in the s390/smp code.

Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-25 16:03:24 +02:00
Mete Durlu
b4b920cded s390/time: Switch over to sysfs_emit()
Per Documentation/filesystems/sysfs.rst, sysfs_emit() is preferred over
sprintf for presenting attributes to user space. Convert the left-over
uses in the s390/time code.

Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-25 16:03:23 +02:00
Mete Durlu
f94de4f17b s390/topology: Switch over to sysfs_emit()
Per Documentation/filesystems/sysfs.rst, sysfs_emit() is preferred over
sprintf for presenting attributes to user space. Convert the left-over
uses in the s390/topology code.

Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-25 16:03:23 +02:00
David Hildenbrand
82a0fcb1ad s390/kdump: Provide is_kdump_kernel() implementation
s390 sets "elfcorehdr_addr = ELFCORE_ADDR_MAX;" early during
setup_arch() to deactivate the "elfcorehdr= kernel" parameter,
resulting in is_kdump_kernel() returning "false".

During vmcore_init()->elfcorehdr_alloc(), if on a dump kernel and
allocation succeeded, elfcorehdr_addr will be set to a valid address
and is_kdump_kernel() will consequently return "true".

is_kdump_kernel() should return a consistent result during all boot
stages, and properly return "true" if in a kdump environment - just
like it is done on powerpc where "false" is indicated in fadump
environments, as added in commit b098f1c323 ("powerpc/fadump: make
is_kdump_kernel() return false when fadump is active").

Similarly provide a custom is_kdump_kernel() implementation that will only
return "true" in kdump environments, and will do so consistently during
boot.

Update the documentation of dump_available().

Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Link: https://lore.kernel.org/r/20241023090651.1115507-1-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-25 16:03:23 +02:00
Heiko Carstens
e6ebf0d651 s390: Fix various typos
Run codespell on arch/s390 and drivers/s390 and fix all typos.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-25 16:03:23 +02:00
Bart Van Assche
951248383a s390/irq: Switch to irq_get_nr_irqs()
Use the irq_get_nr_irqs() function instead of the global variable
'nr_irqs'. Prepare for changing 'nr_irqs' from an exported global
variable into a variable with file scope.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241015190953.1266194-6-bvanassche@acm.org
2024-10-16 21:56:57 +02:00
Thomas Weißschuh
3aa8881ebd s390/vdso: Remove timekeeper includes
Since the generic VDSO clock mode storage is used, this header file is
unused and can be removed.

This avoids including a non-VDSO header while building the VDSO,
which can lead to compilation errors.

Also drop the comment which is out of date and in the wrong place.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/all/20241010-vdso-generic-arch_update_vsyscall-v1-6-7fe5a3ea4382@linutronix.de
2024-10-15 17:50:29 +02:00
Steven Rostedt
7888af4166 ftrace: Make ftrace_regs abstract from direct use
ftrace_regs was created to hold registers that store information to save
function parameters, return value and stack. Since it is a subset of
pt_regs, it should only be used by its accessor functions. But because
pt_regs can easily be taken from ftrace_regs (on most archs), it is
tempting to use it directly. But when running on other architectures, it
may fail to build or worse, build but crash the kernel!

Instead, make struct ftrace_regs an empty structure and have the
architectures define __arch_ftrace_regs and all the accessor functions
will typecast to it to get to the actual fields. This will help avoid
usage of ftrace_regs directly.

Link: https://lore.kernel.org/all/20241007171027.629bdafd@gandalf.local.home/

Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
Cc: "x86@kernel.org" <x86@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Paul  Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas  Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav  Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/20241008230628.958778821@goodmis.org
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-10 20:18:01 -04:00
Antonia Jonas
f626e79bfe s390/cpum_cf: Correct typo CYLCE
Signed-off-by: Antonia Jonas <antonia@toertel.de>
Link: https://lore.kernel.org/r/20241003115648.26188-1-antonia@toertel.de
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-10 15:32:44 +02:00
Thomas Richter
31d1d8a35e s390/cpum_sf: Set bit PMU_F_ENABLED enabled after lpp() invocation
Set PMU enabled bit after lpp() call. This ensures the proper
task PID is loaded by the llp instruction into Program Parameter
register from where it is copied to each sampling data buffer
(SDB) sample entry after the sampling was enabled.

The barrier() instruction is not needed as cpumsf_pmu_enable()
changes a CPU specific variable. Only the CPU that task is
running on changes structure members.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-10 15:32:43 +02:00
Linus Torvalds
e08d227840 more s390 updates for 6.12 merge window
- Clean up and improve vdso code: use SYM_* macros for function and
   data annotations, add CFI annotations to fix GDB unwinding, optimize
   the chacha20 implementation
 
 - Add vfio-ap driver feature advertisement for use by libvirt and mdevctl
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmb39nkACgkQjYWKoQLX
 FBiacggAlHrwDYIZdr7bd+0bJa8Wq5STrdo1XiohQhCrRg+4rXRaquEQszWVheRk
 qi/9u+MKJFtbd4VMbf+WqEi2P4Pnn5cByv+As3B4RMR0Tp3uQK5TlDbh7wbJ+pIG
 AQQoL5D2z05rvbYgc1Wvt/3lgrfs2EzUY3cAyIMmFwSp0On+psDuwuFVtWzH0jCr
 fxCwX7mrF6OoBMR0QSUxAvoOujUhd4CANk3n6rNxJfi2AC/gX/ZkFUrV5a3FvJ2Z
 X7fQgDc3rivsm5wFuoBvlrH5J1jhmWtsg8Tiygos7BBehpQ5NUsJMDGS8kR+avbF
 iYKg1oMy5/ZXekBjqkEMO6Vp3nCrdg==
 =Idf7
 -----END PGP SIGNATURE-----

Merge tag 's390-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull more s390 updates from Vasily Gorbik:

 - Clean up and improve vdso code: use SYM_* macros for function and
   data annotations, add CFI annotations to fix GDB unwinding, optimize
   the chacha20 implementation

 - Add vfio-ap driver feature advertisement for use by libvirt and
   mdevctl

* tag 's390-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/vfio-ap: Driver feature advertisement
  s390/vdso: Use one large alternative instead of an alternative branch
  s390/vdso: Use SYM_DATA_START_LOCAL()/SYM_DATA_END() for data objects
  tools: Add additional SYM_*() stubs to linkage.h
  s390/vdso: Use macros for annotation of asm functions
  s390/vdso: Add CFI annotations to __arch_chacha20_blocks_nostack()
  s390/vdso: Fix comment within __arch_chacha20_blocks_nostack()
  s390/vdso: Get rid of permutation constants
2024-09-28 09:11:46 -07:00
Al Viro
cb787f4ac0 [tree-wide] finally take no_llseek out
no_llseek had been defined to NULL two years ago, in commit 868941b144
("fs: remove no_llseek")

To quote that commit,

  At -rc1 we'll need do a mechanical removal of no_llseek -

  git grep -l -w no_llseek | grep -v porting.rst | while read i; do
	sed -i '/\<no_llseek\>/d' $i
  done

  would do it.

Unfortunately, that hadn't been done.  Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
	.llseek = no_llseek,
so it's obviously safe.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-27 08:18:43 -07:00
Linus Torvalds
348325d644 asm-generic updates for 6.12
These are only two small patches, one cleanup for arch/alpha
 and a preparation patch cleaning up the handling of runtime
 constants in the linker scripts.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmboHV0ACgkQYKtH/8kJ
 UifHfhAAqTHHxxe+HiphGBPHN0ODyLVUs7fOQHtLOSmJlQa6x1TCR/+1nL1kTDbe
 j6EcIRxZrllQZ+jZBA8z2XsAmjjBLUxCB4yu6oxYJh8OdFyqeVM/myZEr2TAyb0o
 A3D9b+rfnY8sr9XaFHSHGWbh4c33cGQhACumHVAjtPvU06Voskq4pAf9ZnpGkNBe
 AdKNTVG6+w84dKUNuzXcexP8d7SnsXNfd6T9+evtW/M+fziWzs3aPQr+GZED96E5
 8IRldXi2nzIwm9LT5IzZAt+QvpVb2Qob1+rej9p5WpptGp840CROTo61SwaYHCMV
 DDxTlmADsApWJQ3B5gDu6QS2jXT4eeOrY3JI2baeCyOV6auj15UXKiWc2QVoHOVU
 6+PzlSFuLatI6WsxXfOcD0o3bfQXMKS6zCC/4eD7Y/SmmMqBbL5+d9sU5lwkiOFl
 swoswF4HTwo5d6NdkSuJOt6KA/V8a68lBhKYBXHu2yuLi/LDNOaipEvBHQLzfnlY
 91e5DtDiHK9CYDNkwiR+bV9rQnhA535JSlfR8VtpU/SJTTjyF+dkt9JGPdivXoIA
 8Zv+DN/oyrahUtCrgzzPXahOuBrfD/WfIajsvpEK6vNPuBhscsZFg/thc70FMIXo
 qn8Dmpi/CnDWFNOy0xO0cbYWrGBGn9E7kzbSZ78tUIjPUmmEKfk=
 =OOMl
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic updates from Arnd Bergmann:
 "These are only two small patches, one cleanup for arch/alpha and a
  preparation patch cleaning up the handling of runtime constants in the
  linker scripts"

* tag 'asm-generic-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  runtime constants: move list of constants to vmlinux.lds.h
  alpha: no need to include asm/xchg.h twice
2024-09-26 11:54:40 -07:00
Heiko Carstens
d714abee5f s390/vdso: Use one large alternative instead of an alternative branch
Replace the alternative branch with a larger alternative that contains
both paths. That way the two paths are closer together and it is easier
to change both paths if the need should arise.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-23 17:57:04 +02:00
Heiko Carstens
c902b578ee s390/vdso: Use SYM_DATA_START_LOCAL()/SYM_DATA_END() for data objects
Use SYM_DATA_START_LOCAL()/SYM_DATA_END() in vgetrandom-chacha.S so
that the constants end up in an object with correct size:

readelf -Ws vgetrandom-chacha.o
   Num:    Value          Size Type    Bind   Vis      Ndx Name
   ...
     5: 0000000000000000    32 OBJECT  LOCAL  DEFAULT    5 chacha20_constants

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-23 17:57:04 +02:00
Jens Remus
d361390d9f s390/vdso: Use macros for annotation of asm functions
Use the macros SYM_FUNC_START() and SYM_FUNC_END() to annotate the
functions in assembly.

Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-23 17:57:04 +02:00
Jens Remus
5cccfc8be6 s390/vdso: Add CFI annotations to __arch_chacha20_blocks_nostack()
This allows proper unwinding, for instance when using a debugger such
as GDB.

Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-23 17:57:04 +02:00
Heiko Carstens
ff35a3f0ca s390/vdso: Fix comment within __arch_chacha20_blocks_nostack()
Fix comment within __arch_chacha20_blocks_nostack() so the comment
reflects what the code is doing.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-23 17:57:04 +02:00
Heiko Carstens
8e391ae060 s390/vdso: Get rid of permutation constants
The three byte masks for VECTOR PERMUTE are not needed, since the
instruction VECTOR SHIFT LEFT DOUBLE BY BYTE can be used to implement
the required "rotate left" easily.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-23 17:57:04 +02:00
Linus Torvalds
1ec6d09789 s390 updates for 6.12 merge window
- Optimize ftrace and kprobes code patching and avoid stop machine for
   kprobes if sequential instruction fetching facility is available
 
 - Add hiperdispatch feature to dynamically adjust CPU capacity in
   vertical polarization to improve scheduling efficiency and overall
   performance. Also add infrastructure for handling warning track
   interrupts (WTI), allowing for graceful CPU preemption
 
 - Rework crypto code pkey module and split it into separate, independent
   modules for sysfs, PCKMO, CCA, and EP11, allowing modules to load only
   when the relevant hardware is available
 
 - Add hardware acceleration for HMAC modes and the full AES-XTS cipher,
   utilizing message-security assist extensions (MSA) 10 and 11. It
   introduces new shash implementations for HMAC-SHA224/256/384/512 and
   registers the hardware-accelerated AES-XTS cipher as the preferred
   option. Also add clear key token support
 
 - Add MSA 10 and 11 processor activity instrumentation counters to perf
   and update PAI Extension 1 NNPA counters
 
 - Cleanup cpu sampling facility code and rework debug/WARN_ON_ONCE
   statements
 
 - Add support for SHA3 performance enhancements introduced with MSA 12
 
 - Add support for the query authentication information feature of
   MSA 13 and introduce the KDSA CPACF instruction. Provide query and query
   authentication information in sysfs, enabling tools like cpacfinfo to
   present this data in a human-readable form
 
 - Update kernel disassembler instructions
 
 - Always enable EXPOLINE_EXTERN if supported by the compiler to ensure
   kpatch compatibility
 
 - Add missing warning handling and relocated lowcore support to the
   early program check handler
 
 - Optimize ftrace_return_address() and avoid calling unwinder
 
 - Make modules use kernel ftrace trampolines
 
 - Strip relocs from the final vmlinux ELF file to make it roughly 2
   times smaller
 
 - Dump register contents and call trace for early crashes to the console
 
 - Generate ptdump address marker array dynamically
 
 - Fix rcu_sched stalls that might occur when adding or removing large
   amounts of pages at once to or from the CMM balloon
 
 - Fix deadlock caused by recursive lock of the AP bus scan mutex
 
 - Unify sync and async register save areas in entry code
 
 - Cleanup debug prints in crypto code
 
 - Various cleanup and sanitizing patches for the decompressor
 
 - Various small ftrace cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmbsZawACgkQjYWKoQLX
 FBg+Ogf+NiKPfvI14NcTwnOHB6qz8ApPdGfN9bNVtQxtK3epeAvtj0cMonAuKpRg
 xckTRRd8y0guhCT7Q2+WitSgA5eYDn+u9/Ux5YuKUdUdXolQ0D64BJNtVeEFkmJj
 s+Lesb8cVI9T2VBZOpuF9lJigfsDALBkFroqN4MDudDeahS+qy33bAc0OoqYNXHo
 S6OwPK1/tEG9O/oTN2V4mN+aP0B3/dl7Msezb0gfAXQJA+WUAyMNK0RHvoG9uzaa
 BWAyWWYABj6woGZEAQAzXcbzkQiRPixTqZVe6e4YndXhIlEnB/Z2AQFdTpT9V7En
 eOmmve3QuJa0hkF9q4H/anvOMPntTg==
 =Xagq
 -----END PGP SIGNATURE-----

Merge tag 's390-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Vasily Gorbik:

 - Optimize ftrace and kprobes code patching and avoid stop machine for
   kprobes if sequential instruction fetching facility is available

 - Add hiperdispatch feature to dynamically adjust CPU capacity in
   vertical polarization to improve scheduling efficiency and overall
   performance. Also add infrastructure for handling warning track
   interrupts (WTI), allowing for graceful CPU preemption

 - Rework crypto code pkey module and split it into separate,
   independent modules for sysfs, PCKMO, CCA, and EP11, allowing modules
   to load only when the relevant hardware is available

 - Add hardware acceleration for HMAC modes and the full AES-XTS cipher,
   utilizing message-security assist extensions (MSA) 10 and 11. It
   introduces new shash implementations for HMAC-SHA224/256/384/512 and
   registers the hardware-accelerated AES-XTS cipher as the preferred
   option. Also add clear key token support

 - Add MSA 10 and 11 processor activity instrumentation counters to perf
   and update PAI Extension 1 NNPA counters

 - Cleanup cpu sampling facility code and rework debug/WARN_ON_ONCE
   statements

 - Add support for SHA3 performance enhancements introduced with MSA 12

 - Add support for the query authentication information feature of MSA
   13 and introduce the KDSA CPACF instruction. Provide query and query
   authentication information in sysfs, enabling tools like cpacfinfo to
   present this data in a human-readable form

 - Update kernel disassembler instructions

 - Always enable EXPOLINE_EXTERN if supported by the compiler to ensure
   kpatch compatibility

 - Add missing warning handling and relocated lowcore support to the
   early program check handler

 - Optimize ftrace_return_address() and avoid calling unwinder

 - Make modules use kernel ftrace trampolines

 - Strip relocs from the final vmlinux ELF file to make it roughly 2
   times smaller

 - Dump register contents and call trace for early crashes to the
   console

 - Generate ptdump address marker array dynamically

 - Fix rcu_sched stalls that might occur when adding or removing large
   amounts of pages at once to or from the CMM balloon

 - Fix deadlock caused by recursive lock of the AP bus scan mutex

 - Unify sync and async register save areas in entry code

 - Cleanup debug prints in crypto code

 - Various cleanup and sanitizing patches for the decompressor

 - Various small ftrace cleanups

* tag 's390-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (84 commits)
  s390/crypto: Display Query and Query Authentication Information in sysfs
  s390/crypto: Add Support for Query Authentication Information
  s390/crypto: Rework RRE and RRF CPACF inline functions
  s390/crypto: Add KDSA CPACF Instruction
  s390/disassembler: Remove duplicate instruction format RSY_RDRU
  s390/boot: Move boot_printk() code to own file
  s390/boot: Use boot_printk() instead of sclp_early_printk()
  s390/boot: Rename decompressor_printk() to boot_printk()
  s390/boot: Compile all files with the same march flag
  s390: Use MARCH_HAS_*_FEATURES defines
  s390: Provide MARCH_HAS_*_FEATURES defines
  s390/facility: Disable compile time optimization for decompressor code
  s390/boot: Increase minimum architecture to z10
  s390/als: Remove obsolete comment
  s390/sha3: Fix SHA3 selftests failures
  s390/pkey: Add AES xts and HMAC clear key token support
  s390/cpacf: Add MSA 10 and 11 new PCKMO functions
  s390/mm: Add cond_resched() to cmm_alloc/free_pages()
  s390/pai_ext: Update PAI extension 1 counters
  s390/pai_crypto: Add support for MSA 10 and 11 pai counters
  ...
2024-09-21 09:02:54 -07:00
Linus Torvalds
617a814f14 ALong with the usual shower of singleton patches, notable patch series in
this pull request are:
 
 "Align kvrealloc() with krealloc()" from Danilo Krummrich.  Adds
 consistency to the APIs and behaviour of these two core allocation
 functions.  This also simplifies/enables Rustification.
 
 "Some cleanups for shmem" from Baolin Wang.  No functional changes - mode
 code reuse, better function naming, logic simplifications.
 
 "mm: some small page fault cleanups" from Josef Bacik.  No functional
 changes - code cleanups only.
 
 "Various memory tiering fixes" from Zi Yan.  A small fix and a little
 cleanup.
 
 "mm/swap: remove boilerplate" from Yu Zhao.  Code cleanups and
 simplifications and .text shrinkage.
 
 "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt.  This
 is a feature, it adds new feilds to /proc/vmstat such as
 
     $ grep kstack /proc/vmstat
     kstack_1k 3
     kstack_2k 188
     kstack_4k 11391
     kstack_8k 243
     kstack_16k 0
 
 which tells us that 11391 processes used 4k of stack while none at all
 used 16k.  Useful for some system tuning things, but partivularly useful
 for "the dynamic kernel stack project".
 
 "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov.
 Teaches kmemleak to detect leaksage of percpu memory.
 
 "mm: memcg: page counters optimizations" from Roman Gushchin.  "3
 independent small optimizations of page counters".
 
 "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David
 Hildenbrand.  Improves PTE/PMD splitlock detection, makes powerpc/8xx work
 correctly by design rather than by accident.
 
 "mm: remove arch_make_page_accessible()" from David Hildenbrand.  Some
 folio conversions which make arch_make_page_accessible() unneeded.
 
 "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel.
 Cleans up and fixes our handling of the resetting of the cgroup/process
 peak-memory-use detector.
 
 "Make core VMA operations internal and testable" from Lorenzo Stoakes.
 Rationalizaion and encapsulation of the VMA manipulation APIs.  With a
 view to better enable testing of the VMA functions, even from a
 userspace-only harness.
 
 "mm: zswap: fixes for global shrinker" from Takero Funaki.  Fix issues in
 the zswap global shrinker, resulting in improved performance.
 
 "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao.  Fill in
 some missing info in /proc/zoneinfo.
 
 "mm: replace follow_page() by folio_walk" from David Hildenbrand.  Code
 cleanups and rationalizations (conversion to folio_walk()) resulting in
 the removal of follow_page().
 
 "improving dynamic zswap shrinker protection scheme" from Nhat Pham.  Some
 tuning to improve zswap's dynamic shrinker.  Significant reductions in
 swapin and improvements in performance are shown.
 
 "mm: Fix several issues with unaccepted memory" from Kirill Shutemov.
 Improvements to the new unaccepted memory feature,
 
 "mm/mprotect: Fix dax puds" from Peter Xu.  Implements mprotect on DAX
 PUDs.  This was missing, although nobody seems to have notied yet.
 
 "Introduce a store type enum for the Maple tree" from Sidhartha Kumar.
 Cleanups and modest performance improvements for the maple tree library
 code.
 
 "memcg: further decouple v1 code from v2" from Shakeel Butt.  Move more
 cgroup v1 remnants away from the v2 memcg code.
 
 "memcg: initiate deprecation of v1 features" from Shakeel Butt.  Adds
 various warnings telling users that memcg v1 features are deprecated.
 
 "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li.
 Greatly improves the success rate of the mTHP swap allocation.
 
 "mm: introduce numa_memblks" from Mike Rapoport.  Moves various disparate
 per-arch implementations of numa_memblk code into generic code.
 
 "mm: batch free swaps for zap_pte_range()" from Barry Song.  Greatly
 improves the performance of munmap() of swap-filled ptes.
 
 "support large folio swap-out and swap-in for shmem" from Baolin Wang.
 With this series we no longer split shmem large folios into simgle-page
 folios when swapping out shmem.
 
 "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao.  Nice performance
 improvements and code reductions for gigantic folios.
 
 "support shmem mTHP collapse" from Baolin Wang.  Adds support for
 khugepaged's collapsing of shmem mTHP folios.
 
 "mm: Optimize mseal checks" from Pedro Falcato.  Fixes an mprotect()
 performance regression due to the addition of mseal().
 
 "Increase the number of bits available in page_type" from Matthew Wilcox.
 Increases the number of bits available in page_type!
 
 "Simplify the page flags a little" from Matthew Wilcox.  Many legacy page
 flags are now folio flags, so the page-based flags and their
 accessors/mutators can be removed.
 
 "mm: store zero pages to be swapped out in a bitmap" from Usama Arif.  An
 optimization which permits us to avoid writing/reading zero-filled zswap
 pages to backing store.
 
 "Avoid MAP_FIXED gap exposure" from Liam Howlett.  Fixes a race window
 which occurs when a MAP_FIXED operqtion is occurring during an unrelated
 vma tree walk.
 
 "mm: remove vma_merge()" from Lorenzo Stoakes.  Major rotorooting of the
 vma_merge() functionality, making ot cleaner, more testable and better
 tested.
 
 "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.  Minor
 fixups of DAMON selftests and kunit tests.
 
 "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.  Code
 cleanups and folio conversions.
 
 "Shmem mTHP controls and stats improvements" from Ryan Roberts.  Cleanups
 for shmem controls and stats.
 
 "mm: count the number of anonymous THPs per size" from Barry Song.  Expose
 additional anon THP stats to userspace for improved tuning.
 
 "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio
 conversions and removal of now-unused page-based APIs.
 
 "replace per-quota region priorities histogram buffer with per-context
 one" from SeongJae Park.  DAMON histogram rationalization.
 
 "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae
 Park.  DAMON documentation updates.
 
 "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve
 related doc and warn" from Jason Wang: fixes usage of page allocator
 __GFP_NOFAIL and GFP_ATOMIC flags.
 
 "mm: split underused THPs" from Yu Zhao.  Improve THP=always policy - this
 was overprovisioning THPs in sparsely accessed memory areas.
 
 "zram: introduce custom comp backends API" frm Sergey Senozhatsky.  Add
 support for zram run-time compression algorithm tuning.
 
 "mm: Care about shadow stack guard gap when getting an unmapped area" from
 Mark Brown.  Fix up the various arch_get_unmapped_area() implementations
 to better respect guard areas.
 
 "Improve mem_cgroup_iter()" from Kinsey Ho.  Improve the reliability of
 mem_cgroup_iter() and various code cleanups.
 
 "mm: Support huge pfnmaps" from Peter Xu.  Extends the usage of huge
 pfnmap support.
 
 "resource: Fix region_intersects() vs add_memory_driver_managed()" from
 Huang Ying.  Fix a bug in region_intersects() for systems with CXL memory.
 
 "mm: hwpoison: two more poison recovery" from Kefeng Wang.  Teaches a
 couple more code paths to correctly recover from the encountering of
 poisoned memry.
 
 "mm: enable large folios swap-in support" from Barry Song.  Support the
 swapin of mTHP memory into appropriately-sized folios, rather than into
 single-page folios.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu1BBwAKCRDdBJ7gKXxA
 jlWNAQDYlqQLun7bgsAN4sSvi27VUuWv1q70jlMXTfmjJAvQqwD/fBFVR6IOOiw7
 AkDbKWP2k0hWPiNJBGwoqxdHHx09Xgo=
 =s0T+
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Along with the usual shower of singleton patches, notable patch series
  in this pull request are:

   - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
     consistency to the APIs and behaviour of these two core allocation
     functions. This also simplifies/enables Rustification.

   - "Some cleanups for shmem" from Baolin Wang. No functional changes -
     mode code reuse, better function naming, logic simplifications.

   - "mm: some small page fault cleanups" from Josef Bacik. No
     functional changes - code cleanups only.

   - "Various memory tiering fixes" from Zi Yan. A small fix and a
     little cleanup.

   - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
     simplifications and .text shrinkage.

   - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
     Butt. This is a feature, it adds new feilds to /proc/vmstat such as

       $ grep kstack /proc/vmstat
       kstack_1k 3
       kstack_2k 188
       kstack_4k 11391
       kstack_8k 243
       kstack_16k 0

     which tells us that 11391 processes used 4k of stack while none at
     all used 16k. Useful for some system tuning things, but
     partivularly useful for "the dynamic kernel stack project".

   - "kmemleak: support for percpu memory leak detect" from Pavel
     Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory.

   - "mm: memcg: page counters optimizations" from Roman Gushchin. "3
     independent small optimizations of page counters".

   - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
     David Hildenbrand. Improves PTE/PMD splitlock detection, makes
     powerpc/8xx work correctly by design rather than by accident.

   - "mm: remove arch_make_page_accessible()" from David Hildenbrand.
     Some folio conversions which make arch_make_page_accessible()
     unneeded.

   - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
     Finkel. Cleans up and fixes our handling of the resetting of the
     cgroup/process peak-memory-use detector.

   - "Make core VMA operations internal and testable" from Lorenzo
     Stoakes. Rationalizaion and encapsulation of the VMA manipulation
     APIs. With a view to better enable testing of the VMA functions,
     even from a userspace-only harness.

   - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
     issues in the zswap global shrinker, resulting in improved
     performance.

   - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
     in some missing info in /proc/zoneinfo.

   - "mm: replace follow_page() by folio_walk" from David Hildenbrand.
     Code cleanups and rationalizations (conversion to folio_walk())
     resulting in the removal of follow_page().

   - "improving dynamic zswap shrinker protection scheme" from Nhat
     Pham. Some tuning to improve zswap's dynamic shrinker. Significant
     reductions in swapin and improvements in performance are shown.

   - "mm: Fix several issues with unaccepted memory" from Kirill
     Shutemov. Improvements to the new unaccepted memory feature,

   - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
     DAX PUDs. This was missing, although nobody seems to have notied
     yet.

   - "Introduce a store type enum for the Maple tree" from Sidhartha
     Kumar. Cleanups and modest performance improvements for the maple
     tree library code.

   - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
     more cgroup v1 remnants away from the v2 memcg code.

   - "memcg: initiate deprecation of v1 features" from Shakeel Butt.
     Adds various warnings telling users that memcg v1 features are
     deprecated.

   - "mm: swap: mTHP swap allocator base on swap cluster order" from
     Chris Li. Greatly improves the success rate of the mTHP swap
     allocation.

   - "mm: introduce numa_memblks" from Mike Rapoport. Moves various
     disparate per-arch implementations of numa_memblk code into generic
     code.

   - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
     improves the performance of munmap() of swap-filled ptes.

   - "support large folio swap-out and swap-in for shmem" from Baolin
     Wang. With this series we no longer split shmem large folios into
     simgle-page folios when swapping out shmem.

   - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
     performance improvements and code reductions for gigantic folios.

   - "support shmem mTHP collapse" from Baolin Wang. Adds support for
     khugepaged's collapsing of shmem mTHP folios.

   - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
     performance regression due to the addition of mseal().

   - "Increase the number of bits available in page_type" from Matthew
     Wilcox. Increases the number of bits available in page_type!

   - "Simplify the page flags a little" from Matthew Wilcox. Many legacy
     page flags are now folio flags, so the page-based flags and their
     accessors/mutators can be removed.

   - "mm: store zero pages to be swapped out in a bitmap" from Usama
     Arif. An optimization which permits us to avoid writing/reading
     zero-filled zswap pages to backing store.

   - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
     window which occurs when a MAP_FIXED operqtion is occurring during
     an unrelated vma tree walk.

   - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
     the vma_merge() functionality, making ot cleaner, more testable and
     better tested.

   - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
     Minor fixups of DAMON selftests and kunit tests.

   - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
     Code cleanups and folio conversions.

   - "Shmem mTHP controls and stats improvements" from Ryan Roberts.
     Cleanups for shmem controls and stats.

   - "mm: count the number of anonymous THPs per size" from Barry Song.
     Expose additional anon THP stats to userspace for improved tuning.

   - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
     folio conversions and removal of now-unused page-based APIs.

   - "replace per-quota region priorities histogram buffer with
     per-context one" from SeongJae Park. DAMON histogram
     rationalization.

   - "Docs/damon: update GitHub repo URLs and maintainer-profile" from
     SeongJae Park. DAMON documentation updates.

   - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
     improve related doc and warn" from Jason Wang: fixes usage of page
     allocator __GFP_NOFAIL and GFP_ATOMIC flags.

   - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
     This was overprovisioning THPs in sparsely accessed memory areas.

   - "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
     Add support for zram run-time compression algorithm tuning.

   - "mm: Care about shadow stack guard gap when getting an unmapped
     area" from Mark Brown. Fix up the various arch_get_unmapped_area()
     implementations to better respect guard areas.

   - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
     of mem_cgroup_iter() and various code cleanups.

   - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
     pfnmap support.

   - "resource: Fix region_intersects() vs add_memory_driver_managed()"
     from Huang Ying. Fix a bug in region_intersects() for systems with
     CXL memory.

   - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
     a couple more code paths to correctly recover from the encountering
     of poisoned memry.

   - "mm: enable large folios swap-in support" from Barry Song. Support
     the swapin of mTHP memory into appropriately-sized folios, rather
     than into single-page folios"

* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
  zram: free secondary algorithms names
  uprobes: turn xol_area->pages[2] into xol_area->page
  uprobes: introduce the global struct vm_special_mapping xol_mapping
  Revert "uprobes: use vm_special_mapping close() functionality"
  mm: support large folios swap-in for sync io devices
  mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
  mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
  mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
  set_memory: add __must_check to generic stubs
  mm/vma: return the exact errno in vms_gather_munmap_vmas()
  memcg: cleanup with !CONFIG_MEMCG_V1
  mm/show_mem.c: report alloc tags in human readable units
  mm: support poison recovery from copy_present_page()
  mm: support poison recovery from do_cow_fault()
  resource, kunit: add test case for region_intersects()
  resource: make alloc_free_mem_region() works for iomem_resource
  mm: z3fold: deprecate CONFIG_Z3FOLD
  vfio/pci: implement huge_fault support
  mm/arm64: support large pfn mappings
  mm/x86: support large pfn mappings
  ...
2024-09-21 07:29:05 -07:00
Heiko Carstens
b920aa77be s390/vdso: Wire up getrandom() vdso implementation
Provide the s390 specific vdso getrandom() architecture backend.

_vdso_rng_data required data is placed within the _vdso_data vvar page,
by using a hardcoded offset larger than vdso_data.

As required the chacha20 implementation does not write to the stack.

The implementation follows more or less the arm64 implementations and
makes use of vector instructions. It has a fallback to the getrandom()
system call for machines where the vector facility is not installed.

The check if the vector facility is installed, as well as an
optimization for machines with the vector-enhancements facility 2, is
implemented with alternatives, avoiding runtime checks.

Note that __kernel_getrandom() is implemented without the vdso user
wrapper which would setup a stack frame for odd cases (aka very old
glibc variants) where the caller has not done that. All callers of
__kernel_getrandom() are required to setup a stack frame, like the C ABI
requires it.

The vdso testcases vdso_test_getrandom and vdso_test_chacha pass.

Benchmark on a z16:

    $ ./vdso_test_getrandom bench-single
       vdso: 25000000 times in 0.493703559 seconds
    syscall: 25000000 times in 6.584025337 seconds

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2024-09-13 20:57:31 +02:00
Heiko Carstens
c1ae1b4ef5 s390/vdso: Move vdso symbol handling to separate header file
The vdso.h header file, which is included at many places, includes
generated header files. This can easily lead to recursive header file
inclusions if the vdso code is changed.

Therefore move the vdso symbol code, which requires the generated
header files, to a separate header file, and include it at the two
locations which require it.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2024-09-13 17:28:36 +02:00
Heiko Carstens
e10863fffe s390/vdso: Allow alternatives in vdso code
Implement the infrastructure required to allow alternatives in vdso code.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2024-09-13 17:28:36 +02:00
Heiko Carstens
013e984397 s390/alternatives: Remove ALT_FACILITY_EARLY
Patch all alternatives which depend on facilities from the decompressor.
There is no technical reason which enforces to split patching of such
alternatives to the decompressor and the kernel.

This simplifies alternative handling a bit, since one alternative type is
removed.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2024-09-13 17:28:36 +02:00
Finn Callies
9fed8d7c46 s390/crypto: Display Query and Query Authentication Information in sysfs
Displays the query (fc=0) and query authentication information (fc=127)
as binary in sysfs per CPACF instruction. Files are located in
/sys/devices/system/cpu/cpacf/. These information can be fetched via
asm already except for PCKMO because this instruction is privileged. To
offer a unified interface all CPACF instructions will have this
information displayed in sysfs in files <instruction>_query_raw and
<instruction>_query_auth_info_raw.

A new tool introduced into s390-tools called cpacfinfo will use this
information to convert and display in human readable form.

Suggested-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Finn Callies <fcallies@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-12 14:13:27 +02:00
Jens Remus
ab22f8d908 s390/disassembler: Remove duplicate instruction format RSY_RDRU
Instruction format RSY_RDRU is a duplicate of RSY_RURD2. Use the latter,
as it follows the s390-specific conventions for instruction format
naming used in binutils.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-12 14:13:27 +02:00
Heiko Carstens
ebcc369f18 s390: Use MARCH_HAS_*_FEATURES defines
Replace CONFIG_HAVE_MARCH_*_FEATURES with MARCH_HAS_*_FEATURES
everywhere so code gets compiled correctly depending on if the
target is the kernel or the decompressor.

Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-09-07 17:12:42 +02:00
Thomas Richter
0114009953 s390/pai_ext: Update PAI extension 1 counters
Update the internal array of PAI extension 1 NNPA
counter string table to support specialized processor
instrumentation assist instructions.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-09-05 15:17:23 +02:00
Thomas Richter
4ae48555d0 s390/pai_crypto: Add support for MSA 10 and 11 pai counters
Update the internal array of PAI crypto counter string
table with new counters supported with Message Security
Assist extension (MSA) 10 and MSA 11.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Finn Callies <fcallies@linux.ibm.com>
Tested-by: Finn Callies <fcallies@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-09-05 15:17:23 +02:00
Mike Rapoport (Microsoft)
46bcce5031 arch, mm: move definition of node_data to generic code
Every architecture that supports NUMA defines node_data in the same way:

	struct pglist_data *node_data[MAX_NUMNODES];

No reason to keep multiple copies of this definition and its forward
declarations, especially when such forward declaration is the only thing
in include/asm/mmzone.h for many architectures.

Add definition and declaration of node_data to generic code and drop
architecture-specific versions.

Link: https://lkml.kernel.org/r/20240807064110.1003856-8-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Rob Herring (Arm) <robh@kernel.org>
Cc: Samuel Holland <samuel.holland@sifive.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03 21:15:28 -07:00
David Hildenbrand
85a7e5432d s390/uv: convert gmap_destroy_page() from follow_page() to folio_walk
Let's get rid of another follow_page() user and perform the UV calls under
PTL -- which likely should be fine.

No need for an additional reference while holding the PTL:
uv_destroy_folio() and uv_convert_from_secure_folio() raise the refcount,
so any concurrent make_folio_secure() would see an unexpted reference and
cannot set PG_arch_1 concurrently.

Do we really need a writable PTE?  Likely yes, because the "destroy" part
is, in comparison to the export, a destructive operation.  So we'll keep
the writability check for now.

We'll lose the secretmem check from follow_page().  Likely we don't care
about that here.

Link: https://lkml.kernel.org/r/20240802155524.517137-9-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:26:01 -07:00
David Hildenbrand
3290ef3c7f s390/uv: drop arch_make_page_accessible()
All code was converted to using arch_make_folio_accessible(), let's drop
arch_make_page_accessible().

Link: https://lkml.kernel.org/r/20240729183844.388481-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:25:53 -07:00
Mete Durlu
ea31f1c6b4 s390/hiperdispatch: Add hiperdispatch debug counters
Add three counters to follow and understand hiperdispatch behavior;
* adjustment_count (amount of capacity adjustments triggered)
* greedy_time_ms (time spent while all cpus are on high capacity)
* conservative_time_ms (time spent while only entitled cpus are on high
  capacity)

These counters can be found under /sys/kernel/debug/s390/hiperdispatch/
Time counters are in <msec> format and only cover the time spent
when hiperdispatch is active.

Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:35 +02:00
Mete Durlu
441cc6f5b6 s390/hiperdispatch: Add hiperdispatch debug attributes
Add two attributes for debug purposes. They can be found under;
/sys/devices/system/cpu/hiperdispatch/
* hd_stime_threshold : allows user to adjust steal time threshold
* hd_delay_factor : allows user to adjust delay factor of hiperdispatch
                    work (after topology updates, delayed work is
                    always delayed extra by this factor)

hd_stime_threshold can have values between 0-100 as it represents a
percentage value.
hd_delay_factor can have values greater than 1. It is multiplied with
the default delay to achieve a longer interval, pushing back the next
hiperdispatch adjustment after a topology update.

Ex:
if delay interval is 250ms and the delay factor is 4;
delayed interval is now 1000ms(1sec). After each capacity adjustment
or topology change, work has a delayed interval of 1 sec for one
interval.

Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:35 +02:00
Mete Durlu
b9271a5334 s390/hiperdispatch: Add hiperdispatch sysctl interface
Expose hiperdispatch controls via sysctl. The user can now toggle
hiperdispatch via assigning 0 or 1 to s390.hiperdispatch attribute.
When hiperdipatch is toggled on, it tries to adjust CPU capacities,
while system is in vertical polarization to gain performance benefits
from different CPU polarizations. Disabling hiperdispatch reverts the
CPU capacities to their default (HIGH_CAPACITY) and stops the dynamic
adjustments.

Introduce a kconfig option HIPERDISPATCH_ON which allows users to
use hiperdispatch by default on vertical polarization. Using the
sysctl attribute s390.hiperdispatch would overwrite this behavior.

Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:35 +02:00
Mete Durlu
1e5aa12d47 s390/hiperdispatch: Add trace events
Add trace events to debug hiperdispatch behavior and track domain
rebuilding. Two events provide information about the decision making of
hiperdispatch and the adjustments made.

Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Co-developed-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:35 +02:00
Mete Durlu
c0d4ba054f s390/hiperdispatch: Add steal time averaging
The measurements done by hiperdispatch can have sudden spikes and dips
during run time. To prevent these outliers effecting the decision making
process and causing adjustment overhead, use weighted average of the
steal time.

Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Co-developed-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:35 +02:00
Mete Durlu
6843d6d97c s390/hiperdispatch: Introduce hiperdispatch
When LPAR is in vertical polarization, CPUs get different polarization
values, namely vertical high, vertical medium and vertical low. These
values represent the likelyhood of the CPU getting physical runtime.
Vertical high CPUs will always get runtime and others get varying
runtime depending on the load the CEC is under.

Vertical high and vertical medium CPUs are considered the CPUs which the
current LPAR has the entitlement to run on. The vertical lows are on the
other hand are borrowed CPUs which would only be given to the LPAR by
hipervisor when the other LPARs are not utilizing them.

Using the CPU capacities, hint linux scheduler when it should prioritise
vertical high and vertical medium CPUs over vertical low CPUs.
By tracking various system statistics hiperdispatch determines when to
adjust cpu capacities.
After each adjustment, rebuilding of scheduler domains is necessary to
notify the scheduler about capacity changes but since this operation is
costly it should be done as sparsely as possible.

Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Co-developed-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:35 +02:00
Mete Durlu
26ceef523d s390/smp: Add cpu capacities
Linux scheduler allows architectures to assign capacity values to
individual CPUs. This hints scheduler the performance difference between
CPUs and allows more efficient task distribution them. Implement
helper methods to set and get CPU capacities for s390. This is
particularly helpful in vertical polarization configurations of LPARs.
On vertical polarization an LPARs CPUs can get different polarization
values depending on the CEC configuration. CPUs with different
polarization values can perform different from each other, using CPU
capacities this can be reflected to linux scheduler.

Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:35 +02:00
Tobias Huschle
7e627f8193 s390/topology: Add config option to switch to vertical during boot
By default, all systems on s390 start in horizontal cpu polarization.
Selecting the new config option SCHED_TOPOLOGY_VERTICAL allows to build
a kernel that switches to vertical polarization during boot.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:35 +02:00
Tobias Huschle
9dd333e7af s390/topology: Add sysctl handler for polarization
Provide an additional path to set the polarization of the system, such
that a user no longer relies on the sysfs interface only and is able
configure the polarization for every reboot via sysctl control files.

The new sysctl can be set as follows:
 - s390.polarization=0 for horizontal polarization
 - s390.polarization=1 for vertical polarization

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:35 +02:00
Tobias Huschle
307b675cf0 s390/wti: Add debugfs file to display missed grace periods per cpu
Introduce a new debug file which allows to determine how many warning
track grace periods were missed on each CPU.
The new file can be found as /sys/kernel/debug/s390/wti

It is formatted as:
       CPU0       CPU1   [...]    CPUx
        xyz        xyz   [...]     xyz

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:35 +02:00
Tobias Huschle
42419bcdfd s390/wti: Add wti accounting for missed grace periods
A virtual CPU that has received a warning-track interrupt may fail to
acknowledge the interrupt within the warning-track grace period.
While this is usually not a problem, it will become necessary to
investigate if there is a large number of such missed warning-track
interrupts. Therefore, it is necessary to track these events.
The information is tracked through the s390 debug facility and can be
found under /sys/kernel/debug/s390dbf/wti/.

The hex_ascii output is formatted as:
 <pid> <symbol>

The values pid and current psw are collected when a warning track
interrupt is received. Symbol is either the kernel symbol matching the
collected psw or redacted to <user> when running in user space.

Each line represents the currently executing process when a warning
track interrupt was received which was then not acknowledged within its
grace period.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:34 +02:00
Tobias Huschle
cafeff5a03 s390/wti: Prepare graceful CPU pre-emption on wti reception
When a warning track interrupt is received, the kernel has only a very
limited amount of time to make sure, that the CPU can be yielded as
gracefully as possible before being pre-empted by the hypervisor.

The interrupt handler for the wti therefore unparks a kernel thread
which has being created on boot re-using the CPU hotplug kernel thread
infrastructure. These threads exist per CPU and are assigned the
highest possible real-time priority. This makes sure, that said threads
will execute as soon as possible as the scheduler should pre-empt any
other running user tasks to run the real-time thread.

Furthermore, the interrupt handler disables all I/O interrupts to
prevent additional interrupt processing on the soon-preempted CPU.
Interrupt handlers are likely to take kernel locks, which in the worst
case, will be kept while the interrupt handler is pre-empted from itself
underlying physical CPU. In that case, all tasks or interrupt handlers
on other CPUs would have to wait for the pre-empted CPU being dispatched
again. By preventing further interrupt processing, this risk is
minimized.

Once the CPU gets dispatched again, the real-time kernel thread regains
control, reenables interrupts and parks itself again.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:34 +02:00
Tobias Huschle
2c6c9ccc76 s390/wti: Introduce infrastructure for warning track interrupt
The warning-track interrupt (wti) provides a notification that the
receiving CPU will be pre-empted from its physical CPU within a short
time frame. This time frame is called grace period and depends on the
machine type. Giving up the CPU on time may prevent a task to get stuck
while holding a resource.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:34 +02:00
Vasily Gorbik
36dff49b96 s390/ftrace: Avoid extra serialization for graph caller patching
The only context where ftrace_enable_ftrace_graph_caller()
or ftrace_disable_ftrace_graph_caller() is called also calls
ftrace_arch_code_modify_post_process(), which already performs
text_poke_sync_lock().

ftrace_run_update_code()
	arch_ftrace_update_code()
		ftrace_modify_all_code()
			ftrace_enable_ftrace_graph_caller()/ftrace_disable_ftrace_graph_caller()
	ftrace_arch_code_modify_post_process()
		text_poke_sync_lock()

Remove the redundant serialization.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:34 +02:00
Vasily Gorbik
5200614080 s390/ftrace: Use get/copy_from_kernel_nofault consistently
Use get/copy_from_kernel_nofault to access the kernel text consistently.
Replace memcmp() in ftrace_init_nop() to ensure that in case of
inconsistencies in the 'mcount' table, the kernel reports a failure
instead of potentially crashing.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:34 +02:00
Vasily Gorbik
efd9cd019e s390/ftrace: Avoid trampolines if possible
When a sequential instruction fetching facility is present, it is safe
to patch ftrace NOPs in function prologues. All of them are 8-byte
aligned.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:34 +02:00
Vasily Gorbik
30799152c3 s390/kprobes: Avoid stop machine if possible
Avoid stop machine on kprobes arm/disarm when sequential instruction
fetching is present.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:34 +02:00
Vasily Gorbik
bb91ed0ee3 s390/setup: Recognize sequential instruction fetching facility
When sequential instruction fetching facility is present,
certain guarantees are provided for code patching. In particular,
atomic overwrites within 8 aligned bytes is safe from an
instruction-fetching point of view.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:34 +02:00
Sven Schnelle
ee3daf7c05 s390/entry: Unify save_area_sync and save_area_async
In the past two save areas existed because interrupt handlers
and system call / program check handlers where entered with
interrupts enabled. To prevent a handler from overwriting the
save areas from the previous handler, interrupts used the async
save area, while system call and program check handler used the
sync save area.

Since the removal of critical section cleanup from entry.S, handlers are
entered with interrupts disabled. When the interrupts are re-enabled,
the save area is no longer need. Therefore merge both save areas into one.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:34 +02:00
Vasily Gorbik
acb684d3b0 s390/disassembler: Add instructions
Add more instructions to the kernel disassembler.

Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:33 +02:00
Jens Remus
73c81973b4 s390/disassembler: Use proper format specifiers for operand values
Treat register numbers as unsigned. Treat signed operand values as
signed.

This resolves multiple instances of the Cppcheck warning:

warning: %i in format string (no. 1) requires 'int' but the argument
  type is 'unsigned int'. [invalidPrintfArgType_sint]

Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:33 +02:00
Vasily Gorbik
a84dd0d8ae s390/ftrace: Avoid calling unwinder in ftrace_return_address()
ftrace_return_address() is called extremely often from
performance-critical code paths when debugging features like
CONFIG_TRACE_IRQFLAGS are enabled. For example, with debug_defconfig,
ftrace selftests on my LPAR currently execute ftrace_return_address()
as follows:

ftrace_return_address(0) - 0 times (common code uses __builtin_return_address(0) instead)
ftrace_return_address(1) - 2,986,805,401 times (with this patch applied)
ftrace_return_address(2) - 140 times
ftrace_return_address(>2) - 0 times

The use of __builtin_return_address(n) was replaced by return_address()
with an unwinder call by commit cae74ba8c2 ("s390/ftrace:
Use unwinder instead of __builtin_return_address()") because
__builtin_return_address(n) simply walks the stack backchain and doesn't
check for reaching the stack top. For shallow stacks with fewer than
"n" frames, this results in reads at low addresses and random
memory accesses.

While calling the fully functional unwinder "works", it is very slow
for this purpose. Moreover, potentially following stack switches and
walking past IRQ context is simply wrong thing to do for
ftrace_return_address().

Reimplement return_address() to essentially be __builtin_return_address(n)
with checks for reaching the stack top. Since the ftrace_return_address(n)
argument is always a constant, keep the implementation in the header,
allowing both GCC and Clang to unroll the loop and optimize it to the
bare minimum.

Fixes: cae74ba8c2 ("s390/ftrace: Use unwinder instead of __builtin_return_address()")
Cc: stable@vger.kernel.org
Reported-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-27 20:16:48 +02:00
Vasily Gorbik
d759be2823 s390/ftrace: Use kernel ftrace trampoline for modules
Now that both the kernel modules area and the kernel image itself are
located within 4 GB, there is no longer a need to maintain a separate
ftrace_plt trampoline. Use the existing trampoline in the kernel.

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-27 20:16:48 +02:00
Vasily Gorbik
017f1f0d39 s390/ftrace: Remove unused ftrace_plt_template*
Unused since commit b860b9346e ("s390/ftrace: remove dead code").

Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-27 20:16:48 +02:00
Heiko Carstens
6708948e36 s390/early: Dump register contents and call trace for early crashes
If the early program check handler cannot resolve a program check dump
register contents and a call trace to the console before loading a disabled
wait psw. This makes debugging much easier.

Emit an extra message with early_printk() for cases where regular printk()
via the early console is not yet working so that at least some information
is available.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-22 19:28:11 +02:00
Heiko Carstens
0bc6a69f5f s390/early: Add __init to __do_early_pgm_check()
__do_early_pgm_check() is a function which is only needed during early
setup code. Mark it __init in order to save a few bytes.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-22 19:28:11 +02:00
Thomas Richter
b495e71015 s390/cpum_sf: Remove WARN_ON_ONCE statements
Remove WARN_ON_ONCE statements. These have not triggered in the
past.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-22 19:28:11 +02:00
Thomas Richter
14a34130e0 s390/cpum_sf: Rework debug_sprintf_event() messages
Rework debug messages:
 - Remove most of the debug_sprintf_event() invocations.
 - Do not split string format statements
 - Remove colon after function name.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-22 19:28:10 +02:00
Alexander Gordeev
1642285e51 s390/boot: Fix KASLR base offset off by __START_KERNEL bytes
Symbol offsets to the KASLR base do not match symbol address in
the vmlinux image. That is the result of setting the KASLR base
to the beginning of .text section as result of an optimization.

Revert that optimization and allocate virtual memory for the
whole kernel image including __START_KERNEL bytes as per the
linker script. That allows keeping the semantics of the KASLR
base offset in sync with other architectures.

Rename __START_KERNEL to TEXT_OFFSET, since it represents the
offset of the .text section within the kernel image, rather than
a virtual address.

Still skip mapping TEXT_OFFSET bytes to save memory on pgtables
and provoke exceptions in case an attempt to access this area is
made, as no kernel symbol may reside there.

In case CONFIG_KASAN is enabled the location counter might exceed
the value of TEXT_OFFSET, while the decompressor linker script
forcefully resets it to TEXT_OFFSET, which leads to a sections
overlap link failure. Use MAX() expression to avoid that.

Reported-by: Omar Sandoval <osandov@osandov.com>
Closes: https://lore.kernel.org/linux-s390/ZnS8dycxhtXBZVky@telecaster.dhcp.thefacebook.com/
Fixes: 56b1069c40 ("s390/boot: Rework deployment of the kernel image")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-22 19:24:13 +02:00
Thomas Richter
6d9a732d8a s390/cpum_sf: Ignore qsi() return code
qsi() executes the instruction qsi (query sample information)
and stores the result of the query in a sample information block
pointed to by the function argument. The instruction does not
change the condition code register. The return code is always
zero. No need to check for errors. Remove now unreferenced
macros PMC_FAILURE and RS_INIT_FAILURE_QSI.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-21 16:17:01 +02:00
Thomas Richter
742a755716 s390/cpum_sf: Ignore lsctl() return code in sf_disable()
sf_disable() returns the condition code of instruction lsctl (load
sampling controls). However the parameter to lsctl() in
sf_disable() is a sample control block containing
all zeroes. This invocation of lsctl() does not fail and returns
always zero even when there is no authorization for sampling
on the machine. In short, sampling can be always turned off.
Ignore the return code of sf_disable() and change the function
return to void.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-21 16:17:01 +02:00
Alexander Gordeev
a3ca27c405 s390/mm: Prevent lowcore vs identity mapping overlap
The identity mapping position in virtual memory is randomized
together with the kernel mapping. That position can never
overlap with the lowcore even when the lowcore is relocated.

Prevent overlapping with the lowcore to allow independent
positioning of the identity mapping. With the current value
of the alternative lowcore address of 0x70000 the overlap
could happen in case the identity mapping is placed at zero.

This is a prerequisite for uncoupling of randomization base
of kernel image and identity mapping in virtual memory.

Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-21 16:14:45 +02:00
Jann Horn
92a10d3861 runtime constants: move list of constants to vmlinux.lds.h
Refactor the list of constant variables into a macro.
This should make it easier to add more constants in the future.

Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-08-19 09:48:14 +02:00
Heiko Carstens
85878ff1b3 s390/entry: Move early_pgm_check_handler() to init text section
Save some bytes and move early_pgm_check_handler() to init text
section.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-07 20:52:53 +02:00
Heiko Carstens
3c4d0ae067 s390/traps: Handle early warnings gracefully
Add missing warning handling to the early program check handler. This
way a warning is printed to the console as soon as the early console
is setup, and the kernel continues to boot.

Before this change a disabled wait psw was loaded instead and the
machine was silently stopped without giving an idea about what
happened.

Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-07 20:52:53 +02:00
Heiko Carstens
f101b305a7 s390/entry: Make early program check handler relocated lowcore aware
Add the missing pieces so the early program check handler also works
with a relocated lowcore. Right now the result of an early program
check in case of a relocated lowcore would be a program check loop.

Fixes: 8f1e70adb1 ("s390/boot: Add cmdline option to relocate lowcore")
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-07 20:52:53 +02:00
Heiko Carstens
f2bb5b97b5 s390/entry: Move early program check handler to entry.S
Have all program check handlers in one file to make future changes easy.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-07 20:52:53 +02:00
Thomas Richter
e09e58f425 s390/cpum_sf: Use variable name cpuhw consistently
All functions but setup_pmc_cpu() use a local variable named
cpuhw to refer to struct cpu_hw_sf.
In setup_pmc_cpu() rename variable cpusf to cpuhw. This makes
the naming scheme consistent with all other functions.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-07 20:52:53 +02:00
Thomas Richter
6bc565a99e s390/cpum_sf: Define and initialize variable
Define and initialize a variable in one place.
Remove space between cast and variable.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-07 20:52:53 +02:00
Thomas Richter
b201828290 s390/cpum_sf: Use hwc as variable consistently
In hw_perf_event_update() and cpumsf_pmu_enable() use variable hwc
consistently to access event's hardware related data.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-07 20:52:53 +02:00
Thomas Richter
d4559eabc1 s390/cpum_cf: Move defines from header file to source file
The macros PERF_CPUM_CF_MAX_CTR and PERF_EVENT_CPUM_CF_DIAG
are used in only one source file arch/s390/kernel/perf_cpum_cf.c.
Move these defines from the header file
arch/s390/include/asm/perf_event.h to the only user.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-07 20:52:53 +02:00
Thomas Richter
52d6ef92a4 s390/cpum_sf: Move defines from header file to source file
Some defines in common header file arch/s390/include/asm/perf_event.h
are only used in one source file arch/s390/kernel/perf_cpum_sf.c.
Move these defines from header to source file.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-07 20:52:53 +02:00
Thomas Richter
501cab2b1d s390/cpum_sf: Rename macro to consistent prefix
Rename macro SAMPLE_FREQ_MODE to SAMPL_FREQ_MODE to make its
prefix consistent with all other macro starting with prefix
SAMPL_XXX.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-07 20:52:52 +02:00
Thomas Richter
ea95be4bda s390/cpum_sf: Remove unused defines REG_NONE and REG_OVERFLOW
Member hw_perf_event::reg.reg is set but never used, so remove it.
Defines REG_NONE and REG_OVERFLOW are not referenced anymore.
The initialization to zero takes place in function
perf_event_alloc() where
  ...
  event = kmem_cache_alloc_node(perf_event_cache,
		                GFP_KERNEL | __GFP_ZERO, node);
  ...

makes sure memory allocated for the event is zero'ed.
This is done in the kernel's common code in kernel/events/core.c
The struct perf_event contains member hw_perf_event as in

  struct perf_event {
   ....
   struct hw_perf_event hw;
   ....
  };

This contained sub-structure is also initialized to zero.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-07 20:52:52 +02:00
Thomas Richter
8a6fe8f21e s390/cpum_sf: Use refcount_t instead of atomic_t
Replace atomic_t by refcount_t for reference counting of events.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-07 20:52:52 +02:00
Heiko Carstens
75c10d5377 s390/vmlinux.lds.S: Move ro_after_init section behind rodata section
The .data.rel.ro and .got section were added between the rodata and
ro_after_init data section, which adds an RW mapping in between all RO
mapping of the kernel image:

---[ Kernel Image Start ]---
0x000003ffe0000000-0x000003ffe0e00000        14M PMD RO X
0x000003ffe0e00000-0x000003ffe0ec7000       796K PTE RO X
0x000003ffe0ec7000-0x000003ffe0f00000       228K PTE RO NX
0x000003ffe0f00000-0x000003ffe1300000         4M PMD RO NX
0x000003ffe1300000-0x000003ffe1331000       196K PTE RO NX
0x000003ffe1331000-0x000003ffe13b3000       520K PTE RW NX <---
0x000003ffe13b3000-0x000003ffe13d5000       136K PTE RO NX
0x000003ffe13d5000-0x000003ffe1400000       172K PTE RW NX
0x000003ffe1400000-0x000003ffe1500000         1M PMD RW NX
0x000003ffe1500000-0x000003ffe1700000         2M PTE RW NX
0x000003ffe1700000-0x000003ffe1800000         1M PMD RW NX
0x000003ffe1800000-0x000003ffe187e000       504K PTE RW NX
---[ Kernel Image End ]---

Move the ro_after_init data section again right behind the rodata
section to prevent interleaving RO and RW mappings:

---[ Kernel Image Start ]---
0x000003ffe0000000-0x000003ffe0e00000        14M PMD RO X
0x000003ffe0e00000-0x000003ffe0ec7000       796K PTE RO X
0x000003ffe0ec7000-0x000003ffe0f00000       228K PTE RO NX
0x000003ffe0f00000-0x000003ffe1300000         4M PMD RO NX
0x000003ffe1300000-0x000003ffe1353000       332K PTE RO NX
0x000003ffe1353000-0x000003ffe1400000       692K PTE RW NX
0x000003ffe1400000-0x000003ffe1500000         1M PMD RW NX
0x000003ffe1500000-0x000003ffe1700000         2M PTE RW NX
0x000003ffe1700000-0x000003ffe1800000         1M PMD RW NX
0x000003ffe1800000-0x000003ffe187e000       504K PTE RW NX
---[ Kernel Image End ]---

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-31 16:30:20 +02:00
Heiko Carstens
0a34c027a3 s390/alternatives: Remove unused empty header file
Remove the unused and empty arch/s390/kernel/alternative.h header
file which was added by mistake.

Fixes: 5ade5be4ed ("s390: Add infrastructure to patch lowcore accesses")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-31 16:30:20 +02:00
Heiko Carstens
4734406c39 s390/fpu: Re-add exception handling in load_fpu_state()
With the recent rewrite of the fpu code exception handling for the
lfpc instruction within load_fpu_state() was erroneously removed.

Add it again to prevent that loading invalid floating point register
values cause an unhandled specification exception.

Fixes: 8c09871a95 ("s390/fpu: limit save and restore to used registers")
Cc: stable@vger.kernel.org
Reported-by: Aristeu Rozanski <aris@redhat.com>
Tested-by: Aristeu Rozanski <aris@redhat.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-31 16:30:20 +02:00
Linus Torvalds
65ad409e63 more s390 updates for 6.11 merge window
- Fix KMSAN build breakage caused by the conflict between s390 and
   mm-stable trees
 
 - Add KMSAN page markers for ptdump
 
 - Add runtime constant support
 
 - Fix __pa/__va for modules under non-GPL licenses by exporting necessary
   vm_layout struct with EXPORT_SYMBOL to prevent linkage problems
 
 - Fix an endless loop in the CF_DIAG event stop in the CPU Measurement
   Counter Facility code when the counter set size is zero
 
 - Remove the PROTECTED_VIRTUALIZATION_GUEST config option and enable
   its functionality by default
 
 - Support allocation of multiple MSI interrupts per device and improve
   logging of architecture-specific limitations
 
 - Add support for lowcore relocation as a debugging feature to catch
   all null ptr dereferences in the kernel address space, improving
   detection beyond the current implementation's limited write access
   protection
 
 - Clean up and rework CPU alternatives to allow for callbacks and early
   patching for the lowcore relocation
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmajrGUACgkQjYWKoQLX
 FBgZjwf/X8suh1Gm2qO47hdGURusOKmEa6GjYaihKzOi2I5yAWVXGsAYX7QtXI4X
 fxbKyuGJIvq7LOrIojN1JOCPGkDRztgMddqDJI7WljuRiw6dcd4L5tbaPgCKEv3Q
 AQcoq+Aeg1L5xnuNFPdQXl6+Fy2lTFqJCkUl+uW05pGAn2R212dYG3HB41TpwOtJ
 Sv2R5+yD9TQCKnHyuCQqaGf7d6SQTcVeBj8zrqVmcyduNK+BYYMOwlJ/UTRzeZEX
 3DmQg/TdAkxXf0jZ+vrNILEfHlIvwDAhFjdoAXXL0TX4lx2cHLx9AiqNxYhUprsG
 0gutc/nLq2FxhqofoJ0z9TCdb1Ef7w==
 =skva
 -----END PGP SIGNATURE-----

Merge tag 's390-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull more s390 updates from Vasily Gorbik:

 - Fix KMSAN build breakage caused by the conflict between s390 and
   mm-stable trees

 - Add KMSAN page markers for ptdump

 - Add runtime constant support

 - Fix __pa/__va for modules under non-GPL licenses by exporting
   necessary vm_layout struct with EXPORT_SYMBOL to prevent linkage
   problems

 - Fix an endless loop in the CF_DIAG event stop in the CPU Measurement
   Counter Facility code when the counter set size is zero

 - Remove the PROTECTED_VIRTUALIZATION_GUEST config option and enable
   its functionality by default

 - Support allocation of multiple MSI interrupts per device and improve
   logging of architecture-specific limitations

 - Add support for lowcore relocation as a debugging feature to catch
   all null ptr dereferences in the kernel address space, improving
   detection beyond the current implementation's limited write access
   protection

 - Clean up and rework CPU alternatives to allow for callbacks and early
   patching for the lowcore relocation

* tag 's390-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits)
  s390: Remove protvirt and kvm config guards for uv code
  s390/boot: Add cmdline option to relocate lowcore
  s390/kdump: Make kdump ready for lowcore relocation
  s390/entry: Make system_call() ready for lowcore relocation
  s390/entry: Make ret_from_fork() ready for lowcore relocation
  s390/entry: Make __switch_to() ready for lowcore relocation
  s390/entry: Make restart_int_handler() ready for lowcore relocation
  s390/entry: Make mchk_int_handler() ready for lowcore relocation
  s390/entry: Make int handlers ready for lowcore relocation
  s390/entry: Make pgm_check_handler() ready for lowcore relocation
  s390/entry: Add base register to CHECK_VMAP_STACK/CHECK_STACK macro
  s390/entry: Add base register to SIEEXIT macro
  s390/entry: Add base register to MBEAR macro
  s390/entry: Make __sie64a() ready for lowcore relocation
  s390/head64: Make startup code ready for lowcore relocation
  s390: Add infrastructure to patch lowcore accesses
  s390/atomic_ops: Disable flag outputs constraint for GCC versions below 14.2.0
  s390/entry: Move SIE indicator flag to thread info
  s390/nmi: Simplify ptregs setup
  s390/alternatives: Remove alternative facility list
  ...
2024-07-26 10:47:53 -07:00
Joel Granados
78eb4ea25c sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.

This patch has been generated by the following coccinelle script:

```
  virtual patch

  @r1@
  identifier ctl, write, buffer, lenp, ppos;
  identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

  @r2@
  identifier func, ctl, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos)
  { ... }

  @r3@
  identifier func;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int , void *, size_t *, loff_t *);

  @r4@
  identifier func, ctl;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int , void *, size_t *, loff_t *);

  @r5@
  identifier func, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

```

* Code formatting was adjusted in xfs_sysctl.c to comply with code
  conventions. The xfs_stats_clear_proc_handler,
  xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
  adjusted.

* The ctl_table argument in proc_watchdog_common was const qualified.
  This is called from a proc_handler itself and is calling back into
  another proc_handler, making it necessary to change it as part of the
  proc_handler migration.

Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-07-24 20:59:29 +02:00
Janosch Frank
6dc2e98d5f s390: Remove protvirt and kvm config guards for uv code
Removing the CONFIG_PROTECTED_VIRTUALIZATION_GUEST ifdefs and config
option as well as CONFIG_KVM ifdefs in uv files.

Having this configurable has been more of a pain than a help.
It's time to remove the ifdefs and the config option.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:33 +02:00
Sven Schnelle
97cee3dd4a s390/kdump: Make kdump ready for lowcore relocation
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in store_status()
and __do_machine_kdump().

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:32 +02:00
Sven Schnelle
361f6ec2fe s390/entry: Make system_call() ready for lowcore relocation
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in system_call().

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:32 +02:00
Sven Schnelle
9b3dcae128 s390/entry: Make ret_from_fork() ready for lowcore relocation
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in ret_from_fork().

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:32 +02:00
Sven Schnelle
7cc86dee44 s390/entry: Make __switch_to() ready for lowcore relocation
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in __switch_to().

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:32 +02:00
Sven Schnelle
4064b71112 s390/entry: Make restart_int_handler() ready for lowcore relocation
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in restart_int_handler().

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:32 +02:00
Sven Schnelle
0001b7bbc5 s390/entry: Make mchk_int_handler() ready for lowcore relocation
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in mcck_int_handler().

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:32 +02:00
Sven Schnelle
bd2c55b307 s390/entry: Make int handlers ready for lowcore relocation
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in the ext/io interrupt
handlers.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:32 +02:00
Sven Schnelle
9e1e275fa2 s390/entry: Make pgm_check_handler() ready for lowcore relocation
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in pgm_check_handler().

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:32 +02:00
Sven Schnelle
86e08d64ee s390/entry: Add base register to CHECK_VMAP_STACK/CHECK_STACK macro
In preparation of having lowcore at different address than zero,
add the base register to CHECK_VMAP_STACK and CHECK_STACK. No
functional change, because %r0 is passed to the macro.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:32 +02:00
Sven Schnelle
6908f8f916 s390/entry: Add base register to SIEEXIT macro
In preparation of having lowcore at different address than zero,
add the base register to SIEEXIT. No functional change, because
%r0 is passed to the macro.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:32 +02:00
Sven Schnelle
ca2f0a26c4 s390/entry: Add base register to MBEAR macro
In preparation of having lowcore at different address than zero,
add the base register to MBEAR. No functional change, because
%r0 is passed to the macro.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:32 +02:00
Sven Schnelle
12184a4676 s390/entry: Make __sie64a() ready for lowcore relocation
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in __sie64a().

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:32 +02:00
Sven Schnelle
39e8c5d6a4 s390/head64: Make startup code ready for lowcore relocation
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in startup_continue().

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:32 +02:00
Sven Schnelle
5ade5be4ed s390: Add infrastructure to patch lowcore accesses
The s390 architecture defines two special per-CPU data pages
called the "prefix area". In s390-linux terminology this is usually
called "lowcore". This memory area contains system configuration
data like old/new PSW's for system call/interrupt/machine check
handlers and lots of other data. It is normally mapped to logical
address 0. This area can only be accessed when in supervisor mode.

This means that kernel code can dereference NULL pointers, because
accesses to address 0 are allowed. Parts of lowcore can be write
protected, but read accesses and write accesses outside of the write
protected areas are not caught.

To remove this limitation for debugging and testing, remap lowcore to
another address and define a function get_lowcore() which simply
returns the address where lowcore is mapped at. This would normally
introduce a pointer dereference (=memory read). As lowcore is used
for several very often used variables, add code to patch this function
during runtime, so we avoid the memory reads.

For C code get_lowcore() has to be used, for assembly code it is
the GET_LC macro. When using this macro/function a reference is added
to alternative patching. All these locations will be patched to the
actual lowcore location when the kernel is booted or a module is loaded.

To make debugging/bisecting problems easier, this patch adds all the
infrastructure but the lowcore address is still hardwired to 0. This
way the code can be converted on a per function basis, and the
functionality is enabled in a patch after all the functions have
been converted.

Note that this requires at least z16 because the old lpsw instruction
only allowed a 12 bit displacement. z16 introduced lpswey which allows
20 bits (signed), so the lowcore can effectively be mapped from
address 0 - 0x7e000. To use 0x7e000 as address, a 6 byte lgfi
instruction would have to be used in the alternative. To save two
bytes, llilh can be used, but this only allows to set bits 16-31 of
the address. In order to use the llilh instruction, use 0x70000 as
alternative lowcore address. This is still large enough to catch
NULL pointer dereferences into large arrays.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:32 +02:00
Heiko Carstens
fc8eac33ad s390/entry: Move SIE indicator flag to thread info
CIF_SIE indicates if a thread is running in SIE context. This is the
state of a thread and not the CPU. Therefore move this indicator to
thread info.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Heiko Carstens
213400c4af s390/nmi: Simplify ptregs setup
The low level machine check handler code fills the ptregs structure
partially with the register contents present at machine check handler
entry and partially with contents from the machine check save area.

In case of a machine check the contents of all general purpose
registers are saved by the CPU to the machine check save area.
Therefore simplify the code and fill the ptregs structure by only
using the machine check save area as source.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Heiko Carstens
beb8cee06f s390/alternatives: Remove alternative facility list
The alternative and the normal facility list are always identical. Remove
the alternative facility list, which allows to simplify the alternatives
code.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Heiko Carstens
47837a5c74 s390/nospec: Push down alternative handling
The nospec implementation is deeply integrated into the alternatives
code: only for nospec an alternative facility list is implemented and
used by the alternative code, while it is modified by nospec specific
needs.

Push down the nospec alternative handling into the nospec by
introducing a new alternative type and a specific nospec callback to
decide if alternatives should be applied.

Also introduce a new global nobp variable which together with facility
82 can be used to decide if nobp is enabled or not.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Sven Schnelle
7f9d85998f s390/alternatives: Allow early alternative patching in decompressor
Add the required code to patch alternatives early in the decompressor.
This is required for the upcoming lowcore relocation changes, where
alternatives for facility 193 need to get patched before lowcore
alternatives.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Co-developed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Heiko Carstens
b3e0c5f734 s390/alternatives: Rework to allow for callbacks
Rework alternatives to allow for callbacks. With this every
alternative entry has additional data encoded:

- When (aka context) an alternative is supposed to be applied

- The type of an alternative, which allows for type specific handling
  and callbacks

- Extra type specific payload (patch information), which can be passed
  to callbacks in order to decide if an alternative should be applied
  or not

With this only the "late" context is implemented, which means there is
no change to the previous behaviour. All code is just converted to the
more generic new infrastructure.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Heiko Carstens
ace76fac94 s390/alternatives: Move text sync functions
Move all text sync functions from alternative.c to processor.c. This
way there is only minimal code left in alternative.c left, which is a
prerequisite to use the C file within boot code as well.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Heiko Carstens
c77f7354c4 s390/alternatives: Merge both alternative header files
The two alternative header files must stay in sync. This is easier to
achieve within one header file. Therefore merge both of them and have
only one file, like most other architectures.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Heiko Carstens
9be999a612 s390/alternatives: Use consistent naming
The alternative code is using the words facility and feature for the
same. Rename facility to more generic feature everywhere to have
consistent naming.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Sven Schnelle
035248a784 s390/alternatives: Remove noaltinstr option
The current Kernel doesn't boot without alternative patching on
z16 machines. To avoid such bugs in the future, remove the option
disable alternative patching.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Sven Schnelle
d3604ffba1 s390: Move CIF flags to struct pcpu
To allow testing flags for offline CPUs, move the CIF flags
to struct pcpu. To avoid having to calculate the array index
for each access, add a pointer to the pcpu member for the current
cpu to lowcore.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Sven Schnelle
90fc5ac282 s390/smp: Switch pcpu_devices to percpu
In preparation of moving the CIF flags from lowcore to pcpu_devices,
convert the pcpu_devices array to use the percpu infrastructure.
This is required because using the pcpu_devices array as it is would
introduce a performance penalty due to the fact that CPU flags for
multiple CPUs would end up in the same cacheline.

Note that a pointer to the pcpu struct of the IPL CPU is still required.
This is because a restart interrupt can be triggered on an offline CPU.
s390 stores the percpu offset in lowcore, but offline CPUs have no
lowcore area allocated. So percpu data cannot be used from an offline
CPU and we need to get the pcpu pointer for the IPL cpu from somewhere
else.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Sven Schnelle
a795eeaf85 s390/smp: Handle restart interrupt on ipl cpu
The current smp code allows to trigger a restart interrupt on CPUs
offline in linux. To allow using the percpu infrastructure instead
of the pcpu_devices array, switch to the ipl cpu which is always
online before calling do_restart().

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Thomas Richter
e6ce1f12d7 s390/cpum_cf: Fix endless loop in CF_DIAG event stop
Event CF_DIAG reads out complete counter sets using stcctm
instruction. This is done at event start time when the process
starts execution and at event stop time when the process is
removed from the CPU. During removal the difference of each
counter in the counter sets is calculated and saved as raw data
in the ring buffer. This works fine unless the number of counters
in a counter set is zero. This may happen for the extended counter
set. This set is machine specific and the size of the counter
set can be zero even when extended counter set is authorized for
read access.

This case is not handled. cfdiag_diffctr() checks authorization
of the extended counter set. If true the functions assumes
the extended counter set has been saved in a data buffer. However
this is not the case, cfdiag_getctrset() does not save a counter
set with counter set size of zero. This mismatch causes an endless
loop in the counter set readout during event stop handling.

The calculation of the difference of the counters in each counter
now verifies the size of the counter set is non-zero. A counter set
with size zero is skipped.

Fixes: a029a4eab3 ("s390/cpumf: Allow concurrent access for CPU Measurement Counter Facility")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:30 +02:00
Vasily Gorbik
e188e5d5ff s390/setup: Fix __pa/__va for modules under non-GPL licenses
The struct vm_layout contains fields used in __pa/__va calculations. Such
fundamental things have to be exported with EXPORT_SYMBOL to avoid
breakages of out-of-tree modules under non-GPL licenses.

Fixes: 7de0446f0b ("s390/boot: Make identity mapping base address explicit")
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 15:54:58 +02:00
Heiko Carstens
57e29f4c89 s390: Add runtime constant support
Implement the runtime constant infrastructure for s390, allowing the
dcache d_hash() function to be generated using as a constant for hash
table address followed by shift by a constant of the hash index.

This is the s390 variant of commit 94a2bc0f61 ("arm64: add 'runtime
constant' support") and commit e3c92e8171 ("runtime constants: add
x86 architecture support").

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 15:54:58 +02:00
Linus Torvalds
fbc90c042c - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN
walkers") is known to cause a performance regression
   (https://lore.kernel.org/all/3acefad9-96e5-4681-8014-827d6be71c7a@linux.ibm.com/T/#mfa809800a7862fb5bdf834c6f71a3a5113eb83ff).
   Yu has a fix which I'll send along later via the hotfixes branch.
 
 - In the series "mm: Avoid possible overflows in dirty throttling" Jan
   Kara addresses a couple of issues in the writeback throttling code.
   These fixes are also targetted at -stable kernels.
 
 - Ryusuke Konishi's series "nilfs2: fix potential issues related to
   reserved inodes" does that.  This should actually be in the
   mm-nonmm-stable tree, along with the many other nilfs2 patches.  My bad.
 
 - More folio conversions from Kefeng Wang in the series "mm: convert to
   folio_alloc_mpol()"
 
 - Kemeng Shi has sent some cleanups to the writeback code in the series
   "Add helper functions to remove repeated code and improve readability of
   cgroup writeback"
 
 - Kairui Song has made the swap code a little smaller and a little
   faster in the series "mm/swap: clean up and optimize swap cache index".
 
 - In the series "mm/memory: cleanly support zeropage in
   vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
   Hildenbrand has reworked the rather sketchy handling of the use of the
   zeropage in MAP_SHARED mappings.  I don't see any runtime effects here -
   more a cleanup/understandability/maintainablity thing.
 
 - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of
   higher addresses, for aarch64.  The (poorly named) series is
   "Restructure va_high_addr_switch".
 
 - The core TLB handling code gets some cleanups and possible slight
   optimizations in Bang Li's series "Add update_mmu_tlb_range() to
   simplify code".
 
 - Jane Chu has improved the handling of our
   fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the
   series "Enhance soft hwpoison handling and injection".
 
 - Jeff Johnson has sent a billion patches everywhere to add
   MODULE_DESCRIPTION() to everything.  Some landed in this pull.
 
 - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has
   simplified migration's use of hardware-offload memory copying.
 
 - Yosry Ahmed performs more folio API conversions in his series "mm:
   zswap: trivial folio conversions".
 
 - In the series "large folios swap-in: handle refault cases first",
   Chuanhua Han inches us forward in the handling of large pages in the
   swap code.  This is a cleanup and optimization, working toward the end
   objective of full support of large folio swapin/out.
 
 - In the series "mm,swap: cleanup VMA based swap readahead window
   calculation", Huang Ying has contributed some cleanups and a possible
   fixlet to his VMA based swap readahead code.
 
 - In the series "add mTHP support for anonymous shmem" Baolin Wang has
   taught anonymous shmem mappings to use multisize THP.  By default this
   is a no-op - users must opt in vis sysfs controls.  Dramatic
   improvements in pagefault latency are realized.
 
 - David Hildenbrand has some cleanups to our remaining use of
   page_mapcount() in the series "fs/proc: move page_mapcount() to
   fs/proc/internal.h".
 
 - David also has some highmem accounting cleanups in the series
   "mm/highmem: don't track highmem pages manually".
 
 - Build-time fixes and cleanups from John Hubbard in the series
   "cleanups, fixes, and progress towards avoiding "make headers"".
 
 - Cleanups and consolidation of the core pagemap handling from Barry
   Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
   and utilize them".
 
 - Lance Yang's series "Reclaim lazyfree THP without splitting" has
   reduced the latency of the reclaim of pmd-mapped THPs under fairly
   common circumstances.  A 10x speedup is seen in a microbenchmark.
 
   It does this by punting to aother CPU but I guess that's a win unless
   all CPUs are pegged.
 
 - hugetlb_cgroup cleanups from Xiu Jianfeng in the series
   "mm/hugetlb_cgroup: rework on cftypes".
 
 - Miaohe Lin's series "Some cleanups for memory-failure" does just that
   thing.
 
 - Is anyone reading this stuff?  If so, email me!
 
 - Someone other than SeongJae has developed a DAMON feature in Honggyu
   Kim's series "DAMON based tiered memory management for CXL memory".
   This adds DAMON features which may be used to help determine the
   efficiency of our placement of CXL/PCIe attached DRAM.
 
 - DAMON user API centralization and simplificatio work in SeongJae
   Park's series "mm/damon: introduce DAMON parameters online commit
   function".
 
 - In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
   David Hildenbrand does some maintenance work on zsmalloc - partially
   modernizing its use of pageframe fields.
 
 - Kefeng Wang provides more folio conversions in the series "mm: remove
   page_maybe_dma_pinned() and page_mkclean()".
 
 - More cleanup from David Hildenbrand, this time in the series
   "mm/memory_hotplug: use PageOffline() instead of PageReserved() for
   !ZONE_DEVICE".  It "enlightens memory hotplug more about PageOffline()
   pages" and permits the removal of some virtio-mem hacks.
 
 - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
   __folio_add_anon_rmap()" is a cleanup to the anon folio handling in
   preparation for mTHP (multisize THP) swapin.
 
 - Kefeng Wang's series "mm: improve clear and copy user folio"
   implements more folio conversions, this time in the area of large folio
   userspace copying.
 
 - The series "Docs/mm/damon/maintaier-profile: document a mailing tool
   and community meetup series" tells people how to get better involved
   with other DAMON developers.  From SeongJae Park.
 
 - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
   that.
 
 - David Hildenbrand sends along more cleanups, this time against the
   migration code.  The series is "mm/migrate: move NUMA hinting fault
   folio isolation + checks under PTL".
 
 - Jan Kara has found quite a lot of strangenesses and minor errors in
   the readahead code.  He addresses this in the series "mm: Fix various
   readahead quirks".
 
 - SeongJae Park's series "selftests/damon: test DAMOS tried regions and
   {min,max}_nr_regions" adds features and addresses errors in DAMON's self
   testing code.
 
 - Gavin Shan has found a userspace-triggerable WARN in the pagecache
   code.  The series "mm/filemap: Limit page cache size to that supported
   by xarray" addresses this.  The series is marked cc:stable.
 
 - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
   and cleanup" cleans up and slightly optimizes KSM.
 
 - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
   code motion.  The series (which also makes the memcg-v1 code
   Kconfigurable) are
 
   "mm: memcg: separate legacy cgroup v1 code and put under config
   option" and
   "mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1"
 
 - Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
   adds an additional feature to this cgroup-v2 control file.
 
 - The series "Userspace controls soft-offline pages" from Jiaqi Yan
   permits userspace to stop the kernel's automatic treatment of excessive
   correctable memory errors.  In order to permit userspace to monitor and
   handle this situation.
 
 - Kefeng Wang's series "mm: migrate: support poison recover from migrate
   folio" teaches the kernel to appropriately handle migration from
   poisoned source folios rather than simply panicing.
 
 - SeongJae Park's series "Docs/damon: minor fixups and improvements"
   does those things.
 
 - In the series "mm/zsmalloc: change back to per-size_class lock"
   Chengming Zhou improves zsmalloc's scalability and memory utilization.
 
 - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
   pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare
   refcount increments.  So these paes can first be moved aside if they
   reside in the movable zone or a CMA block.
 
 - Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps
   for much faster reading of vma information.  The series is "query VMAs
   from /proc/<pid>/maps".
 
 - In the series "mm: introduce per-order mTHP split counters" Lance Yang
   improves the kernel's presentation of developer information related to
   multisize THP splitting.
 
 - Michael Ellerman has developed the series "Reimplement huge pages
   without hugepd on powerpc (8xx, e500, book3s/64)".  This permits
   userspace to use all available huge page sizes.
 
 - In the series "revert unconditional slab and page allocator fault
   injection calls" Vlastimil Babka removes a performance-affecting and not
   very useful feature from slab fault injection.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZp2C+QAKCRDdBJ7gKXxA
 joTkAQDvjqOoFStqk4GU3OXMYB7WCU/ZQMFG0iuu1EEwTVDZ4QEA8CnG7seek1R3
 xEoo+vw0sWWeLV3qzsxnCA1BJ8cTJA8=
 =z0Lf
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - In the series "mm: Avoid possible overflows in dirty throttling" Jan
   Kara addresses a couple of issues in the writeback throttling code.
   These fixes are also targetted at -stable kernels.

 - Ryusuke Konishi's series "nilfs2: fix potential issues related to
   reserved inodes" does that. This should actually be in the
   mm-nonmm-stable tree, along with the many other nilfs2 patches. My
   bad.

 - More folio conversions from Kefeng Wang in the series "mm: convert to
   folio_alloc_mpol()"

 - Kemeng Shi has sent some cleanups to the writeback code in the series
   "Add helper functions to remove repeated code and improve readability
   of cgroup writeback"

 - Kairui Song has made the swap code a little smaller and a little
   faster in the series "mm/swap: clean up and optimize swap cache
   index".

 - In the series "mm/memory: cleanly support zeropage in
   vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
   Hildenbrand has reworked the rather sketchy handling of the use of
   the zeropage in MAP_SHARED mappings. I don't see any runtime effects
   here - more a cleanup/understandability/maintainablity thing.

 - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling
   of higher addresses, for aarch64. The (poorly named) series is
   "Restructure va_high_addr_switch".

 - The core TLB handling code gets some cleanups and possible slight
   optimizations in Bang Li's series "Add update_mmu_tlb_range() to
   simplify code".

 - Jane Chu has improved the handling of our
   fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in
   the series "Enhance soft hwpoison handling and injection".

 - Jeff Johnson has sent a billion patches everywhere to add
   MODULE_DESCRIPTION() to everything. Some landed in this pull.

 - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang
   has simplified migration's use of hardware-offload memory copying.

 - Yosry Ahmed performs more folio API conversions in his series "mm:
   zswap: trivial folio conversions".

 - In the series "large folios swap-in: handle refault cases first",
   Chuanhua Han inches us forward in the handling of large pages in the
   swap code. This is a cleanup and optimization, working toward the end
   objective of full support of large folio swapin/out.

 - In the series "mm,swap: cleanup VMA based swap readahead window
   calculation", Huang Ying has contributed some cleanups and a possible
   fixlet to his VMA based swap readahead code.

 - In the series "add mTHP support for anonymous shmem" Baolin Wang has
   taught anonymous shmem mappings to use multisize THP. By default this
   is a no-op - users must opt in vis sysfs controls. Dramatic
   improvements in pagefault latency are realized.

 - David Hildenbrand has some cleanups to our remaining use of
   page_mapcount() in the series "fs/proc: move page_mapcount() to
   fs/proc/internal.h".

 - David also has some highmem accounting cleanups in the series
   "mm/highmem: don't track highmem pages manually".

 - Build-time fixes and cleanups from John Hubbard in the series
   "cleanups, fixes, and progress towards avoiding "make headers"".

 - Cleanups and consolidation of the core pagemap handling from Barry
   Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
   and utilize them".

 - Lance Yang's series "Reclaim lazyfree THP without splitting" has
   reduced the latency of the reclaim of pmd-mapped THPs under fairly
   common circumstances. A 10x speedup is seen in a microbenchmark.

   It does this by punting to aother CPU but I guess that's a win unless
   all CPUs are pegged.

 - hugetlb_cgroup cleanups from Xiu Jianfeng in the series
   "mm/hugetlb_cgroup: rework on cftypes".

 - Miaohe Lin's series "Some cleanups for memory-failure" does just that
   thing.

 - Someone other than SeongJae has developed a DAMON feature in Honggyu
   Kim's series "DAMON based tiered memory management for CXL memory".
   This adds DAMON features which may be used to help determine the
   efficiency of our placement of CXL/PCIe attached DRAM.

 - DAMON user API centralization and simplificatio work in SeongJae
   Park's series "mm/damon: introduce DAMON parameters online commit
   function".

 - In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
   David Hildenbrand does some maintenance work on zsmalloc - partially
   modernizing its use of pageframe fields.

 - Kefeng Wang provides more folio conversions in the series "mm: remove
   page_maybe_dma_pinned() and page_mkclean()".

 - More cleanup from David Hildenbrand, this time in the series
   "mm/memory_hotplug: use PageOffline() instead of PageReserved() for
   !ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
   pages" and permits the removal of some virtio-mem hacks.

 - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
   __folio_add_anon_rmap()" is a cleanup to the anon folio handling in
   preparation for mTHP (multisize THP) swapin.

 - Kefeng Wang's series "mm: improve clear and copy user folio"
   implements more folio conversions, this time in the area of large
   folio userspace copying.

 - The series "Docs/mm/damon/maintaier-profile: document a mailing tool
   and community meetup series" tells people how to get better involved
   with other DAMON developers. From SeongJae Park.

 - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
   that.

 - David Hildenbrand sends along more cleanups, this time against the
   migration code. The series is "mm/migrate: move NUMA hinting fault
   folio isolation + checks under PTL".

 - Jan Kara has found quite a lot of strangenesses and minor errors in
   the readahead code. He addresses this in the series "mm: Fix various
   readahead quirks".

 - SeongJae Park's series "selftests/damon: test DAMOS tried regions and
   {min,max}_nr_regions" adds features and addresses errors in DAMON's
   self testing code.

 - Gavin Shan has found a userspace-triggerable WARN in the pagecache
   code. The series "mm/filemap: Limit page cache size to that supported
   by xarray" addresses this. The series is marked cc:stable.

 - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
   and cleanup" cleans up and slightly optimizes KSM.

 - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
   code motion. The series (which also makes the memcg-v1 code
   Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put
   under config option" and "mm: memcg: put cgroup v1-specific memcg
   data under CONFIG_MEMCG_V1"

 - Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
   adds an additional feature to this cgroup-v2 control file.

 - The series "Userspace controls soft-offline pages" from Jiaqi Yan
   permits userspace to stop the kernel's automatic treatment of
   excessive correctable memory errors. In order to permit userspace to
   monitor and handle this situation.

 - Kefeng Wang's series "mm: migrate: support poison recover from
   migrate folio" teaches the kernel to appropriately handle migration
   from poisoned source folios rather than simply panicing.

 - SeongJae Park's series "Docs/damon: minor fixups and improvements"
   does those things.

 - In the series "mm/zsmalloc: change back to per-size_class lock"
   Chengming Zhou improves zsmalloc's scalability and memory
   utilization.

 - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
   pinning memfd folios" makes the GUP code use FOLL_PIN rather than
   bare refcount increments. So these paes can first be moved aside if
   they reside in the movable zone or a CMA block.

 - Andrii Nakryiko has added a binary ioctl()-based API to
   /proc/pid/maps for much faster reading of vma information. The series
   is "query VMAs from /proc/<pid>/maps".

 - In the series "mm: introduce per-order mTHP split counters" Lance
   Yang improves the kernel's presentation of developer information
   related to multisize THP splitting.

 - Michael Ellerman has developed the series "Reimplement huge pages
   without hugepd on powerpc (8xx, e500, book3s/64)". This permits
   userspace to use all available huge page sizes.

 - In the series "revert unconditional slab and page allocator fault
   injection calls" Vlastimil Babka removes a performance-affecting and
   not very useful feature from slab fault injection.

* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
  mm/mglru: fix ineffective protection calculation
  mm/zswap: fix a white space issue
  mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
  mm/hugetlb: fix possible recursive locking detected warning
  mm/gup: clear the LRU flag of a page before adding to LRU batch
  mm/numa_balancing: teach mpol_to_str about the balancing mode
  mm: memcg1: convert charge move flags to unsigned long long
  alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
  lib: reuse page_ext_data() to obtain codetag_ref
  lib: add missing newline character in the warning message
  mm/mglru: fix overshooting shrinker memory
  mm/mglru: fix div-by-zero in vmpressure_calc_level()
  mm/kmemleak: replace strncpy() with strscpy()
  mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
  mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
  mm: ignore data-race in __swap_writepage
  hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
  mm: shmem: rename mTHP shmem counters
  mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
  mm/migrate: putback split folios when numa hint migration fails
  ...
2024-07-21 17:15:46 -07:00
Linus Torvalds
1c7d0c3af5 s390 updates for 6.11 merge window
- Remove restrictions on PAI NNPA and crypto counters, enabling
   concurrent per-task and system-wide sampling and counting events
 
 - Switch to GENERIC_CPU_DEVICES by setting up the CPU present mask in
   the architecture code and letting the generic code handle CPU bring-up
 
 - Add support for the diag204 busy indication facility to prevent
   undesirable blocking during hypervisor logical CPU utilization
   queries. Implement results caching
 
 - Improve the handling of Store Data SCLP events by suppressing
   unnecessary warning, preventing buffer release in I/O during failures,
   and adding timeout handling for Store Data requests to address potential
   firmware issues
 
 - Provide optimized __arch_hweight*() implementations
 
 - Remove the unnecessary CPU KOBJ_CHANGE uevents generated during topology
   updates, as they are unused and also not present on other architectures
 
 - Cleanup atomic_ops, optimize __atomic_set() for small values and
   __atomic_cmpxchg_bool() for compilers supporting flag output constraint
 
 - Couple of cleanups for KVM:
   - Move and improve KVM struct definitions for DAT tables from gaccess.c
     to a new header
   - Pass the asce as parameter to sie64a()
 
 - Make the crdte() and cspg() page table handling wrappers return a
   boolean to indicate success, like the other existing "compare and swap"
   wrappers
 
 - Add documentation for HWCAP flags
 
 - Switch to obtaining total RAM pages from memblock instead of
   totalram_pages() during mm init, to ensure correct calculation of zero
   page size, when defer_init is enabled
 
 - Refactor lowcore access and switch to using the get_lowcore() function
   instead of the S390_lowcore macro
 
 - Cleanups for PG_arch_1 and folio handling in UV and hugetlb code
 
 - Add missing MODULE_DESCRIPTION() macros
 
 - Fix VM_FAULT_HWPOISON handling in do_exception()
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmaYGegACgkQjYWKoQLX
 FBjCwwf/aRYoLIXCa9/nHGWFiUjZm6xBgVwZh55bXjfNG9TI2J9UZSsYlOFGUJKl
 gvD2Ym+LqAejK8R4EUHkfD6ftaKMQuIxNDoedxhwuSpfOQ2mZ5teu0MxTh8QcUAx
 4Y2w5XEeCuqE3SuoZ4SJa58K4rGl4cFpPsKNa8ofdzH1ZLFNe8Wqzis4kh0htqLb
 FtPj6nsgfzQ5kg14rVkGxCa4CqoFxonXgsA6nH6xZLbxKUInyq8uV44UBQ+aJq5v
 dsdzZ5XuAJHN2FpBuuOYQYZYw3XIy/kka7o4EjffORi5SGCRMWO4Zt0P6HXaNkh6
 xV8EEO8myeo7rV8dnrk1V4yGjGJmfA==
 =3IGY
 -----END PGP SIGNATURE-----

Merge tag 's390-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Vasily Gorbik:

 - Remove restrictions on PAI NNPA and crypto counters, enabling
   concurrent per-task and system-wide sampling and counting events

 - Switch to GENERIC_CPU_DEVICES by setting up the CPU present mask in
   the architecture code and letting the generic code handle CPU
   bring-up

 - Add support for the diag204 busy indication facility to prevent
   undesirable blocking during hypervisor logical CPU utilization
   queries. Implement results caching

 - Improve the handling of Store Data SCLP events by suppressing
   unnecessary warning, preventing buffer release in I/O during
   failures, and adding timeout handling for Store Data requests to
   address potential firmware issues

 - Provide optimized __arch_hweight*() implementations

 - Remove the unnecessary CPU KOBJ_CHANGE uevents generated during
   topology updates, as they are unused and also not present on other
   architectures

 - Cleanup atomic_ops, optimize __atomic_set() for small values and
   __atomic_cmpxchg_bool() for compilers supporting flag output
   constraint

 - Couple of cleanups for KVM:
     - Move and improve KVM struct definitions for DAT tables from
       gaccess.c to a new header
     - Pass the asce as parameter to sie64a()

 - Make the crdte() and cspg() page table handling wrappers return a
   boolean to indicate success, like the other existing "compare and
   swap" wrappers

 - Add documentation for HWCAP flags

 - Switch to obtaining total RAM pages from memblock instead of
   totalram_pages() during mm init, to ensure correct calculation of
   zero page size, when defer_init is enabled

 - Refactor lowcore access and switch to using the get_lowcore()
   function instead of the S390_lowcore macro

 - Cleanups for PG_arch_1 and folio handling in UV and hugetlb code

 - Add missing MODULE_DESCRIPTION() macros

 - Fix VM_FAULT_HWPOISON handling in do_exception()

* tag 's390-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits)
  s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()
  s390/kvm: Move bitfields for dat tables
  s390/entry: Pass the asce as parameter to sie64a()
  s390/sthyi: Use cached data when diag is busy
  s390/sthyi: Move diag operations
  s390/hypfs_diag: Diag204 busy loop
  s390/diag: Add busy-indication-facility requirements
  s390/diag: Diag204 add busy return errno
  s390/diag: Return errno's from diag204
  s390/sclp: Diag204 busy indication facility detection
  s390/atomic_ops: Make use of flag output constraint
  s390/atomic_ops: Improve __atomic_set() for small values
  s390/atomic_ops: Use symbolic names
  s390/smp: Switch to GENERIC_CPU_DEVICES
  s390/hwcaps: Add documentation for HWCAP flags
  s390/pgtable: Make crdte() and cspg() return a value
  s390/topology: Remove CPU KOBJ_CHANGE uevents
  s390/sclp: Add timeout to Store Data requests
  s390/sclp: Prevent release of buffer in I/O
  s390/sclp: Suppress unnecessary Store Data warning
  ...
2024-07-18 15:41:45 -07:00
Claudio Imbrenda
723ac2d6ba s390/entry: Pass the asce as parameter to sie64a()
Pass the guest ASCE explicitly as parameter, instead of having sie64a()
take it from lowcore.

This removes hidden state from lowcore, and makes things look cleaner.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Link: https://lore.kernel.org/r/20240703155900.103783-2-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-10 19:50:45 +02:00
Mete Durlu
b7a5e5dfbd s390/sthyi: Use cached data when diag is busy
When sthyi is being emulated, data from diag204 is used. If diag204
returns busy, previously cached sthyi info block is returned to the
caller and cache expiry is set to expired.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-10 19:50:45 +02:00
Mete Durlu
6fdf72c9a9 s390/sthyi: Move diag operations
Move diag204 related operations to their own functions for better error
handling and better readability.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-10 19:50:45 +02:00
Mete Durlu
df7e714d6d s390/diag: Diag204 add busy return errno
When diag204-busy-indication facility is installed, diag204 can return
'8' which means device is busy and no operation is done. Add check for
return codes of diag204 call. Return error codes according to diag204
return codes.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-10 19:50:45 +02:00
Mete Durlu
bb9be93acb s390/diag: Return errno's from diag204
Return different errno's from diag204 to allow users to handle them
accordingly. Instead of returning -1 regardless of the failing
condition, return -EINVAL on invalid memory address and -EOPNOTSUPP when
diag instruction fails.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-10 19:50:44 +02:00
Sven Schnelle
4a39f12e75 s390/smp: Switch to GENERIC_CPU_DEVICES
Instead of setting up non-boot CPUs early in architecture code,
only setup the cpu present mask and let the generic code handle
cpu bringup.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-10 19:50:44 +02:00
Ilya Leoshkevich
7e17eac28a s390/unwind: disable KMSAN checks
The unwind code can read uninitialized frames.  Furthermore, even in the
good case, KMSAN does not emit shadow for backchains.  Therefore disable
it for the unwinding functions.

Link: https://lkml.kernel.org/r/20240621113706.315500-37-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03 19:30:25 -07:00
Ilya Leoshkevich
c1057a707a s390/traps: unpoison the kernel_stack_overflow()'s pt_regs
This is normally done by the generic entry code, but the
kernel_stack_overflow() flow bypasses it.

Link: https://lkml.kernel.org/r/20240621113706.315500-34-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03 19:30:25 -07:00
Ilya Leoshkevich
0cfd60a6a1 s390/ftrace: unpoison ftrace_regs in kprobe_ftrace_handler()
s390 uses assembly code to initialize ftrace_regs and call
kprobe_ftrace_handler().  Therefore, from the KMSAN's point of view,
ftrace_regs is poisoned on kprobe_ftrace_handler() entry.  This causes
KMSAN warnings when running the ftrace testsuite.

Fix by trusting the assembly code and always unpoisoning ftrace_regs in
kprobe_ftrace_handler().

Link: https://lkml.kernel.org/r/20240621113706.315500-30-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03 19:30:24 -07:00
Ilya Leoshkevich
1f4cf63979 s390/diag: unpoison diag224() output buffer
Diagnose 224 stores 4k bytes, which currently cannot be deduced from the
inline assembly constraints.  This leads to KMSAN false positives.

Fix the constraints by using a 4k-sized struct instead of a raw pointer. 
While at it, prettify them too.

Link: https://lkml.kernel.org/r/20240621113706.315500-29-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03 19:30:24 -07:00
Mete Durlu
d6d1aa519c s390/topology: Remove CPU KOBJ_CHANGE uevents
s390 generates KOBJ_CHANGE uevents on CPUs whenever a topology update
occurs. These uevents currently have no users and they are also not
present on other architectures. As they are not necessary, remove
these extra uevents.

Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-07-02 10:17:16 +02:00
Arnd Bergmann
5daf62da52 s390: remove native mmap2() syscall
The mmap2() syscall has never been used on 64-bit s390x and should
have been removed as part of 5a79859ae0 ("s390: remove 31 bit
support").

Remove it now.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-06-25 15:57:37 +02:00
Arnd Bergmann
d3882564a7 syscalls: fix compat_sys_io_pgetevents_time64 usage
Using sys_io_pgetevents() as the entry point for compat mode tasks
works almost correctly, but misses the sign extension for the min_nr
and nr arguments.

This was addressed on parisc by switching to
compat_sys_io_pgetevents_time64() in commit 6431e92fc8 ("parisc:
io_pgetevents_time64() needs compat syscall in 32-bit compat mode"),
as well as by using more sophisticated system call wrappers on x86 and
s390. However, arm64, mips, powerpc, sparc and riscv still have the
same bug.

Change all of them over to use compat_sys_io_pgetevents_time64()
like parisc already does. This was clearly the intention when the
function was originally added, but it got hooked up incorrectly in
the tables.

Cc: stable@vger.kernel.org
Fixes: 48166e6ea4 ("y2038: add 64-bit time_t syscalls to all 32-bit architectures")
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-06-25 15:57:20 +02:00
Sven Schnelle
81f907b246 s390/mm: Remove duplicate get_lowcore() calls
Assign the output from get_lowcore() to a local variable,
so the code is easier to read.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-06-18 17:01:33 +02:00
Sven Schnelle
15428734e1 s390/idle: Remove duplicate get_lowcore() calls
Assign the output from get_lowcore() to a local variable,
so the code is easier to read.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-06-18 17:01:33 +02:00
Sven Schnelle
46c3031108 s390/vtime: Remove duplicate get_lowcore() calls
Assign the output from get_lowcore() to a local variable,
so the code is easier to read.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-06-18 17:01:33 +02:00
Sven Schnelle
eb28ec2b2e s390/smp: Remove duplicate get_lowcore() calls
Assign the output from get_lowcore() to a local variable,
so the code is easier to read.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-06-18 17:01:33 +02:00
Sven Schnelle
d7c3ebc49e s390/nmi: Remove duplicate get_lowcore() calls
Assign the output from get_lowcore() to a local variable,
so the code is easier to read.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-06-18 17:01:33 +02:00
Sven Schnelle
208da1d5fc s390: Replace S390_lowcore by get_lowcore()
Replace all S390_lowcore usages in arch/s390/ by get_lowcore().

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-06-18 17:01:33 +02:00
Thomas Richter
582cc1b28e s390/pai_ext: Enable per-task and system-wide sampling event
The PMU for PAI NNPA counters enforces the following restriction:

 - No per-task context for PAI sampling event NNPA_ALL
 - No multiple system-wide PAI sampling event NNPA_ALL

Both restrictions are removed. One or more per-task sampling events
are supported. Also one or more system-wide sampling events are
supported.

Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-07 16:49:08 +02:00
Thomas Richter
3f9ff4c5a0 s390/pai_ext: Enable per-task counting event
The PMU for PAI NNPA counters enforces the following restriction:

 - No per-task context for PAI NNPA counters.

This restriction is removed. One or more per-task/system-wide counting
events can now be active at the same time while one system wide
sampling event is active.

Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-07 16:49:08 +02:00
Thomas Richter
14e3768435 s390/pai_ext: Enable concurrent system-wide counting/sampling
The PMU for PAI NNPA counters enforces the following restriction:

- No system wide counting while system wide sampling is active.

This restriction is removed. One or more system wide counting events
can now be active at the same time while at most one system wide
sampling event is active.

Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-07 16:49:07 +02:00
Thomas Richter
9f66572f28 s390/pai_crypto: Enable per-task and system-wide sampling event
The PMU for PAI crypto counters enforces the following restrictions:

      - No per-task context for PAI crypto sampling event CRYPTO_ALL
      - No multiple system-wide PAI crypto sampling event CRYPTO_ALL

Both restrictions are removed. One or more per-task sampling events
are supported. Also one or more system-wide sampling events are
supported.

Example for per-task context of sampling event CRYPTO_ALL:
       # perf record -e pai_crypto/CRYPTO_ALL/ -- true
Example for system-wide context of sampling event CRYPTO_ALL:
       # perf record -e pai_crypto/CRYPTO_ALL/ -a -- sleep 4

Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-07 16:49:07 +02:00
Thomas Richter
92ea686840 s390/pai_crypto: Enable per-task counting event
The PMU for PAI crypto counters enforces the following restriction:

     - No per-task context for PAI crypto counters events.

This restriction is removed. One or more per-task/system-wide counting
events can now be active at the same time while at most one system
wide sampling event is active.

Example for per-task context of a PAI crypto counter event:
   # perf stat -e pai_crypto/KM_AES_128/ -- true

Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-07 16:49:07 +02:00
Thomas Richter
fb412c6241 s390/pai_crypto: Enable concurrent system-wide counting/sampling event
The PMU for PAI crypto counters enforces the following restriction:

 - No system wide counting while system wide sampling is active.

This restriction is removed. One or more system wide counting events
can now be active at the same time while at most one system wide
sampling event is active.

Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-07 16:49:06 +02:00
David Hildenbrand
99b3f8f76f s390/uv: Implement HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE
Let's also implement HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE, so we can convert
arch_make_page_accessible() to be a simple wrapper around
arch_make_folio_accessible(). Unfortunately, we cannot do that in the
header.

There are only two arch_make_page_accessible() calls remaining in gup.c.
We can now drop HAVE_ARCH_MAKE_PAGE_ACCESSIBLE completely form core-MM.
We'll handle that separately, once the s390x part landed.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-10-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-05 17:17:25 +02:00
David Hildenbrand
7d17143469 s390/uv: Convert uv_convert_owned_from_secure() to uv_convert_from_secure_(folio|pte)()
Let's do the same as we did for uv_destroy_(folio|pte)() and
have the following variants:

(1) uv_convert_from_secure(): "low level" helper that operates on paddr
and does not mess with folios.

(2) uv_convert_from_secure_folio(): Consumes a folio to which we hold a
reference.

(3) uv_convert_from_secure_pte(): Consumes a PTE that holds a reference
through the mapping.

Unfortunately we need uv_convert_from_secure_pte(), because pfn_folio()
and friends are not available in pgtable.h.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-9-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-05 17:17:25 +02:00
David Hildenbrand
7063150650 s390/uv: Convert uv_destroy_owned_page() to uv_destroy_(folio|pte)()
Let's have the following variants for destroying pages:

(1) uv_destroy(): Like uv_pin_shared() and uv_convert_from_secure(),
"low level" helper that operates on paddr and doesn't mess with folios.

(2) uv_destroy_folio(): Consumes a folio to which we hold a reference.

(3) uv_destroy_pte(): Consumes a PTE that holds a reference through the
mapping.

Unfortunately we need uv_destroy_pte(), because pfn_folio() and
friends are not available in pgtable.h.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-8-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-05 17:17:25 +02:00
David Hildenbrand
e58623fbc1 s390/uv: Make uv_convert_from_secure() a static function
It's not used outside of uv.c, so let's make it a static function.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-7-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-05 17:17:25 +02:00
David Hildenbrand
80cf817949 s390/uv: Update PG_arch_1 comment
We removed the usage of PG_arch_1 for page tables in commit
a51324c430 ("s390/cmma: rework no-dat handling").

Let's update the comment in UV to reflect that.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-6-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-05 17:17:24 +02:00
David Hildenbrand
036c0e104b s390/uv: Convert PG_arch_1 users to only work on small folios
Now that make_folio_secure() may only set PG_arch_1 for small folios,
let's convert relevant remaining UV code to only work on (small) folios
and simply reject large folios early. This way, we'll never end up
touching PG_arch_1 on tail pages of a large folio in UV code.

The folio_get()/folio_put() for functions that are documented to already
hold a folio reference look weird; likely they are required to make
concurrent gmap_make_secure() back off because the caller might only hold
an implicit reference due to the page mapping. So leave that alone for now.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-5-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-05 17:17:24 +02:00
David Hildenbrand
eef88fe45a s390/uv: Split large folios in gmap_make_secure()
While s390x makes sure to never have PMD-mapped THP in processes that use
KVM -- by remapping them using PTEs in
thp_split_walk_pmd_entry()->split_huge_pmd() -- there is still the
possibility of having PTE-mapped THPs (large folios) mapped into guest
memory.

This would happen if user space allocates memory before calling
KVM_CREATE_VM (which would call s390_enable_sie()). With upstream QEMU,
this currently doesn't happen, because guest memory is setup and
conditionally preallocated after KVM_CREATE_VM.

Could it happen with shmem/file-backed memory when another process
allocated memory in the pagecache? Likely, although currently not a
common setup.

Trying to split any PTE-mapped large folios sounds like the right and
future-proof thing to do here. So let's call split_folio() and handle the
return values accordingly.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-4-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-05 17:17:24 +02:00
David Hildenbrand
68ad4743be s390/uv: gmap_make_secure() cleanups for further changes
Let's factor out handling of LRU cache draining and convert the if-else
chain to a switch-case.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-3-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-05 17:17:24 +02:00
David Hildenbrand
3f29f6537f s390/uv: Don't call folio_wait_writeback() without a folio reference
folio_wait_writeback() requires that no spinlocks are held and that
a folio reference is held, as documented. After we dropped the PTL, the
folio could get freed concurrently. So grab a temporary reference.

Fixes: 214d9bbcd3 ("s390/mm: provide memory management functions for protected KVM guests")
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-2-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-05 17:17:23 +02:00
Alexander Gordeev
d38e48563c s390/crash: Do not use VM info if os_info does not have it
The virtual memory information stored in os_info area is
required for creation of the kernel image PT_LOAD program
header for kernels since commit a2ec5bec56dd ("s390/mm:
uncouple physical vs virtual address spaces").

By contrast, if such information in os_info is absent the
PT_LOAD program header should not be created.

Currently the proper PT_LOAD program header is created for
kernels that contain the virtual memory information, but
for kernels without one an invalid header of zero size is
created. That in turn leads to stand-alone dump failures.

Use OS_INFO_KASLR_OFFSET variable to check whether os_info
is present or not (same as crash and makedumpfile tools do)
and based on that create or do not create the kernel image
PT_LOAD program header.

Fixes: f4cac27dc0 ("s390/crash: Use old os_info to create PT_LOAD headers")
Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Acked-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-05 17:03:24 +02:00
Jeff Xu
ff388fe5c4 mseal: wire up mseal syscall
Patch series "Introduce mseal", v10.

This patchset proposes a new mseal() syscall for the Linux kernel.

In a nutshell, mseal() protects the VMAs of a given virtual memory range
against modifications, such as changes to their permission bits.

Modern CPUs support memory permissions, such as the read/write (RW) and
no-execute (NX) bits.  Linux has supported NX since the release of kernel
version 2.6.8 in August 2004 [1].  The memory permission feature improves
the security stance on memory corruption bugs, as an attacker cannot
simply write to arbitrary memory and point the code to it.  The memory
must be marked with the X bit, or else an exception will occur. 
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct).  mseal() additionally protects the
VMA itself against modifications of the selected seal type.

Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system.  For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped.  Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime.  A
similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall
[4].  Also, Chrome wants to adopt this feature for their CFI work [2] and
this patchset has been designed to be compatible with the Chrome use case.

Two system calls are involved in sealing the map:  mmap() and mseal().

The new mseal() is an syscall on 64 bit CPU, and with following signature:

int mseal(void addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.

mseal() blocks following operations for the given memory range.

1> Unmapping, moving to another location, and shrinking the size,
   via munmap() and mremap(), can leave an empty space, therefore can
   be replaced with a VMA with a new set of attributes.

2> Moving or expanding a different VMA into the current location,
   via mremap().

3> Modifying a VMA via mmap(MAP_FIXED).

4> Size expansion, via mremap(), does not appear to pose any specific
   risks to sealed VMAs. It is included anyway because the use case is
   unclear. In any case, users can rely on merging to expand a sealed VMA.

5> mprotect() and pkey_mprotect().

6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous
   memory, when users don't have write permission to the memory. Those
   behaviors can alter region contents by discarding pages, effectively a
   memset(0) for anonymous memory.

The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5].  Chrome browser in ChromeOS will be the first user of this
API.

Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications.  For example, in the
case of libc, sealing is only applied to read-only (RO) or read-execute
(RX) memory segments (such as .text and .RELRO) to prevent them from
becoming writable, the lifetime of those mappings are tied to the lifetime
of the process.

Chrome wants to seal two large address space reservations that are managed
by different allocators.  The memory is mapped RW- and RWX respectively
but write access to it is restricted using pkeys (or in the future ARM
permission overlay extensions).  The lifetime of those mappings are not
tied to the lifetime of the process, therefore, while the memory is
sealed, the allocators still need to free or discard the unused memory. 
For example, with madvise(DONTNEED).

However, always allowing madvise(DONTNEED) on this range poses a security
risk.  For example if a jump instruction crosses a page boundary and the
second page gets discarded, it will overwrite the target bytes with zeros
and change the control flow.  Checking write-permission before the discard
operation allows us to control when the operation is valid.  In this case,
the madvise will only succeed if the executing thread has PKEY write
permissions and PKRU changes are protected in software by control-flow
integrity.

Although the initial version of this patch series is targeting the Chrome
browser as its first user, it became evident during upstream discussions
that we would also want to ensure that the patch set eventually is a
complete solution for memory sealing and compatible with other use cases. 
The specific scenario currently in mind is glibc's use case of loading and
sealing ELF executables.  To this end, Stephen is working on a change to
glibc to add sealing support to the dynamic linker, which will seal all
non-writable segments at startup.  Once this work is completed, all
applications will be able to automatically benefit from these new
protections.

In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental in
shaping this patch:

Jann Horn: raising awareness and providing valuable insights on the
  destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Theo de Raadt: sharing the experiences and insight gained from
  implementing mimmutable() in OpenBSD.

MM perf benchmarks
==================
This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to
check the VMAs’ sealing flag, so that no partial update can be made,
when any segment within the given memory range is sealed.

To measure the performance impact of this loop, two tests are developed.
[8]

The first is measuring the time taken for a particular system call,
by using clock_gettime(CLOCK_MONOTONIC). The second is using
PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have
similar results.

The tests have roughly below sequence:
for (i = 0; i < 1000, i++)
    create 1000 mappings (1 page per VMA)
    start the sampling
    for (j = 0; j < 1000, j++)
        mprotect one mapping
    stop and save the sample
    delete 1000 mappings
calculates all samples.

Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz,
4G memory, Chromebook.

Based on the latest upstream code:
The first test (measuring time)
syscall__	vmas	t	t_mseal	delta_ns	per_vma	%
munmap__  	1	909	944	35	35	104%
munmap__  	2	1398	1502	104	52	107%
munmap__  	4	2444	2594	149	37	106%
munmap__  	8	4029	4323	293	37	107%
munmap__  	16	6647	6935	288	18	104%
munmap__  	32	11811	12398	587	18	105%
mprotect	1	439	465	26	26	106%
mprotect	2	1659	1745	86	43	105%
mprotect	4	3747	3889	142	36	104%
mprotect	8	6755	6969	215	27	103%
mprotect	16	13748	14144	396	25	103%
mprotect	32	27827	28969	1142	36	104%
madvise_	1	240	262	22	22	109%
madvise_	2	366	442	76	38	121%
madvise_	4	623	751	128	32	121%
madvise_	8	1110	1324	215	27	119%
madvise_	16	2127	2451	324	20	115%
madvise_	32	4109	4642	534	17	113%

The second test (measuring cpu cycle)
syscall__	vmas	cpu	cmseal	delta_cpu	per_vma	%
munmap__	1	1790	1890	100	100	106%
munmap__	2	2819	3033	214	107	108%
munmap__	4	4959	5271	312	78	106%
munmap__	8	8262	8745	483	60	106%
munmap__	16	13099	14116	1017	64	108%
munmap__	32	23221	24785	1565	49	107%
mprotect	1	906	967	62	62	107%
mprotect	2	3019	3203	184	92	106%
mprotect	4	6149	6569	420	105	107%
mprotect	8	9978	10524	545	68	105%
mprotect	16	20448	21427	979	61	105%
mprotect	32	40972	42935	1963	61	105%
madvise_	1	434	497	63	63	115%
madvise_	2	752	899	147	74	120%
madvise_	4	1313	1513	200	50	115%
madvise_	8	2271	2627	356	44	116%
madvise_	16	4312	4883	571	36	113%
madvise_	32	8376	9319	943	29	111%

Based on the result, for 6.8 kernel, sealing check adds
20-40 nano seconds, or around 50-100 CPU cycles, per VMA.

In addition, I applied the sealing to 5.10 kernel:
The first test (measuring time)
syscall__	vmas	t	tmseal	delta_ns	per_vma	%
munmap__	1	357	390	33	33	109%
munmap__	2	442	463	21	11	105%
munmap__	4	614	634	20	5	103%
munmap__	8	1017	1137	120	15	112%
munmap__	16	1889	2153	263	16	114%
munmap__	32	4109	4088	-21	-1	99%
mprotect	1	235	227	-7	-7	97%
mprotect	2	495	464	-30	-15	94%
mprotect	4	741	764	24	6	103%
mprotect	8	1434	1437	2	0	100%
mprotect	16	2958	2991	33	2	101%
mprotect	32	6431	6608	177	6	103%
madvise_	1	191	208	16	16	109%
madvise_	2	300	324	24	12	108%
madvise_	4	450	473	23	6	105%
madvise_	8	753	806	53	7	107%
madvise_	16	1467	1592	125	8	108%
madvise_	32	2795	3405	610	19	122%
					
The second test (measuring cpu cycle)
syscall__	nbr_vma	cpu	cmseal	delta_cpu	per_vma	%
munmap__	1	684	715	31	31	105%
munmap__	2	861	898	38	19	104%
munmap__	4	1183	1235	51	13	104%
munmap__	8	1999	2045	46	6	102%
munmap__	16	3839	3816	-23	-1	99%
munmap__	32	7672	7887	216	7	103%
mprotect	1	397	443	46	46	112%
mprotect	2	738	788	50	25	107%
mprotect	4	1221	1256	35	9	103%
mprotect	8	2356	2429	72	9	103%
mprotect	16	4961	4935	-26	-2	99%
mprotect	32	9882	10172	291	9	103%
madvise_	1	351	380	29	29	108%
madvise_	2	565	615	49	25	109%
madvise_	4	872	933	61	15	107%
madvise_	8	1508	1640	132	16	109%
madvise_	16	3078	3323	245	15	108%
madvise_	32	5893	6704	811	25	114%

For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30
CPU cycles, there is even decrease in some cases.

It might be interesting to compare 5.10 and 6.8 kernel
The first test (measuring time)
syscall__	vmas	t_5_10	t_6_8	delta_ns	per_vma	%
munmap__	1	357	909	552	552	254%
munmap__	2	442	1398	956	478	316%
munmap__	4	614	2444	1830	458	398%
munmap__	8	1017	4029	3012	377	396%
munmap__	16	1889	6647	4758	297	352%
munmap__	32	4109	11811	7702	241	287%
mprotect	1	235	439	204	204	187%
mprotect	2	495	1659	1164	582	335%
mprotect	4	741	3747	3006	752	506%
mprotect	8	1434	6755	5320	665	471%
mprotect	16	2958	13748	10790	674	465%
mprotect	32	6431	27827	21397	669	433%
madvise_	1	191	240	49	49	125%
madvise_	2	300	366	67	33	122%
madvise_	4	450	623	173	43	138%
madvise_	8	753	1110	357	45	147%
madvise_	16	1467	2127	660	41	145%
madvise_	32	2795	4109	1314	41	147%

The second test (measuring cpu cycle)
syscall__	vmas	cpu_5_10	c_6_8	delta_cpu	per_vma	%
munmap__	1	684	1790	1106	1106	262%
munmap__	2	861	2819	1958	979	327%
munmap__	4	1183	4959	3776	944	419%
munmap__	8	1999	8262	6263	783	413%
munmap__	16	3839	13099	9260	579	341%
munmap__	32	7672	23221	15549	486	303%
mprotect	1	397	906	509	509	228%
mprotect	2	738	3019	2281	1140	409%
mprotect	4	1221	6149	4929	1232	504%
mprotect	8	2356	9978	7622	953	423%
mprotect	16	4961	20448	15487	968	412%
mprotect	32	9882	40972	31091	972	415%
madvise_	1	351	434	82	82	123%
madvise_	2	565	752	186	93	133%
madvise_	4	872	1313	442	110	151%
madvise_	8	1508	2271	763	95	151%
madvise_	16	3078	4312	1234	77	140%
madvise_	32	5893	8376	2483	78	142%

From 5.10 to 6.8
munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma.
mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma.
madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma.

In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the
increase from 5.10 to 6.8 is significantly larger, approximately ten times
greater for munmap and mprotect.

When I discuss the mm performance with Brian Makin, an engineer who worked
on performance, it was brought to my attention that such performance
benchmarks, which measuring millions of mm syscall in a tight loop, may
not accurately reflect real-world scenarios, such as that of a database
service.  Also this is tested using a single HW and ChromeOS, the data
from another HW or distribution might be different.  It might be best to
take this data with a grain of salt.


This patch (of 5):

Wire up mseal syscall for all architectures.

Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org
Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Jann Horn <jannh@google.com> [Bug #2]
Cc: Jeff Xu <jeffxu@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Stephen Röttger <sroettger@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Amer Al Shanawany <amer.shanawany@gmail.com>
Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-23 19:40:26 -07:00
Linus Torvalds
2a8120d7b4 more s390 updates for 6.10 merge window
- Switch read and write software bits for PUDs
 
 - Add missing hardware bits for PUDs and PMDs
 
 - Generate unwind information for C modules to fix GDB unwind
   error for vDSO functions
 
 - Create .build-id links for unstripped vDSO files to enable
   vDSO debugging with symbols
 
 - Use standard stack frame layout for vDSO generated stack frames
   to manually walk stack frames without DWARF information
 
 - Rework perf_callchain_user() and arch_stack_walk_user() functions
   to reduce code duplication
 
 - Skip first stack frame when walking user stack
 
 - Add basic checks to identify invalid instruction pointers when
   walking stack frames
 
 - Introduce and use struct stack_frame_vdso_wrapper within vDSO user
   wrapper code to automatically generate an asm-offset define. Also
   use STACK_FRAME_USER_OVERHEAD instead of STACK_FRAME_OVERHEAD to
   document that the code works with user space stack
 
 - Clear the backchain of the extra stack frame added by the vDSO user
   wrapper code. This allows the user stack walker to detect and skip
   the non-standard stack frame. Without this an incorrect instruction
   pointer would be added to stack traces.
 
 - Rewrite psw_idle() function in C to ease maintenance and further
   enhancements
 
 - Remove get_vtimer() function and use get_cpu_timer() instead
 
 - Mark psw variable in __load_psw_mask() as __unitialized to avoid
   superfluous clearing of PSW
 
 - Remove obsolete and superfluous comment about removed TIF_FPU flag
 
 - Replace memzero_explicit() and kfree() with kfree_sensitive() to
   fix warnings reported by Coccinelle
 
 - Wipe sensitive data and all copies of protected- or secure-keys
   from stack when an IOCTL fails
 
 - Both do_airq_interrupt() and do_io_interrupt() functions set
   CIF_NOHZ_DELAY flag. Move it in do_io_irq() to simplify the code
 
 - Provide iucv_alloc_device() and iucv_release_device() helpers,
   which can be used to deduplicate more or less identical IUCV
   device allocation and release code in four different drivers
 
 - Make use of iucv_alloc_device() and iucv_release_device()
   helpers to get rid of quite some code and also remove a
   cast to an incompatible function (clang W=1)
 
 - There is no user of iucv_root outside of the core IUCV code left.
   Therefore remove the EXPORT_SYMBOL
 
 - __apply_alternatives() contains a runtime check which verifies
   that the size of the to be patched code area is even. Convert
   this to a compile time check
 
 - Increase size of buffers for sending z/VM CP DIAGNOSE X'008'
   commands from 128 to 240
 
 - Do not accept z/VM CP DIAGNOSE X'008' commands longer than
   maximally allowed
 
 - Use correct defines IPL_BP_NVME_LEN and IPL_BP0_NVME_LEN instead
   of IPL_BP_FCP_LEN and IPL_BP0_FCP_LEN ones to initialize NVMe
   reIPL block on 'scp_data' sysfs attribute update
 
 - Initialize the correct fields of the NVMe dump block, which
   were confused with FCP fields
 
 - Refactor macros for 'scp_data' (re-)IPL sysfs attribute to
   reduce code duplication
 
 - Introduce 'scp_data' sysfs attribute for dump IPL to allow tools
   such as dumpconf passing additional kernel command line parameters
   to a stand-alone dumper
 
 - Rework the CPACF query functions to use the correct RRE or RRF
   instruction formats and set instruction register fields correctly
 
 - Instead of calling BUG() at runtime force a link error during
   compile when a unsupported opcode is used with __cpacf_query()
   or __cpacf_check_opcode() functions
 
 - Fix a crash in ap_parse_bitmap_str() function on /sys/bus/ap/apmask
   or /sys/bus/ap/aqmask sysfs file update with a relative mask value
 
 - Fix "bindings complete" udev event which should be sent once all AP
   devices have been bound to device drivers and again when unbind/bind
   actions take place and all AP devices are bound again
 
 - Facility list alt_stfle_fac_list is nowhere used in the decompressor,
   therefore remove it there
 
 - Remove custom kprobes insn slot allocator in favour of the standard
   module_alloc() one, since kernel image and module areas are located
   within 4GB
 
 - Use kvcalloc() instead of kvmalloc_array() in zcrypt driver to avoid
   calling memset() with a large byte count and get rid of the sparse
   warning as result
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZkyx2BccYWdvcmRlZXZA
 bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8PYZAP9KxEfTyUmIh61Gx8+m3BW5dy7p
 E2Q8yotlUpGj49ul+AD8CEAyTiWR95AlMOVZZLV/0J7XIjhALvpKAGfiJWkvXgc=
 =pife
 -----END PGP SIGNATURE-----

Merge tag 's390-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull more s390 updates from Alexander Gordeev:

 - Switch read and write software bits for PUDs

 - Add missing hardware bits for PUDs and PMDs

 - Generate unwind information for C modules to fix GDB unwind error for
   vDSO functions

 - Create .build-id links for unstripped vDSO files to enable vDSO
   debugging with symbols

 - Use standard stack frame layout for vDSO generated stack frames to
   manually walk stack frames without DWARF information

 - Rework perf_callchain_user() and arch_stack_walk_user() functions to
   reduce code duplication

 - Skip first stack frame when walking user stack

 - Add basic checks to identify invalid instruction pointers when
   walking stack frames

 - Introduce and use struct stack_frame_vdso_wrapper within vDSO user
   wrapper code to automatically generate an asm-offset define. Also use
   STACK_FRAME_USER_OVERHEAD instead of STACK_FRAME_OVERHEAD to document
   that the code works with user space stack

 - Clear the backchain of the extra stack frame added by the vDSO user
   wrapper code. This allows the user stack walker to detect and skip
   the non-standard stack frame. Without this an incorrect instruction
   pointer would be added to stack traces.

 - Rewrite psw_idle() function in C to ease maintenance and further
   enhancements

 - Remove get_vtimer() function and use get_cpu_timer() instead

 - Mark psw variable in __load_psw_mask() as __unitialized to avoid
   superfluous clearing of PSW

 - Remove obsolete and superfluous comment about removed TIF_FPU flag

 - Replace memzero_explicit() and kfree() with kfree_sensitive() to fix
   warnings reported by Coccinelle

 - Wipe sensitive data and all copies of protected- or secure-keys from
   stack when an IOCTL fails

 - Both do_airq_interrupt() and do_io_interrupt() functions set
   CIF_NOHZ_DELAY flag. Move it in do_io_irq() to simplify the code

 - Provide iucv_alloc_device() and iucv_release_device() helpers, which
   can be used to deduplicate more or less identical IUCV device
   allocation and release code in four different drivers

 - Make use of iucv_alloc_device() and iucv_release_device() helpers to
   get rid of quite some code and also remove a cast to an incompatible
   function (clang W=1)

 - There is no user of iucv_root outside of the core IUCV code left.
   Therefore remove the EXPORT_SYMBOL

 - __apply_alternatives() contains a runtime check which verifies that
   the size of the to be patched code area is even. Convert this to a
   compile time check

 - Increase size of buffers for sending z/VM CP DIAGNOSE X'008' commands
   from 128 to 240

 - Do not accept z/VM CP DIAGNOSE X'008' commands longer than maximally
   allowed

 - Use correct defines IPL_BP_NVME_LEN and IPL_BP0_NVME_LEN instead of
   IPL_BP_FCP_LEN and IPL_BP0_FCP_LEN ones to initialize NVMe reIPL
   block on 'scp_data' sysfs attribute update

 - Initialize the correct fields of the NVMe dump block, which were
   confused with FCP fields

 - Refactor macros for 'scp_data' (re-)IPL sysfs attribute to reduce
   code duplication

 - Introduce 'scp_data' sysfs attribute for dump IPL to allow tools such
   as dumpconf passing additional kernel command line parameters to a
   stand-alone dumper

 - Rework the CPACF query functions to use the correct RRE or RRF
   instruction formats and set instruction register fields correctly

 - Instead of calling BUG() at runtime force a link error during compile
   when a unsupported opcode is used with __cpacf_query() or
   __cpacf_check_opcode() functions

 - Fix a crash in ap_parse_bitmap_str() function on /sys/bus/ap/apmask
   or /sys/bus/ap/aqmask sysfs file update with a relative mask value

 - Fix "bindings complete" udev event which should be sent once all AP
   devices have been bound to device drivers and again when unbind/bind
   actions take place and all AP devices are bound again

 - Facility list alt_stfle_fac_list is nowhere used in the decompressor,
   therefore remove it there

 - Remove custom kprobes insn slot allocator in favour of the standard
   module_alloc() one, since kernel image and module areas are located
   within 4GB

 - Use kvcalloc() instead of kvmalloc_array() in zcrypt driver to avoid
   calling memset() with a large byte count and get rid of the sparse
   warning as result

* tag 's390-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits)
  s390/zcrypt: Use kvcalloc() instead of kvmalloc_array()
  s390/kprobes: Remove custom insn slot allocator
  s390/boot: Remove alt_stfle_fac_list from decompressor
  s390/ap: Fix bind complete udev event sent after each AP bus scan
  s390/ap: Fix crash in AP internal function modify_bitmap()
  s390/cpacf: Make use of invalid opcode produce a link error
  s390/cpacf: Split and rework cpacf query functions
  s390/ipl: Introduce sysfs attribute 'scp_data' for dump ipl
  s390/ipl: Introduce macros for (re)ipl sysfs attribute 'scp_data'
  s390/ipl: Fix incorrect initialization of nvme dump block
  s390/ipl: Fix incorrect initialization of len fields in nvme reipl block
  s390/ipl: Do not accept z/VM CP diag X'008' cmds longer than max length
  s390/ipl: Fix size of vmcmd buffers for sending z/VM CP diag X'008' cmds
  s390/alternatives: Convert runtime sanity check into compile time check
  s390/iucv: Unexport iucv_root
  tty: hvc-iucv: Make use of iucv_alloc_device()
  s390/smsgiucv_app: Make use of iucv_alloc_device()
  s390/netiucv: Make use of iucv_alloc_device()
  s390/vmlogrdr: Make use of iucv_alloc_device()
  s390/iucv: Provide iucv_alloc_device() / iucv_release_device()
  ...
2024-05-21 12:09:36 -07:00
Linus Torvalds
61307b7be4 The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs.  Notable
 series include:
 
 - Lucas Stach has provided some page-mapping
   cleanup/consolidation/maintainability work in the series "mm/treewide:
   Remove pXd_huge() API".
 
 - In the series "Allow migrate on protnone reference with
   MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
   MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one
   test.
 
 - In their series "Memory allocation profiling" Kent Overstreet and
   Suren Baghdasaryan have contributed a means of determining (via
   /proc/allocinfo) whereabouts in the kernel memory is being allocated:
   number of calls and amount of memory.
 
 - Matthew Wilcox has provided the series "Various significant MM
   patches" which does a number of rather unrelated things, but in largely
   similar code sites.
 
 - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes
   Weiner has fixed the page allocator's handling of migratetype requests,
   with resulting improvements in compaction efficiency.
 
 - In the series "make the hugetlb migration strategy consistent" Baolin
   Wang has fixed a hugetlb migration issue, which should improve hugetlb
   allocation reliability.
 
 - Liu Shixin has hit an I/O meltdown caused by readahead in a
   memory-tight memcg.  Addressed in the series "Fix I/O high when memory
   almost met memcg limit".
 
 - In the series "mm/filemap: optimize folio adding and splitting" Kairui
   Song has optimized pagecache insertion, yielding ~10% performance
   improvement in one test.
 
 - Baoquan He has cleaned up and consolidated the early zone
   initialization code in the series "mm/mm_init.c: refactor
   free_area_init_core()".
 
 - Baoquan has also redone some MM initializatio code in the series
   "mm/init: minor clean up and improvement".
 
 - MM helper cleanups from Christoph Hellwig in his series "remove
   follow_pfn".
 
 - More cleanups from Matthew Wilcox in the series "Various page->flags
   cleanups".
 
 - Vlastimil Babka has contributed maintainability improvements in the
   series "memcg_kmem hooks refactoring".
 
 - More folio conversions and cleanups in Matthew Wilcox's series
 
 	"Convert huge_zero_page to huge_zero_folio"
 	"khugepaged folio conversions"
 	"Remove page_idle and page_young wrappers"
 	"Use folio APIs in procfs"
 	"Clean up __folio_put()"
 	"Some cleanups for memory-failure"
 	"Remove page_mapping()"
 	"More folio compat code removal"
 
 - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb
   functions to work on folis".
 
 - Code consolidation and cleanup work related to GUP's handling of
   hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".
 
 - Rick Edgecombe has developed some fixes to stack guard gaps in the
   series "Cover a guard gap corner case".
 
 - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series
   "mm/ksm: fix ksm exec support for prctl".
 
 - Baolin Wang has implemented NUMA balancing for multi-size THPs.  This
   is a simple first-cut implementation for now.  The series is "support
   multi-size THP numa balancing".
 
 - Cleanups to vma handling helper functions from Matthew Wilcox in the
   series "Unify vma_address and vma_pgoff_address".
 
 - Some selftests maintenance work from Dev Jain in the series
   "selftests/mm: mremap_test: Optimizations and style fixes".
 
 - Improvements to the swapping of multi-size THPs from Ryan Roberts in
   the series "Swap-out mTHP without splitting".
 
 - Kefeng Wang has significantly optimized the handling of arm64's
   permission page faults in the series
 
 	"arch/mm/fault: accelerate pagefault when badaccess"
 	"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"
 
 - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it
   GUP-fast".
 
 - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to
   use struct vm_fault".
 
 - selftests build fixes from John Hubbard in the series "Fix
   selftests/mm build without requiring "make headers"".
 
 - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
   series "Improved Memory Tier Creation for CPUless NUMA Nodes".  Fixes
   the initialization code so that migration between different memory types
   works as intended.
 
 - David Hildenbrand has improved follow_pte() and fixed an errant driver
   in the series "mm: follow_pte() improvements and acrn follow_pte()
   fixes".
 
 - David also did some cleanup work on large folio mapcounts in his
   series "mm: mapcount for large folios + page_mapcount() cleanups".
 
 - Folio conversions in KSM in Alex Shi's series "transfer page to folio
   in KSM".
 
 - Barry Song has added some sysfs stats for monitoring multi-size THP's
   in the series "mm: add per-order mTHP alloc and swpout counters".
 
 - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled
   and limit checking cleanups".
 
 - Matthew Wilcox has been looking at buffer_head code and found the
   documentation to be lacking.  The series is "Improve buffer head
   documentation".
 
 - Multi-size THPs get more work, this time from Lance Yang.  His series
   "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes
   the freeing of these things.
 
 - Kemeng Shi has added more userspace-visible writeback instrumentation
   in the series "Improve visibility of writeback".
 
 - Kemeng Shi then sent some maintenance work on top in the series "Fix
   and cleanups to page-writeback".
 
 - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the
   series "Improve anon_vma scalability for anon VMAs".  Intel's test bot
   reported an improbable 3x improvement in one test.
 
 - SeongJae Park adds some DAMON feature work in the series
 
 	"mm/damon: add a DAMOS filter type for page granularity access recheck"
 	"selftests/damon: add DAMOS quota goal test"
 
 - Also some maintenance work in the series
 
 	"mm/damon/paddr: simplify page level access re-check for pageout"
 	"mm/damon: misc fixes and improvements"
 
 - David Hildenbrand has disabled some known-to-fail selftests ni the
   series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL".
 
 - memcg metadata storage optimizations from Shakeel Butt in "memcg:
   reduce memory consumption by memcg stats".
 
 - DAX fixes and maintenance work from Vishal Verma in the series
   "dax/bus.c: Fixups for dax-bus locking".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZkgQYwAKCRDdBJ7gKXxA
 jrdKAP9WVJdpEcXxpoub/vVE0UWGtffr8foifi9bCwrQrGh5mgEAx7Yf0+d/oBZB
 nvA4E0DcPrUAFy144FNM0NTCb7u9vAw=
 =V3R/
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:
 "The usual shower of singleton fixes and minor series all over MM,
  documented (hopefully adequately) in the respective changelogs.
  Notable series include:

   - Lucas Stach has provided some page-mapping cleanup/consolidation/
     maintainability work in the series "mm/treewide: Remove pXd_huge()
     API".

   - In the series "Allow migrate on protnone reference with
     MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
     MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
     one test.

   - In their series "Memory allocation profiling" Kent Overstreet and
     Suren Baghdasaryan have contributed a means of determining (via
     /proc/allocinfo) whereabouts in the kernel memory is being
     allocated: number of calls and amount of memory.

   - Matthew Wilcox has provided the series "Various significant MM
     patches" which does a number of rather unrelated things, but in
     largely similar code sites.

   - In his series "mm: page_alloc: freelist migratetype hygiene"
     Johannes Weiner has fixed the page allocator's handling of
     migratetype requests, with resulting improvements in compaction
     efficiency.

   - In the series "make the hugetlb migration strategy consistent"
     Baolin Wang has fixed a hugetlb migration issue, which should
     improve hugetlb allocation reliability.

   - Liu Shixin has hit an I/O meltdown caused by readahead in a
     memory-tight memcg. Addressed in the series "Fix I/O high when
     memory almost met memcg limit".

   - In the series "mm/filemap: optimize folio adding and splitting"
     Kairui Song has optimized pagecache insertion, yielding ~10%
     performance improvement in one test.

   - Baoquan He has cleaned up and consolidated the early zone
     initialization code in the series "mm/mm_init.c: refactor
     free_area_init_core()".

   - Baoquan has also redone some MM initializatio code in the series
     "mm/init: minor clean up and improvement".

   - MM helper cleanups from Christoph Hellwig in his series "remove
     follow_pfn".

   - More cleanups from Matthew Wilcox in the series "Various
     page->flags cleanups".

   - Vlastimil Babka has contributed maintainability improvements in the
     series "memcg_kmem hooks refactoring".

   - More folio conversions and cleanups in Matthew Wilcox's series:
	"Convert huge_zero_page to huge_zero_folio"
	"khugepaged folio conversions"
	"Remove page_idle and page_young wrappers"
	"Use folio APIs in procfs"
	"Clean up __folio_put()"
	"Some cleanups for memory-failure"
	"Remove page_mapping()"
	"More folio compat code removal"

   - David Hildenbrand chipped in with "fs/proc/task_mmu: convert
     hugetlb functions to work on folis".

   - Code consolidation and cleanup work related to GUP's handling of
     hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".

   - Rick Edgecombe has developed some fixes to stack guard gaps in the
     series "Cover a guard gap corner case".

   - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
     series "mm/ksm: fix ksm exec support for prctl".

   - Baolin Wang has implemented NUMA balancing for multi-size THPs.
     This is a simple first-cut implementation for now. The series is
     "support multi-size THP numa balancing".

   - Cleanups to vma handling helper functions from Matthew Wilcox in
     the series "Unify vma_address and vma_pgoff_address".

   - Some selftests maintenance work from Dev Jain in the series
     "selftests/mm: mremap_test: Optimizations and style fixes".

   - Improvements to the swapping of multi-size THPs from Ryan Roberts
     in the series "Swap-out mTHP without splitting".

   - Kefeng Wang has significantly optimized the handling of arm64's
     permission page faults in the series
	"arch/mm/fault: accelerate pagefault when badaccess"
	"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"

   - GUP cleanups from David Hildenbrand in "mm/gup: consistently call
     it GUP-fast".

   - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
     path to use struct vm_fault".

   - selftests build fixes from John Hubbard in the series "Fix
     selftests/mm build without requiring "make headers"".

   - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
     series "Improved Memory Tier Creation for CPUless NUMA Nodes".
     Fixes the initialization code so that migration between different
     memory types works as intended.

   - David Hildenbrand has improved follow_pte() and fixed an errant
     driver in the series "mm: follow_pte() improvements and acrn
     follow_pte() fixes".

   - David also did some cleanup work on large folio mapcounts in his
     series "mm: mapcount for large folios + page_mapcount() cleanups".

   - Folio conversions in KSM in Alex Shi's series "transfer page to
     folio in KSM".

   - Barry Song has added some sysfs stats for monitoring multi-size
     THP's in the series "mm: add per-order mTHP alloc and swpout
     counters".

   - Some zswap cleanups from Yosry Ahmed in the series "zswap
     same-filled and limit checking cleanups".

   - Matthew Wilcox has been looking at buffer_head code and found the
     documentation to be lacking. The series is "Improve buffer head
     documentation".

   - Multi-size THPs get more work, this time from Lance Yang. His
     series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
     optimizes the freeing of these things.

   - Kemeng Shi has added more userspace-visible writeback
     instrumentation in the series "Improve visibility of writeback".

   - Kemeng Shi then sent some maintenance work on top in the series
     "Fix and cleanups to page-writeback".

   - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
     the series "Improve anon_vma scalability for anon VMAs". Intel's
     test bot reported an improbable 3x improvement in one test.

   - SeongJae Park adds some DAMON feature work in the series
	"mm/damon: add a DAMOS filter type for page granularity access recheck"
	"selftests/damon: add DAMOS quota goal test"

   - Also some maintenance work in the series
	"mm/damon/paddr: simplify page level access re-check for pageout"
	"mm/damon: misc fixes and improvements"

   - David Hildenbrand has disabled some known-to-fail selftests ni the
     series "selftests: mm: cow: flag vmsplice() hugetlb tests as
     XFAIL".

   - memcg metadata storage optimizations from Shakeel Butt in "memcg:
     reduce memory consumption by memcg stats".

   - DAX fixes and maintenance work from Vishal Verma in the series
     "dax/bus.c: Fixups for dax-bus locking""

* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
  memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
  selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
  selftests: cgroup: add tests to verify the zswap writeback path
  mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
  mm/damon/core: fix return value from damos_wmark_metric_value
  mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
  selftests: cgroup: remove redundant enabling of memory controller
  Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
  Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
  Docs/mm/damon/design: use a list for supported filters
  Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
  Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
  selftests/damon: classify tests for functionalities and regressions
  selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
  selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
  selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
  mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
  selftests/damon: add a test for DAMOS quota goal
  ...
2024-05-19 09:21:03 -07:00
Linus Torvalds
ff9a79307f Kbuild updates for v6.10
- Avoid 'constexpr', which is a keyword in C23
 
  - Allow 'dtbs_check' and 'dt_compatible_check' run independently of
    'dt_binding_check'
 
  - Fix weak references to avoid GOT entries in position-independent
    code generation
 
  - Convert the last use of 'optional' property in arch/sh/Kconfig
 
  - Remove support for the 'optional' property in Kconfig
 
  - Remove support for Clang's ThinLTO caching, which does not work with
    the .incbin directive
 
  - Change the semantics of $(src) so it always points to the source
    directory, which fixes Makefile inconsistencies between upstream and
    downstream
 
  - Fix 'make tar-pkg' for RISC-V to produce a consistent package
 
  - Provide reasonable default coverage for objtool, sanitizers, and
    profilers
 
  - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc.
 
  - Remove the last use of tristate choice in drivers/rapidio/Kconfig
 
  - Various cleanups and fixes in Kconfig
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmZFlGcVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsG8voQALC8NtFpduWVfLRj2Qg6Ll/xf1vX
 2igcTJEOFHkeqXLGoT8dTDKLEipUBUvKyguPq66CGwVTe2g6zy/nUSXeVtFrUsIa
 msLTi8FqhqUo5lodNvGMRf8qqmuqcvnXoiQwIocF92jtsFy14bhiFY+n4HfcFNjj
 GOKwqBZYQUwY/VVb090efc7RfS9c7uwABJSBelSoxg3AGZriwjGy7Pw5aSKGgVYi
 inqL1eR6qwPP6z7CgQWM99soP+zwybFZmnQrsD9SniRBI4rtAat8Ih5jQFaSUFUQ
 lk2w0NQBRFN88/uR2IJ2GWuIlQ74WeJ+QnCqVuQ59tV5zw90wqSmLzngfPD057Dv
 JjNuhk0UyXVtpIg3lRtd4810ppNSTe33b9OM4O2H846W/crju5oDRNDHcflUXcwm
 Rmn5ho1rb5QVzDVejJbgwidnUInSgJ9PZcvXQ/RJVZPhpgsBzAY9pQexG1G3hviw
 y9UDrt6KP6bF9tHjmolmtdIes9Pj0c4dN6/Rdj4HS4hIQ/GDar0tnwvOvtfUctNL
 orJlBsA6GeMmDVXKkR0ytOCWRYqWWbyt8g70RVKQJfuHX7/hGyAQPaQ2/u4mQhC2
 aevYfbNJMj0VDfGz81HDBKFtkc5n+Ite8l157dHEl2LEabkOkRdNVcn7SNbOvZmd
 ZCSnZ31h7woGfNho
 =D5B/
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - Avoid 'constexpr', which is a keyword in C23

 - Allow 'dtbs_check' and 'dt_compatible_check' run independently of
   'dt_binding_check'

 - Fix weak references to avoid GOT entries in position-independent code
   generation

 - Convert the last use of 'optional' property in arch/sh/Kconfig

 - Remove support for the 'optional' property in Kconfig

 - Remove support for Clang's ThinLTO caching, which does not work with
   the .incbin directive

 - Change the semantics of $(src) so it always points to the source
   directory, which fixes Makefile inconsistencies between upstream and
   downstream

 - Fix 'make tar-pkg' for RISC-V to produce a consistent package

 - Provide reasonable default coverage for objtool, sanitizers, and
   profilers

 - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc.

 - Remove the last use of tristate choice in drivers/rapidio/Kconfig

 - Various cleanups and fixes in Kconfig

* tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits)
  kconfig: use sym_get_choice_menu() in sym_check_prop()
  rapidio: remove choice for enumeration
  kconfig: lxdialog: remove initialization with A_NORMAL
  kconfig: m/nconf: merge two item_add_str() calls
  kconfig: m/nconf: remove dead code to display value of bool choice
  kconfig: m/nconf: remove dead code to display children of choice members
  kconfig: gconf: show checkbox for choice correctly
  kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal
  Makefile: remove redundant tool coverage variables
  kbuild: provide reasonable defaults for tool coverage
  modules: Drop the .export_symbol section from the final modules
  kconfig: use menu_list_for_each_sym() in sym_check_choice_deps()
  kconfig: use sym_get_choice_menu() in conf_write_defconfig()
  kconfig: add sym_get_choice_menu() helper
  kconfig: turn defaults and additional prompt for choice members into error
  kconfig: turn missing prompt for choice members into error
  kconfig: turn conf_choice() into void function
  kconfig: use linked list in sym_set_changed()
  kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED
  kconfig: gconf: remove debug code
  ...
2024-05-18 12:39:20 -07:00
Linus Torvalds
70a663205d Probes updates for v6.10:
- tracing/probes: Adding new pseudo-types %pd and %pD support for dumping
   dentry name from 'struct dentry *' and file name from 'struct file *'.
 
 - uprobes: Some performance optimizations have been done.
  . Speed up the BPF uprobe event by delaying the fetching of the uprobe
    event arguments that are not used in BPF.
  . Avoid locking by speculatively checking whether uprobe event is valid.
  . Reduce lock contention by using read/write_lock instead of spinlock for
    uprobe list operation. This improved BPF uprobe benchmark result 43% on
    average.
 
 - rethook: Removes non-fatal warning messages when tracing stack from BPF
   and skip rcu_is_watching() validation in rethook if possible.
 
 - objpool: Optimizing objpool (which is used by kretprobes and fprobe as
   rethook backend storage) by inlining functions and avoid caching nr_cpu_ids
   because it is a const value.
 
 - fprobe: Add entry/exit callbacks types (code cleanup)
 - kprobes: Check ftrace was killed in kprobes if it uses ftrace.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmZFUxsbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b+fIH/A96/SeC5WRLhXmHfTCM
 IvKUea2n0b0oV/2pVfHqfkCBTICuUZ97Opd9VH9jLtjBOTh0fUOGZ2DNVGdSYfWm
 IIkS5dhuZxHXrSHEVYykwLHI3AOL7Q6Ny9EmOg1CNMidUkPMNtBvppsBYPlFU/B/
 qQJAvOdkVOnNITCaas0+MNgepoVVKdJzdNQ1I4WrGyG8isCZBaCYKo2QcGyheCNN
 y8NXvnVHgmgHQ8nTaeE5AawclFzFnhwHfPQPe1kiyGrx15b8K+VYmaZxPKv33A1a
 KT3TKJ1Ep7s7iWFh2iPVJzIwOXCmSnvNTKfNx/MDuKtO7UVfFwytoMEaekbmv3bG
 VqM=
 =n/mW
 -----END PGP SIGNATURE-----

Merge tag 'probes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:

 - tracing/probes: Add new pseudo-types %pd and %pD support for dumping
   dentry name from 'struct dentry *' and file name from 'struct file *'

 - uprobes performance optimizations:
    - Speed up the BPF uprobe event by delaying the fetching of the
      uprobe event arguments that are not used in BPF
    - Avoid locking by speculatively checking whether uprobe event is
      valid
    - Reduce lock contention by using read/write_lock instead of
      spinlock for uprobe list operation. This improved BPF uprobe
      benchmark result 43% on average

 - rethook: Remove non-fatal warning messages when tracing stack from
   BPF and skip rcu_is_watching() validation in rethook if possible

 - objpool: Optimize objpool (which is used by kretprobes and fprobe as
   rethook backend storage) by inlining functions and avoid caching
   nr_cpu_ids because it is a const value

 - fprobe: Add entry/exit callbacks types (code cleanup)

 - kprobes: Check ftrace was killed in kprobes if it uses ftrace

* tag 'probes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  kprobe/ftrace: bail out if ftrace was killed
  selftests/ftrace: Fix required features for VFS type test case
  objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids
  objpool: enable inlining objpool_push() and objpool_pop() operations
  rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
  ftrace: make extra rcu_is_watching() validation check optional
  uprobes: reduce contention on uprobes_tree access
  rethook: Remove warning messages printed for finding return address of a frame.
  fprobe: Add entry/exit callbacks types
  selftests/ftrace: add fprobe test cases for VFS type "%pd" and "%pD"
  selftests/ftrace: add kprobe test cases for VFS type "%pd" and "%pD"
  Documentation: tracing: add new type '%pd' and '%pD' for kprobe
  tracing/probes: support '%pD' type for print struct file's name
  tracing/probes: support '%pd' type for print struct dentry's name
  uprobes: add speculative lockless system-wide uprobe filter check
  uprobes: prepare uprobe args buffer lazily
  uprobes: encapsulate preparation of uprobe args buffer
2024-05-17 18:29:30 -07:00
Heiko Carstens
d890e6af50 s390/kprobes: Remove custom insn slot allocator
Since commit c98d2ecae0 ("s390/mm: Uncouple physical vs virtual address
spaces") the kernel image and module area are within the same 4GB area.

This eliminates the need of a custom insn slot allocator for kprobes within
the kernel image, since standard module_alloc() allocated pages are
sufficient for PC relative instructions with a signed 32 bit offset.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-05-16 10:31:32 +02:00
Sven Schnelle
e7dec0b792 s390/boot: Remove alt_stfle_fac_list from decompressor
It is nowhere used in the decompressor, therefore remove it.

Fixes: 17e89e1340 ("s390/facilities: move stfl information from lowcore to global data")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-05-16 10:17:12 +02:00
Stephen Brennan
1a7d0890dd kprobe/ftrace: bail out if ftrace was killed
If an error happens in ftrace, ftrace_kill() will prevent disarming
kprobes. Eventually, the ftrace_ops associated with the kprobes will be
freed, yet the kprobes will still be active, and when triggered, they
will use the freed memory, likely resulting in a page fault and panic.

This behavior can be reproduced quite easily, by creating a kprobe and
then triggering a ftrace_kill(). For simplicity, we can simulate an
ftrace error with a kernel module like [1]:

[1]: https://github.com/brenns10/kernel_stuff/tree/master/ftrace_killer

  sudo perf probe --add commit_creds
  sudo perf trace -e probe:commit_creds
  # In another terminal
  make
  sudo insmod ftrace_killer.ko  # calls ftrace_kill(), simulating bug
  # Back to perf terminal
  # ctrl-c
  sudo perf probe --del commit_creds

After a short period, a page fault and panic would occur as the kprobe
continues to execute and uses the freed ftrace_ops. While ftrace_kill()
is supposed to be used only in extreme circumstances, it is invoked in
FTRACE_WARN_ON() and so there are many places where an unexpected bug
could be triggered, yet the system may continue operating, possibly
without the administrator noticing. If ftrace_kill() does not panic the
system, then we should do everything we can to continue operating,
rather than leave a ticking time bomb.

Link: https://lore.kernel.org/all/20240501162956.229427-1-stephen.s.brennan@oracle.com/

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-16 07:23:30 +09:00
Alexander Egorenkov
24bf80f78e s390/ipl: Introduce sysfs attribute 'scp_data' for dump ipl
This is analogous to the reipl's sysfs attribute named equally and enables
tools such as s390-tools' dumpconf to pass additional kernel cmdline
parameters to a stand-alone dumper such as zfcpdump (e.g. to enable
debug output with 'dump_debug' parameter) or ngdump.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:21:55 +02:00
Alexander Egorenkov
005ca01333 s390/ipl: Introduce macros for (re)ipl sysfs attribute 'scp_data'
This is a refactoring change to reduce code duplication and improve code
reuse.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:21:55 +02:00
Alexander Egorenkov
7faacaeaf6 s390/ipl: Fix incorrect initialization of nvme dump block
Initialize the correct fields of the nvme dump block.
This bug had not been detected before because first, the fcp and nvme fields
of struct ipl_parameter_block are part of the same union and, therefore,
overlap in memory and second, they are identical in structure and size.

Fixes: d70e38cb1d ("s390: nvme dump support")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:21:55 +02:00
Alexander Egorenkov
9c922b73ac s390/ipl: Fix incorrect initialization of len fields in nvme reipl block
Use correct symbolic constants IPL_BP_NVME_LEN and IPL_BP0_NVME_LEN
to initialize nvme reipl block when 'scp_data' sysfs attribute is
being updated. This bug had not been detected before because
the corresponding fcp and nvme symbolic constants are equal.

Fixes: 23a457b8d5 ("s390: nvme reipl")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:21:54 +02:00
Alexander Egorenkov
247576bf62 s390/ipl: Do not accept z/VM CP diag X'008' cmds longer than max length
The old implementation of vmcmd sysfs string attributes truncated passed
z/VM CP diagnose X'008' commands which were longer than the max allowed
number of characters but the reported number of written characters was
still equal to the entire length of a given string. This can result in
silent failures of some s390-tools (e.g. dumpconf) which can be very hard
to detect. Therefore, this commit makes a write attempt to a vmcmd sysfs
attribute
* fail with E2BIG error if a given string is longer than the maximum
  allowed one
* never destroy the old data in the vmcmd sysfs attribute if the new data
  doesn't fit into it entirely
* return the actual number of written characters if it succeeds

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:21:54 +02:00
Alexander Egorenkov
72935e3ada s390/ipl: Fix size of vmcmd buffers for sending z/VM CP diag X'008' cmds
z/VM CP diagnose X'008' accepts commands of max 240 characters.
Using a smaller value as a buffer size makes kernel send truncated CP
commands which are longer than the old buffer size. This can result in
invalid CP commands passed to z/VM.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:21:54 +02:00
Heiko Carstens
207ddb9189 s390/alternatives: Convert runtime sanity check into compile time check
__apply_alternatives() contains a runtime check which verifies that the
size of the to be patched code area is even. Convert this to a compile time
check using a similar ".org" trick, which is already used to verify that
old and new code areas have the same size.

Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:21:23 +02:00
Sven Schnelle
1084562ec8 s390/irq: Set CIF_NOHZ_DELAY in do_io_irq()
Both do_airq_interrupt() and do_io_interrupt() set
CIF_NOHZ_DELAY. Move it to do_io_irq() to simplify
the code.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:19:46 +02:00