This reverts commits 5c9d2d5c26, c7da82b894, and e7fe7b5cae.
We'll probably need to revisit this, but basically we should not
complicate the get_user_pages_fast() case, and checking the actual page
table protection key bits will require more care anyway, since the
protection keys depend on the exact state of the VM in question.
Particularly when doing a "remote" page lookup (ie in somebody elses VM,
not your own), you need to be much more careful than this was. Dave
Hansen says:
"So, the underlying bug here is that we now a get_user_pages_remote()
and then go ahead and do the p*_access_permitted() checks against the
current PKRU. This was introduced recently with the addition of the
new p??_access_permitted() calls.
We have checks in the VMA path for the "remote" gups and we avoid
consulting PKRU for them. This got missed in the pkeys selftests
because I did a ptrace read, but not a *write*. I also didn't
explicitly test it against something where a COW needed to be done"
It's also not entirely clear that it makes sense to check the protection
key bits at this level at all. But one possible eventual solution is to
make the get_user_pages_fast() case just abort if it sees protection key
bits set, which makes us fall back to the regular get_user_pages() case,
which then has a vma and can do the check there if we want to.
We'll see.
Somewhat related to this all: what we _do_ want to do some day is to
check the PAGE_USER bit - it should obviously always be set for user
pages, but it would be a good check to have back. Because we have no
generic way to test for it, we lost it as part of moving over from the
architecture-specific x86 GUP implementation to the generic one in
commit e585513b76 ("x86/mm/gup: Switch GUP to the generic
get_user_page_fast() implementation").
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The 'access_permitted' helper is used in the gup-fast path and goes
beyond the simple _PAGE_RW check to also:
- validate that the mapping is writable from a protection keys
standpoint
- validate that the pte has _PAGE_USER set since all fault paths where
pud_write is must be referencing user-memory.
[dan.j.williams@intel.com: fix powerpc compile error]
Link: http://lkml.kernel.org/r/151129127237.37405.16073414520854722485.stgit@dwillia2-desk3.amr.corp.intel.com
Link: http://lkml.kernel.org/r/151043110453.2842.2166049702068628177.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In response to compile breakage introduced by a series that added the
pud_write helper to x86, Stephen notes:
did you consider using the other paradigm:
In arch include files:
#define pud_write pud_write
static inline int pud_write(pud_t pud)
.....
Then in include/asm-generic/pgtable.h:
#ifndef pud_write
tatic inline int pud_write(pud_t pud)
{
....
}
#endif
If you had, then the powerpc code would have worked ... ;-) and many
of the other interfaces in include/asm-generic/pgtable.h are
protected that way ...
Given that some architecture already define pmd_write() as a macro, it's
a net reduction to drop the definition of __HAVE_ARCH_PMD_WRITE.
Link: http://lkml.kernel.org/r/151129126721.37405.13339850900081557813.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 227be799c3 ("s390/mm: uninline pmdp_xxx functions from pgtable.h")
inadvertently changed the behavior of pmdp_invalidate(), so that it now
clears the pmd instead of just marking it as invalid. Fix this by restoring
the original behavior.
A possible impact of the misbehaving pmdp_invalidate() would be the
MADV_DONTNEED races (see commits ced10803 and 58ceeb6b), although we
should not have any negative impact on the related dirty/young flags,
since those flags are not set by the hardware on s390.
Fixes: 227be799c3 ("s390/mm: uninline pmdp_xxx functions from pgtable.h")
Cc: <stable@vger.kernel.org> # v4.6+
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Right now there is a potential hang situation for postcopy migrations,
if the guest is enabling storage keys on the target system during the
postcopy process.
For storage key virtualization, we have to forbid the empty zero page as
the storage key is a property of the physical page frame. As we enable
storage key handling lazily we then drop all mappings for empty zero
pages for lazy refaulting later on.
This does not work with the postcopy migration, which relies on the
empty zero page never triggering a fault again in the future. The reason
is that postcopy migration will simply read a page on the target system
if that page is a known zero page to fault in an empty zero page. At
the same time postcopy remembers that this page was already transferred
- so any future userfault on that page will NOT be retransmitted again
to avoid races.
If now the guest enters the storage key mode while in postcopy, we will
break this assumption of postcopy.
The solution is to disable the empty zero page for KVM guests early on
and not during storage key enablement. With this change, the postcopy
migration process is guaranteed to start after no zero pages are left.
As guest pages are very likely not empty zero pages anyway the memory
overhead is also pretty small.
While at it this also adds proper page table locking to the zero page
removal.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add various defines like e.g. _REGION1_SHIFT to reflect the hardware
mmu. We have quite a bit code that does not make use of the Linux
memory management primitives but directly modifies page, segment and
region values.
Most of this is open-coded like e.g. "1UL << 53". In order to clean
this up introduce a couple of new defines. The existing Linux memory
management defines are changed, so the mapping to the hardware
implementation is reflected.
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
We have C code also outside of #ifndef __ASSEMBLY__. So these
guards seem to be quite pointless and can be removed.
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Introduce and use p?d_folded() functions to clarify the page table
code a bit more.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
_REGION3_ENTRY_ORIGIN defines a wrong mask which can be used to
extract a segment table origin from a region 3 table entry. It removes
only the lower 11 instead of 12 bits from a region 3 table entry.
Luckily this bit is currently always zero, so nothing bad happened yet.
In order to avoid future bugs just remove the region 3 specific mask
and use the correct generic _REGION_ENTRY_ORIGIN mask.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add the logic to upgrade the page table for a 64-bit process to
five levels. This increases the TASK_SIZE from 8PB to 16EB-4K.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Pull s390 updates from Martin Schwidefsky:
- three merges for KVM/s390 with changes for vfio-ccw and cpacf. The
patches are included in the KVM tree as well, let git sort it out.
- add the new 'trng' random number generator
- provide the secure key verification API for the pkey interface
- introduce the z13 cpu counters to perf
- add a new system call to set up the guarded storage facility
- simplify TASK_SIZE and arch_get_unmapped_area
- export the raw STSI data related to CPU topology to user space
- ... and the usual churn of bug-fixes and cleanups.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (74 commits)
s390/crypt: use the correct module alias for paes_s390.
s390/cpacf: Introduce kma instruction
s390/cpacf: query instructions use unique parameters for compatibility with KMA
s390/trng: Introduce s390 TRNG device driver.
s390/crypto: Provide s390 specific arch random functionality.
s390/crypto: Add new subfunctions to the cpacf PRNO function.
s390/crypto: Renaming PPNO to PRNO.
s390/pageattr: avoid unnecessary page table splitting
s390/mm: simplify arch_get_unmapped_area[_topdown]
s390/mm: make TASK_SIZE independent from the number of page table levels
s390/gs: add regset for the guarded storage broadcast control block
s390/kvm: Add use_cmma field to mm_context_t
s390/kvm: Add PGSTE manipulation functions
vfio: ccw: improve error handling for vfio_ccw_mdev_remove
vfio: ccw: remove unnecessary NULL checks of a pointer
s390/spinlock: remove compare and delay instruction
s390/spinlock: use atomic primitives for spinlocks
s390/cpumf: simplify detection of guest samples
s390/pci: remove forward declaration
s390/pci: increase the PCI_NR_FUNCTIONS default
...
Add PGSTE manipulation functions:
* set_pgste_bits sets specific bits in a PGSTE
* get_pgste returns the whole PGSTE
* pgste_perform_essa manipulates a PGSTE to set specific storage states
* ESSA_[SG]ET_* macros used to indicate the action for manipulate_pgste
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Reviewed-by: Janosch Frank <frankja@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On heavy paging with KSM I see guest data corruption. Turns out that
KSM will add pages to its tree, where the mapping return true for
pte_unused (or might become as such later). KSM will unmap such pages
and reinstantiate with different attributes (e.g. write protected or
special, e.g. in replace_page or write_protect_page)). This uncovered
a bug in our pagetable handling: We must remove the unused flag as
soon as an entry becomes present again.
Cc: stable@vger.kernel.org
Signed-of-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
If an architecture uses 4level-fixup.h we don't need to do anything as
it includes 5level-fixup.h.
If an architecture uses pgtable-nop*d.h, define __ARCH_USE_5LEVEL_HACK
before inclusion of the header. It makes asm-generic code to use
5level-fixup.h.
If an architecture has 4-level paging or folds levels on its own,
include 5level-fixup.h directly.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
_SEGMENT_ENTRY_INVALID denotes the invalid bit in a segment table
entry whereas _SEGMENT_ENTRY_EMPTY means that the value of the whole
entry is only the invalid bit, as the entry is completely empty.
Therefore we use _SEGMENT_ENTRY_INVALID only to check and set the
invalid bit with bitwise operations. _SEGMENT_ENTRY_EMPTY is only used
to check for (un)equality.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Both MACHINE_HAS_PFMF and MACHINE_HAS_HPAGE are just an alias for
MACHINE_HAS_EDAT1. So simply use MACHINE_HAS_EDAT1 instead.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Bit 0x100 of a page table, segment table of region table entry
can be used to disallow code execution for the virtual addresses
associated with the entry.
There is one tricky bit, the system call to return from a signal
is part of the signal frame written to the user stack. With a
non-executable stack this would stop working. To avoid breaking
things the protection fault handler checks the opcode that caused
the fault for 0x0a77 (sys_sigreturn) and 0x0aad (sys_rt_sigreturn)
and injects a system call. This is preferable to the alternative
solution with a stub function in the vdso because it works for
vdso=off and statically linked binaries as well.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Merge the __p[m|u]xdp_idte and __p[m|u]dp_idte_local functions into a
single __p[m|u]dp_idte function with an additional parameter.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Merge the __ptep_ipte and __ptep_ipte_local functions into a single
__ptep_ipte function with an additional parameter. The __pte_ipte_range
function is still extra as the while loops makes it hard to merge.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
VGIC implementation.
- s390: support for trapping software breakpoints, nested virtualization
(vSIE), the STHYI opcode, initial extensions for CPU model support.
- MIPS: support for MIPS64 hosts (32-bit guests only) and lots of cleanups,
preliminary to this and the upcoming support for hardware virtualization
extensions.
- x86: support for execute-only mappings in nested EPT; reduced vmexit
latency for TSC deadline timer (by about 30%) on Intel hosts; support for
more than 255 vCPUs.
- PPC: bugfixes.
The ugly bit is the conflicts. A couple of them are simple conflicts due
to 4.7 fixes, but most of them are with other trees. There was definitely
too much reliance on Acked-by here. Some conflicts are for KVM patches
where _I_ gave my Acked-by, but the worst are for this pull request's
patches that touch files outside arch/*/kvm. KVM submaintainers should
probably learn to synchronize better with arch maintainers, with the
latter providing topic branches whenever possible instead of Acked-by.
This is what we do with arch/x86. And I should learn to refuse pull
requests when linux-next sends scary signals, even if that means that
submaintainers have to rebase their branches.
Anyhow, here's the list:
- arch/x86/kvm/vmx.c: handle_pcommit and EXIT_REASON_PCOMMIT was removed
by the nvdimm tree. This tree adds handle_preemption_timer and
EXIT_REASON_PREEMPTION_TIMER at the same place. In general all mentions
of pcommit have to go.
There is also a conflict between a stable fix and this patch, where the
stable fix removed the vmx_create_pml_buffer function and its call.
- virt/kvm/kvm_main.c: kvm_cpu_notifier was removed by the hotplug tree.
This tree adds kvm_io_bus_get_dev at the same place.
- virt/kvm/arm/vgic.c: a few final bugfixes went into 4.7 before the
file was completely removed for 4.8.
- include/linux/irqchip/arm-gic-v3.h: this one is entirely our fault;
this is a change that should have gone in through the irqchip tree and
pulled by kvm-arm. I think I would have rejected this kvm-arm pull
request. The KVM version is the right one, except that it lacks
GITS_BASER_PAGES_SHIFT.
- arch/powerpc: what a mess. For the idle_book3s.S conflict, the KVM
tree is the right one; everything else is trivial. In this case I am
not quite sure what went wrong. The commit that is causing the mess
(fd7bacbca4, "KVM: PPC: Book3S HV: Fix TB corruption in guest exit
path on HMI interrupt", 2016-05-15) touches both arch/powerpc/kernel/
and arch/powerpc/kvm/. It's large, but at 396 insertions/5 deletions
I guessed that it wasn't really possible to split it and that the 5
deletions wouldn't conflict. That wasn't the case.
- arch/s390: also messy. First is hypfs_diag.c where the KVM tree
moved some code and the s390 tree patched it. You have to reapply the
relevant part of commits 6c22c98637, plus all of e030c1125e, to
arch/s390/kernel/diag.c. Or pick the linux-next conflict
resolution from http://marc.info/?l=kvm&m=146717549531603&w=2.
Second, there is a conflict in gmap.c between a stable fix and 4.8.
The KVM version here is the correct one.
I have pushed my resolution at refs/heads/merge-20160802 (commit
3d1f53419842) at git://git.kernel.org/pub/scm/virt/kvm/kvm.git.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJXoGm7AAoJEL/70l94x66DugQIAIj703ePAFepB/fCrKHkZZia
SGrsBdvAtNsOhr7FQ5qvvjLxiv/cv7CymeuJivX8H+4kuUHUllDzey+RPHYHD9X7
U6n1PdCH9F15a3IXc8tDjlDdOMNIKJixYuq1UyNZMU6NFwl00+TZf9JF8A2US65b
x/41W98ilL6nNBAsoDVmCLtPNWAqQ3lajaZELGfcqRQ9ZGKcAYOaLFXHv2YHf2XC
qIDMf+slBGSQ66UoATnYV2gAopNlWbZ7n0vO6tE2KyvhHZ1m399aBX1+k8la/0JI
69r+Tz7ZHUSFtmlmyByi5IAB87myy2WQHyAPwj+4vwJkDGPcl0TrupzbG7+T05Y=
=42ti
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
- ARM: GICv3 ITS emulation and various fixes. Removal of the
old VGIC implementation.
- s390: support for trapping software breakpoints, nested
virtualization (vSIE), the STHYI opcode, initial extensions
for CPU model support.
- MIPS: support for MIPS64 hosts (32-bit guests only) and lots
of cleanups, preliminary to this and the upcoming support for
hardware virtualization extensions.
- x86: support for execute-only mappings in nested EPT; reduced
vmexit latency for TSC deadline timer (by about 30%) on Intel
hosts; support for more than 255 vCPUs.
- PPC: bugfixes.
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (302 commits)
KVM: PPC: Introduce KVM_CAP_PPC_HTM
MIPS: Select HAVE_KVM for MIPS64_R{2,6}
MIPS: KVM: Reset CP0_PageMask during host TLB flush
MIPS: KVM: Fix ptr->int cast via KVM_GUEST_KSEGX()
MIPS: KVM: Sign extend MFC0/RDHWR results
MIPS: KVM: Fix 64-bit big endian dynamic translation
MIPS: KVM: Fail if ebase doesn't fit in CP0_EBase
MIPS: KVM: Use 64-bit CP0_EBase when appropriate
MIPS: KVM: Set CP0_Status.KX on MIPS64
MIPS: KVM: Make entry code MIPS64 friendly
MIPS: KVM: Use kmap instead of CKSEG0ADDR()
MIPS: KVM: Use virt_to_phys() to get commpage PFN
MIPS: Fix definition of KSEGX() for 64-bit
KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD
kvm: x86: nVMX: maintain internal copy of current VMCS
KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE
KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures
KVM: arm64: vgic-its: Simplify MAPI error handling
KVM: arm64: vgic-its: Make vgic_its_cmd_handle_mapi similar to other handlers
KVM: arm64: vgic-its: Turn device_id validation into generic ID validation
...
The hugetlbfs pte<->pmd conversion functions currently assume that the pmd
bit layout is consistent with the pte layout, which is not really true.
The SW read and write bits are encoded as the sequence "wr" in a pte, but
in a pmd it is "rw". The hugetlbfs conversion assumes that the sequence
is identical in both cases, which results in swapped read and write bits
in the pmd. In practice this is not a problem, because those pmd bits are
only relevant for THP pmds and not for hugetlbfs pmds. The hugetlbfs code
works on (fake) ptes, and the converted pte bits are correct.
There is another variation in pte/pmd encoding which affects dirty
prot-none ptes/pmds. In this case, a pmd has both its HW read-only and
invalid bit set, while it is only the invalid bit for a pte. This also has
no effect in practice, but it should better be consistent.
This patch fixes both inconsistencies by changing the SW read/write bit
layout for pmds as well as the PAGE_NONE encoding for ptes. It also makes
the hugetlbfs conversion functions more robust by introducing a
move_set_bit() macro that uses the pte/pmd bit #defines instead of
constant shifts.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This adds support for 2GB hugetlbfs pages on s390.
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
We really want to avoid manually handling protection for nested
virtualization. By shadowing pages with the protection the guest asked us
for, the SIE can handle most protection-related actions for us (e.g.
special handling for MVPG) and we can directly forward protection
exceptions to the guest.
PTEs will now always be shadowed with the correct _PAGE_PROTECT flag.
Unshadowing will take care of any guest changes to the parent PTE and
any host changes to the host PTE. If the host PTE doesn't have the
fitting access rights or is not available, we have to fix it up.
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
For a nested KVM guest the outer KVM host needs to create shadow
page tables for the nested guest. This patch adds the basic support
to the guest address space (gmap) code.
For each guest address space the inner KVM host creates, the first
outer KVM host needs to create shadow page tables. The address space
is identified by the ASCE loaded into the control register 1 at the
time the inner SIE instruction for the second nested KVM guest is
executed. The outer KVM host creates the shadow tables starting with
the table identified by the ASCE on a on-demand basis. The outer KVM
host will get repeated faults for all the shadow tables needed to
run the second KVM guest.
While a shadow page table for the second KVM guest is active the access
to the origin region, segment and page tables needs to be restricted
for the first KVM guest. For region and segment and page tables the first
KVM guest may read the memory, but write attempt has to lead to an
unshadow. This is done using the page invalid and read-only bits in the
page table of the first KVM guest. If the first guest re-accesses one of
the origin pages of a shadow, it gets a fault and the affected parts of
the shadow page table hierarchy needs to be removed again.
PGSTE tables don't have to be shadowed, as all interpretation assist can't
deal with the invalid bits in the shadow pte being set differently than
the original ones provided by the first KVM guest.
Many bug fixes and improvements by David Hildenbrand.
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
The current gmap pte notifier forces a pte into to a read-write state.
If the pte is invalidated the gmap notifier is called to inform KVM
that the mapping will go away.
Extend this approach to allow read-write, read-only and no-access
as possible target states and call the pte notifier for any change
to the pte.
This mechanism is used to temporarily set specific access rights for
a pte without doing the heavy work of a true mprotect call.
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
The segment/region table that is part of the kernel image must be
properly aligned to 16k in order to make the crdte inline assembly
work.
Otherwise it will calculate a wrong segment/region table start address
and access incorrect memory locations if the swapper_pg_dir is not
aligned to 16k.
Therefore define BSS_FIRST_SECTIONS in order to put the swapper_pg_dir
at the beginning of the bss section and also align the bss section to
16k just like other architectures did.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add statistics that show how memory is mapped within the kernel
identity mapping. This is more or less the same like git
commit ce0c0e50f9 ("x86, generic: CPA add statistics about state
of direct mapping v4") for x86.
I also intentionally copied the lower case "k" within DirectMap4k vs
the upper case "M" and "G" within the two other lines. Let's have
consistent inconsistencies across architectures.
The output of /proc/meminfo now contains these additional lines:
DirectMap4k: 2048 kB
DirectMap1M: 3991552 kB
DirectMap2G: 4194304 kB
The implementation on s390 is lockless unlike the x86 version, since I
assume changes to the kernel mapping are a very rare event. Therefore
it really doesn't matter if these statistics could potentially be
inconsistent if read while kernel pages tables are being changed.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
set_memory_ro() and set_memory_rw() currently only work on 4k
mappings, which is good enough for module code aka the vmalloc area.
However we stumbled already twice into the need to make this also work
on larger mappings:
- the ro after init patch set
- the crash kernel resize code
Therefore this patch implements automatic kernel page table splitting
if e.g. set_memory_ro() would be called on parts of a 2G mapping.
This works quite the same as the x86 code, but is much simpler.
In order to make this work and to be architecturally compliant we now
always use the csp, cspg or crdte instructions to replace valid page
table entries. This means that set_memory_ro() and set_memory_rw()
will be much more expensive than before. In order to avoid huge
latencies the code contains a couple of cond_resched() calls.
The current code only splits page tables, but does not merge them if
it would be possible. The reason for this is that currently there is
no real life scenarion where this would really happen. All current use
cases that I know of only change access rights once during the life
time. If that should change we can still implement kernel page table
merging at a later time.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Make pmd_wrprotect() and pmd_mkwrite() available independently from
CONFIG_TRANSPARENT_HUGEPAGE and CONFIG_HUGETLB_PAGE so these can be
used on the kernel mapping.
Also introduce a couple of pud helper functions, namely pud_pfn(),
pud_wrprotect(), pud_mkwrite(), pud_mkdirty() and pud_mkclean()
which only work on the kernel mapping.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
_REGION3_ENTRY_RO is a duplicate of _REGION_ENTRY_PROTECT.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Instead of open-coded SEGMENT_KERNEL and REGION3_KERNEL assignments use
defines. Also to make e.g. pmd_wrprotect() work on the kernel mapping
a couple more flags must be set. Therefore add the missing flags also.
In order to make everything symmetrical this patch also adds software
dirty, young, read and write bits for region 3 table entries.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
We have already two inline assemblies which make use of the csp
instruction. Since I need a third instance let's introduce a generic
inline assmebly which can be used by everyone.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Without the storage-key facility, SIE won't interpret SSKE, ISKE and
RRBE for us. So let's add proper interception handlers that will be called
if lazy sske cannot be enabled.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
We already indicate that facility but don't implement it in our pfmf
interception handler. Let's add a new storage key handling function for
conditionally setting the guest storage key.
As we will reuse this function later on, let's directly implement returning
the old key via parameter and indicating if any change happened via rc.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Let's just split returning the key and reporting errors. This makes calling
code easier and avoids bugs as happened already.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Commit 1e133ab296 ("s390/mm: split arch/s390/mm/pgtable.c") changed
the return value of get_guest_storage_key to an unsigned char, resulting
in -EFAULT getting interpreted as a valid storage key.
Cc: stable@vger.kernel.org # 4.6+
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
I've just discovered that the useful-sounding has_transparent_hugepage()
is actually an architecture-dependent minefield: on some arches it only
builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when
not, but on some of those (arm and arm64) it then gives the wrong
answer; and on mips alone it's marked __init, which would crash if
called later (but so far it has not been called later).
Straighten this out: make it available to all configs, with a sensible
default in asm-generic/pgtable.h, removing its definitions from those
arches (arc, arm, arm64, sparc, tile) which are served by the default,
adding #define has_transparent_hugepage has_transparent_hugepage to
those (mips, powerpc, s390, x86) which need to override the default at
runtime, and removing the __init from mips (but maybe that kind of code
should be avoided after init: set a static variable the first time it's
called).
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Ning Qu <quning@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Vineet Gupta <vgupta@synopsys.com> [arch/arc]
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [arch/s390]
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pgtable.c file is quite big, before it grows any larger split it
into pgtable.c, pgalloc.c and gmap.c. In addition move the gmap related
header definitions into the new gmap.h header and all of the pgste
helpers from pgtable.h to pgtable.c.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The pmdp_xxx function are smaller than their ptep_xxx counterparts
but to keep things symmetrical unline them as well.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The code in the various ptep_xxx functions has grown quite large,
consolidate them to four out-of-line functions:
ptep_xchg_direct to exchange a pte with another with immediate flushing
ptep_xchg_lazy to exchange a pte with another in a batched update
ptep_modify_prot_start to begin a protection flags update
ptep_modify_prot_commit to commit a protection flags update
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The comment describing the bit encoding for segment table entries
is incorrect in regard to the read and write bits. The segment
read bit is 0x0002 and write is 0x0001, not the other way around.
Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The first parameter of pgste_update_all is a pointer to a pte.
Simplify the code by passing the pte value.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
With new refcounting we don't need to mark PMDs splitting. Let's drop
code to handle this.
pmdp_splitting_flush() is not needed too: on splitting PMD we will do
pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will do IPI as
needed for fast_gup.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The call to pgste_set_key in ptep_set_access_flags can be avoided
if the old pte is found to be valid at the time the new access
rights are set. The function that created the old, valid pte already
completed the required storage key operation.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use bit 2**1 of the pte and bit 2**14 of the pmd for the soft dirty
bit. The fault mechanism to do dirty tracking is already in place.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
We have confusing functions to clear pmd, pmd_clear_* and pmd_clear. Add
_huge_ to pmdp_clear functions so that we are clear that they operate on
hugepage pte.
We don't bother about other functions like pmdp_set_wrprotect,
pmdp_clear_flush_young, because they operate on PTE bits and hence
indicate they are operating on hugepage ptes
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Also move the pmd_trans_huge check to generic code.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull s390 updates from Martin Schwidefsky:
"Pretty boring for a merge window pull.
One change in behaviour is the patch for dasd driver, the module which
provides the diagnose discipline is now loaded automatically.
The SCLP code got a nice cleanup, a new global structure replaces a
bunch of accessor functions.
And a couple of random, small improvements"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/pci: improve handling of hotplug event 0x301
s390/setup: fix DMA_API_DEBUG warnings
s390/zcrypt: remove obsolete __constant
s390/keyboard: avoid off-by-one when using strnlen_user()
s390/sclp: pass timeout as HZ independent value
s390/mm: s/specifiation/specification/, s/an specification/a specification/
s390/sclp: Use DECLARE_BITMAP
s390/dasd: Enable automatic loading of dasd_diag_mod
s390/sclp: move sclp_facilities into "struct sclp"
s390/sclp: get rid of sclp_get_mtid() and sclp_get_mtid_max()
s390/sclp: unify basic sclp access by exposing "struct sclp"
s390/sclp: prepare smp_fill_possible_mask for global "struct sclp"
Git commit 152125b7a8
"s390/mm: implement dirty bits for large segment table entries"
broke the pmd_pfn function, it changed the return value from
'unsigned long' to 'int'. This breaks all machine configurations
with memory above the 8TB line.
Cc: stable@vger.kernel.org # 3.17+
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
After the file ptes have been removed the bit combination used to
encode non-linear mappings can be reused for the swap ptes. This
frees up a precious pte software bit. Reflect the change in the
swap encoding in the comments and do some cleanup while we are
at it.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Replacing a 2K page table with a 4K page table while a VMA is active
for the affected memory region is fundamentally broken. Rip out the
page table reallocation code and replace it with a simple system
control 'vm.allocate_pgste'. If the system control is set the page
tables for all processes are allocated as full 4K pages, even for
processes that do not need it.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Remove the 31 bit support in order to reduce maintenance cost and
effectively remove dead code. Since a couple of years there is no
distribution left that comes with a 31 bit kernel.
The 31 bit kernel also has been broken since more than a year before
anybody noticed. In addition I added a removal warning to the kernel
shown at ipl for 5 minutes: a960062e58 ("s390: add 31 bit warning
message") which let everybody know about the plan to remove 31 bit
code. We didn't get any response.
Given that the last 31 bit only machine was introduced in 1999 let's
remove the code.
Anybody with 31 bit user space code can still use the compat mode.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Core mm expects __PAGETABLE_{PUD,PMD}_FOLDED to be defined if these page
table levels folded. Usually, these defines are provided by
<asm-generic/pgtable-nopmd.h> and <asm-generic/pgtable-nopud.h>.
But some architectures fold page table levels in a custom way. They
need to define these macros themself. This patch adds missing defines.
The patch fixes mm->nr_pmds underflow and eliminates dead __pmd_alloc()
and __pud_alloc() on architectures without these page table levels.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aaro Koskinen <aaro.koskinen@iki.fi>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge second set of updates from Andrew Morton:
"More of MM"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits)
mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()
mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()
vmstat: Reduce time interval to stat update on idle cpu
mm/page_owner.c: remove unnecessary stack_trace field
Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files
mm: incorporate read-only pages into transparent huge pages
vmstat: do not use deferrable delayed work for vmstat_update
mm: more aggressive page stealing for UNMOVABLE allocations
mm: always steal split buddies in fallback allocations
mm: when stealing freepages, also take pages created by splitting buddy page
mincore: apply page table walker on do_mincore()
mm: /proc/pid/clear_refs: avoid split_huge_page()
mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)
mempolicy: apply page table walker on queue_pages_range()
arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma()
memcg: cleanup preparation for page table walk
numa_maps: remove numa_maps->vma
numa_maps: fix typo in gather_hugetbl_stats
pagemap: use walk->vma instead of calling find_vma()
clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk()
...
Pull s390 updates from Martin Schwidefsky:
- The remaining patches for the z13 machine support: kernel build
option for z13, the cache synonym avoidance, SMT support,
compare-and-delay for spinloops and the CES5S crypto adapater.
- The ftrace support for function tracing with the gcc hotpatch option.
This touches common code Makefiles, Steven is ok with the changes.
- The hypfs file system gets an extension to access diagnose 0x0c data
in user space for performance analysis for Linux running under z/VM.
- The iucv hvc console gets wildcard spport for the user id filtering.
- The cacheinfo code is converted to use the generic infrastructure.
- Cleanup and bug fixes.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
s390/process: free vx save area when releasing tasks
s390/hypfs: Eliminate hypfs interval
s390/hypfs: Add diagnose 0c support
s390/cacheinfo: don't use smp_processor_id() in preemptible context
s390/zcrypt: fixed domain scanning problem (again)
s390/smp: increase maximum value of NR_CPUS to 512
s390/jump label: use different nop instruction
s390/jump label: add sanity checks
s390/mm: correct missing space when reporting user process faults
s390/dasd: cleanup profiling
s390/dasd: add locking for global_profile access
s390/ftrace: hotpatch support for function tracing
ftrace: let notrace function attribute disable hotpatching if necessary
ftrace: allow architectures to specify ftrace compile options
s390: reintroduce diag 44 calls for cpu_relax()
s390/zcrypt: Add support for new crypto express (CEX5S) adapter.
s390/zcrypt: Number of supported ap domains is not retrievable.
s390/spinlock: add compare-and-delay to lock wait loops
s390/tape: remove redundant if statement
s390/hvc_iucv: add simple wildcard matches to the iucv allow filter
...
LKP has triggered a compiler warning after my recent patch "mm: account
pmd page tables to the process":
mm/mmap.c: In function 'exit_mmap':
>> mm/mmap.c:2857:2: warning: right shift count >= width of type [enabled by default]
The code:
> 2857 WARN_ON(mm_nr_pmds(mm) >
2858 round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
In this, on tile, we have FIRST_USER_ADDRESS defined as 0. round_up() has
the same type -- int. PUD_SHIFT.
I think the best way to fix it is to define FIRST_USER_ADDRESS as unsigned
long. On every arch for consistency.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Avoid cache aliasing on z13 by aligning shared objects to multiples
of 512K. The virtual addresses of a page from a shared file needs
to have identical bits in the range 2^12 to 2^18.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Analog to ptep_get_and_clear_full define a variant of the
pmpd_get_and_clear primitive which gets the full hint from the
mmu_gather struct. This allows s390 to avoid a costly instruction
when destroying an address space.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
If the function tracer is enabled, allow to set kprobes on the first
instruction of a function (which is the function trace caller):
If no kprobe is set handling of enabling and disabling function tracing
of a function simply patches the first instruction. Either it is a nop
(right now it's an unconditional branch, which skips the mcount block),
or it's a branch to the ftrace_caller() function.
If a kprobe is being placed on a function tracer calling instruction
we encode if we actually have a nop or branch in the remaining bytes
after the breakpoint instruction (illegal opcode).
This is possible, since the size of the instruction used for the nop
and branch is six bytes, while the size of the breakpoint is only
two bytes.
Therefore the first two bytes contain the illegal opcode and the last
four bytes contain either "0" for nop or "1" for branch. The kprobes
code will then execute/simulate the correct instruction.
Instruction patching for kprobes and function tracer is always done
with stop_machine(). Therefore we don't have any races where an
instruction is patched concurrently on a different cpu.
Besides that also the program check handler which executes the function
trace caller instruction won't be executed concurrently to any
stop_machine() execution.
This allows to keep full fault based kprobes handling which generates
correct pt_regs contents automatically.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
When storage keys are enabled unmerge already merged pages and prevent
new pages from being merged.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
As soon as storage keys are enabled we need to stop working on zero page
mappings to prevent inconsistencies between storage keys and pgste.
Otherwise following data corruption could happen:
1) guest enables storage key
2) guest sets storage key for not mapped page X
-> change goes to PGSTE
3) guest reads from page X
-> as X was not dirty before, the page will be zero page backed,
storage key from PGSTE for X will go to storage key for zero page
4) guest sets storage key for not mapped page Y (same logic as above
5) guest reads from page Y
-> as Y was not dirty before, the page will be zero page backed,
storage key from PGSTE for Y will got to storage key for zero page
overwriting storage key for X
While holding the mmap sem, we are safe against changes on entries we
already fixed, as every fault would need to take the mmap_sem (read).
Other vCPUs executing storage key instructions will get a one time interception
and be serialized also with mmap_sem.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Replace the s390 specific page table walker for the pgste updates
with a call to the common code walk_page_range function.
There are now two pte modification functions, one for the reset
of the CMMA state and another one for the initialization of the
storage keys.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Pull s390 updates from Martin Schwidefsky:
"This patch set contains the main portion of the changes for 3.18 in
regard to the s390 architecture. It is a bit bigger than usual,
mainly because of a new driver and the vector extension patches.
The interesting bits are:
- Quite a bit of work on the tracing front. Uprobes is enabled and
the ftrace code is reworked to get some of the lost performance
back if CONFIG_FTRACE is enabled.
- To improve boot time with CONFIG_DEBIG_PAGEALLOC, support for the
IPTE range facility is added.
- The rwlock code is re-factored to improve writer fairness and to be
able to use the interlocked-access instructions.
- The kernel part for the support of the vector extension is added.
- The device driver to access the CD/DVD on the HMC is added, this
will hopefully come in handy to improve the installation process.
- Add support for control-unit initiated reconfiguration.
- The crypto device driver is enhanced to enable the additional AP
domains and to allow the new crypto hardware to be used.
- Bug fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits)
s390/ftrace: simplify enabling/disabling of ftrace_graph_caller
s390/ftrace: remove 31 bit ftrace support
s390/kdump: add support for vector extension
s390/disassembler: add vector instructions
s390: add support for vector extension
s390/zcrypt: Toleration of new crypto hardware
s390/idle: consolidate idle functions and definitions
s390/nohz: use a per-cpu flag for arch_needs_cpu
s390/vtime: do not reset idle data on CPU hotplug
s390/dasd: add support for control unit initiated reconfiguration
s390/dasd: fix infinite loop during format
s390/mm: make use of ipte range facility
s390/setup: correct 4-level kernel page table detection
s390/topology: call set_sched_topology early
s390/uprobes: architecture backend for uprobes
s390/uprobes: common library for kprobes and uprobes
s390/rwlock: use the interlocked-access facility 1 instructions
s390/rwlock: improve writer fairness
s390/rwlock: remove interrupt-enabling rwlock variant.
s390/mm: remove change bit override support
...
Apart from the usual cleanups, here is the summary of new features:
- s390 moves closer towards host large page support
- PowerPC has improved support for debugging (both inside the guest and
via gdbstub) and support for e6500 processors
- ARM/ARM64 support read-only memory (which is necessary to put firmware
in emulated NOR flash)
- x86 has the usual emulator fixes and nested virtualization improvements
(including improved Windows support on Intel and Jailhouse hypervisor
support on AMD), adaptive PLE which helps overcommitting of huge guests.
Also included are some patches that make KVM more friendly to memory
hot-unplug, and fixes for rare caching bugs.
Two patches have trivial mm/ parts that were acked by Rik and Andrew.
Note: I will soon switch to a subkey for signing purposes. To verify
future signed pull requests from me, please update my key with
"gpg --recv-keys 9B4D86F2". You should see 3 new subkeys---the
one for signing will be a 2048-bit RSA key, 4E6B09D7.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJUL5sPAAoJEBvWZb6bTYbyfkEP/3MNhSyn6HCjPjtjLNPAl9KL
WpExZSUFL2+4CztpdGIsek1BeJYHmqv3+c5S+WvaWVA1aqh2R7FT1D1ErBLjgLQq
lq23IOr+XxmC3dXQUEEk+TlD+283UzypzEG4l4UD3JYg79fE3UrXAz82SeyewJDY
x7aPYhkZG3RHu+wAyMPasG6E3zS5LySdUtGWbiPwz5BejrhBJoJdeb2WIL/RwnUK
7ppSLB5EoFj/uMkuyeAAdAbdfSrhHA6faDZxNdxS9k9wGutrhhfUoQ49ONrKG4dV
sFo1tSPTVgRs8QFYUZ2fJUPBAmUVddsgqh2K9d0NftGTq7b8YszaCsfFrs2/Y4MU
YxssWEhxsfszerCu12bbAJrv6JBZYQ7TwGvI9L7P0iFU6IVw/djmukU4AkM9/e91
YS/cue/PN+9Pn2ccXzL9J7xRtZb8FsOuRsCXTCmbOwDkLmrKPDBN2t3RUbeF+Eam
ABrpWnLKX13kZSo4LKU+/niarzmPMp7odQfHVdr8ea0fiYLp4iN8puA20WaSPIgd
CLvm+RAvXe5Lm91L4mpFotJ2uFyK6QlIYJV4FsgeWv/0D0qppWQi0Utb/aCNHCgy
z8MyUMD48y7EpoQrFYr/7cddXIu0/NegnM8I1coVjIPEk4NfeebGUlCJ/V3D8wMG
BgEfS2x6jRc5zB3hjwDr
=iEVi
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"Fixes and features for 3.18.
Apart from the usual cleanups, here is the summary of new features:
- s390 moves closer towards host large page support
- PowerPC has improved support for debugging (both inside the guest
and via gdbstub) and support for e6500 processors
- ARM/ARM64 support read-only memory (which is necessary to put
firmware in emulated NOR flash)
- x86 has the usual emulator fixes and nested virtualization
improvements (including improved Windows support on Intel and
Jailhouse hypervisor support on AMD), adaptive PLE which helps
overcommitting of huge guests. Also included are some patches that
make KVM more friendly to memory hot-unplug, and fixes for rare
caching bugs.
Two patches have trivial mm/ parts that were acked by Rik and Andrew.
Note: I will soon switch to a subkey for signing purposes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (157 commits)
kvm: do not handle APIC access page if in-kernel irqchip is not in use
KVM: s390: count vcpu wakeups in stat.halt_wakeup
KVM: s390/facilities: allow TOD-CLOCK steering facility bit
KVM: PPC: BOOK3S: HV: CMA: Reserve cma region only in hypervisor mode
arm/arm64: KVM: Report correct FSC for unsupported fault types
arm/arm64: KVM: Fix VTTBR_BADDR_MASK and pgd alloc
kvm: Fix kvm_get_page_retry_io __gup retval check
arm/arm64: KVM: Fix set_clear_sgi_pend_reg offset
kvm: x86: Unpin and remove kvm_arch->apic_access_page
kvm: vmx: Implement set_apic_access_page_addr
kvm: x86: Add request bit to reload APIC access page address
kvm: Add arch specific mmu notifier for page invalidation
kvm: Rename make_all_cpus_request() to kvm_make_all_cpus_request() and make it non-static
kvm: Fix page ageing bugs
kvm/x86/mmu: Pass gfn and level to rmapp callback.
x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
kvm: x86: use macros to compute bank MSRs
KVM: x86: Remove debug assertion of non-PAE reserved bits
kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
kvm: Faults which trigger IO release the mmap_sem
...
Invalidate several pte entries at once if the ipte range facility
is available. Currently this works only for DEBUG_PAGE_ALLOC where
several up to 2 ^ MAX_ORDER may be invalidated at once.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
commit 0944fe3f4a ("s390/mm: implement software referenced bits")
triggered another paging/storage key corruption. There is an
unhandled invalid->valid pte change where we have to set the real
storage key from the pgste.
When doing paging a guest page might be swapcache or swap and when
faulted in it might be read-only and due to a parallel scan old.
An do_wp_page will make it writeable and young. Due to software
reference tracking this page was invalid and now becomes valid.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: stable@vger.kernel.org # v3.12+
Since 3.12 or more precisely commit 0944fe3f4a ("s390/mm:
implement software referenced bits") guest storage keys get
corrupted during paging. This commit added another valid->invalid
translation for page tables - namely ptep_test_and_clear_young.
We have to transfer the storage key into the pgste in that case.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: stable@vger.kernel.org # v3.12+
The radix tree rework removed all code that uses the gmap_rmap
and gmap_pgtable data structures. Remove these outdated definitions.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Add an addressing limit to the gmap address spaces and only allocate
the page table levels that are needed for the given limit. The limit
is fixed and can not be changed after a gmap has been created.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Store the target address for the gmap segments in a radix tree
instead of using invalid segment table entries. gmap_translate
becomes a simple radix_tree_lookup, gmap_fault is split into the
address translation with gmap_translate and the part that does
the linking of the gmap shadow page table with the process page
table.
A second radix tree is used to keep the pointers to the segment
table entries for segments that are mapped in the guest address
space. On unmap of a segment the pointer is retrieved from the
radix tree and is used to carry out the segment invalidation in
the gmap shadow page table. As the radix tree can only store one
pointer, each host segment may only be mapped to exactly one
guest location.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Make the order of arguments for the gmap calls more consistent,
if the gmap pointer is passed it is always the first argument.
In addition distinguish between guest address and user address
by naming the variables gaddr for a guest address and vmaddr for
a user address.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
The large segment table entry format has block of bits for the
ACC/F values for the large page. These bits are valid only if
another bit (AV bit 0x10000) of the segment table entry is set.
The ACC/F bits do not have a meaning if the AV bit is off.
This allows to put the THP splitting bit, the segment young bit
and the new segment dirty bit into the ACC/F bits as long as
the AV bit stays off. The dirty and young information is only
available if the pmd is large.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
For live migration kvm needs to test and clear the dirty bit of guest pages.
That for is ptep_test_and_clear_user_dirty, to be sure we are not racing with
other code, we protect the pte. This needs to be done within
the architecture memory management code.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Switch the user dirty bit detection used for migration from the hardware
provided host change-bit in the pgste to a fault based detection method.
This reduced the dependency of the host from the storage key to a point
where it becomes possible to enable the RCP bypass for KVM guests.
The fault based dirty detection will only indicate changes caused
by accesses via the guest address space. The hardware based method
can detect all changes, even those caused by I/O or accesses via the
kernel page table. The KVM/qemu code needs to take this into account.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Introduce a new function s390_enable_skey(), which enables storage key
handling via setting the use_skey flag in the mmu context.
This function is only useful within the context of kvm.
Note that enabling storage keys will cause a one-time hickup when
walking the page table; however, it saves us special effort for cases
like clear reset while making it possible for us to be architecture
conform.
s390_enable_skey() takes the page table lock to prevent reseting
storage keys triggered from multiple vcpus.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
For lazy storage key handling, we need a mechanism to track if the
process ever issued a storage key operation.
This patch adds the basic infrastructure for making the storage
key handling optional, but still leaves it enabled for now by default.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Pull second set of s390 patches from Martin Schwidefsky:
"The second part of Heikos uaccess rework, the page table walker for
uaccess is now a thing of the past (yay!)
The code change to fix the theoretical TLB flush problem allows us to
add a TLB flush optimization for zEC12, this machine has new
instructions that allow to do CPU local TLB flushes for single pages
and for all pages of a specific address space.
Plus the usual bug fixing and some more cleanup"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/uaccess: rework uaccess code - fix locking issues
s390/mm,tlb: optimize TLB flushing for zEC12
s390/mm,tlb: safeguard against speculative TLB creation
s390/irq: Use defines for external interruption codes
s390/irq: Add defines for external interruption codes
s390/sclp: add timeout for queued requests
kvm/s390: also set guest pages back to stable on kexec/kdump
lcs: Add missing destroy_timer_on_stack()
s390/tape: Add missing destroy_timer_on_stack()
s390/tape: Use del_timer_sync()
s390/3270: fix crash with multiple reset device requests
s390/bitops,atomic: add missing memory barriers
s390/zcrypt: add length check for aligned data to avoid overflow in msg-type 6
The zEC12 machines introduced the local-clearing control for the IDTE
and IPTE instruction. If the control is set only the TLB of the local
CPU is cleared of entries, either all entries of a single address space
for IDTE, or the entry for a single page-table entry for IPTE.
Without the local-clearing control the TLB flush is broadcasted to all
CPUs in the configuration, which is expensive.
The reset of the bit mask of the CPUs that need flushing after a
non-local IDTE is tricky. As TLB entries for an address space remain
in the TLB even if the address space is detached a new bit field is
required to keep track of attached CPUs vs. CPUs in the need of a
flush. After a non-local flush with IDTE the bit-field of attached CPUs
is copied to the bit-field of CPUs in need of a flush. The ordering
of operations on cpu_attach_mask, attach_count and mm_cpumask(mm) is
such that an underindication in mm_cpumask(mm) is prevented but an
overindication in mm_cpumask(mm) is possible.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Pull kvm updates from Paolo Bonzini:
"PPC and ARM do not have much going on this time. Most of the cool
stuff, instead, is in s390 and (after a few releases) x86.
ARM has some caching fixes and PPC has transactional memory support in
guests. MIPS has some fixes, with more probably coming in 3.16 as
QEMU will soon get support for MIPS KVM.
For x86 there are optimizations for debug registers, which trigger on
some Windows games, and other important fixes for Windows guests. We
now expose to the guest Broadwell instruction set extensions and also
Intel MPX. There's also a fix/workaround for OS X guests, nested
virtualization features (preemption timer), and a couple kvmclock
refinements.
For s390, the main news is asynchronous page faults, together with
improvements to IRQs (floating irqs and adapter irqs) that speed up
virtio devices"
* tag 'kvm-3.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (96 commits)
KVM: PPC: Book3S HV: Save/restore host PMU registers that are new in POWER8
KVM: PPC: Book3S HV: Fix decrementer timeouts with non-zero TB offset
KVM: PPC: Book3S HV: Don't use kvm_memslots() in real mode
KVM: PPC: Book3S HV: Return ENODEV error rather than EIO
KVM: PPC: Book3S: Trim top 4 bits of physical address in RTAS code
KVM: PPC: Book3S HV: Add get/set_one_reg for new TM state
KVM: PPC: Book3S HV: Add transactional memory support
KVM: Specify byte order for KVM_EXIT_MMIO
KVM: vmx: fix MPX detection
KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n
KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE
KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write
KVM: s390: clear local interrupts at cpu initial reset
KVM: s390: Fix possible memory leak in SIGP functions
KVM: s390: fix calculation of idle_mask array size
KVM: s390: randomize sca address
KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP
KVM: Bump KVM_MAX_IRQ_ROUTES for s390
KVM: s390: irq routing for adapter interrupts.
KVM: s390: adapter interrupt sources
...
This patch enables Collaborative Memory Management (CMM) for kvm
on s390. CMM allows the guest to inform the host about page usage
(see arch/s390/mm/cmm.c). The host uses this information to avoid
swapping in unused pages in the page fault handler. Further, a CPU
provided list of unused invalid pages is processed to reclaim swap
space of not yet accessed unused pages.
[ Martin Schwidefsky: patch reordering and cleanup ]
Signed-off-by: Konstantin Weitz <konstantin.weitz@gmail.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Git commit 050eef364a "[S390] fix tlb flushing vs. concurrent
/proc accesses" introduced the attach counter to avoid using the
mm_users value to decide between IPTE for every PTE and lazy TLB
flushing with IDTE. That fixed the problem with mm_users but it
introduced another subtle race, fortunately one that is very hard
to hit.
The background is the requirement of the architecture that a valid
PTE may not be changed while it can be used concurrently by another
cpu. The decision between IPTE and lazy TLB flushing needs to be
done while the PTE is still valid. Now if the virtual cpu is
temporarily stopped after the decision to use lazy TLB flushing but
before the invalid bit of the PTE has been set, another cpu can attach
the mm, find that flush_mm is set, do the IDTE, return to userspace,
and recreate a TLB that uses the PTE in question. When the first,
stopped cpu continues it will change the PTE while it is attached on
another cpu. The first cpu will do another IDTE shortly after the
modification of the PTE which makes the race window quite short.
To fix this race the CPU that wants to attach the address space of a
user space thread needs to wait for the end of the PTE modification.
The number of concurrent TLB flushers for an mm is tracked in the
upper 16 bits of the attach_count and finish_arch_post_lock_switch
is used to wait for the end of the flush operation if required.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
In the case of a fault, we will retry to exit sie64 but with gmap fault
indication for this thread set. This makes it possible to handle async
page faults.
Based on a patch from Martin Schwidefsky.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
For machines without enhanced supression on protection the software
dirty bit code forces the pte dirty bit and clears the page protection
bit in pgste_set_pte. This is done for all pte types, the check for
present ptes is missing. As a result swap ptes and other not-present
ptes can get corrupted.
Add a check for the _PAGE_PRESENT bit to pgste_set_pte before modifying
the pte value.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Pull KVM updates from Gleb Natapov:
"The highlights of the release are nested EPT and pv-ticketlocks
support (hypervisor part, guest part, which is most of the code, goes
through tip tree). Apart of that there are many fixes for all arches"
Fix up semantic conflicts as discussed in the pull request thread..
* 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (88 commits)
ARM: KVM: Add newlines to panic strings
ARM: KVM: Work around older compiler bug
ARM: KVM: Simplify tracepoint text
ARM: KVM: Fix kvm_set_pte assignment
ARM: KVM: vgic: Bump VGIC_NR_IRQS to 256
ARM: KVM: Bugfix: vgic_bytemap_get_reg per cpu regs
ARM: KVM: vgic: fix GICD_ICFGRn access
ARM: KVM: vgic: simplify vgic_get_target_reg
KVM: MMU: remove unused parameter
KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate()
KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls
KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX
KVM: x86: update masterclock when kvmclock_offset is calculated (v2)
KVM: PPC: Book3S: Fix compile error in XICS emulation
KVM: PPC: Book3S PR: return appropriate error when allocation fails
arch: powerpc: kvm: add signed type cast for comparation
KVM: x86: add comments where MMIO does not return to the emulator
KVM: vmx: count exits to userspace during invalid guest emulation
KVM: rename __kvm_io_bus_sort_cmp to kvm_io_bus_cmp
kvm: optimize away THP checks in kvm_is_mmio_pfn()
...
The last remaining use for the storage key of the s390 architecture
is reference counting. The alternative is to make page table entries
invalid while they are old. On access the fault handler marks the
pte/pmd as young which makes the pte/pmd valid if the access rights
allow read access. The pte/pmd invalidations required for software
managed reference bits cost a bit of performance, on the other hand
the RRBE/RRBM instructions to read and reset the referenced bits are
quite expensive as well.
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
For a single-threaded KVM guest ptep_modify_prot_start will not use
IPTE, the invalid bit will therefore not be set. If DEBUG_VM is set
pgste_set_key called by ptep_modify_prot_commit will complain about
the missing invalid bit. ptep_modify_prot_start should set the
invalid bit in all cases.
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Isolate the logic of IDTE vs. IPTE flushing of ptes in two functions,
ptep_flush_lazy and __tlb_flush_mm_lazy.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
ptep_modify_prot_start uses the pgste_set helper to store the pgste,
while ptep_modify_prot_commit uses its own pointer magic to retrieve
the value again. Add the pgste_get help function to keep things
symmetrical and improve readability.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On process exit there is no more need for the pgste information.
Skip the expensive storage key operations which should speed up
termination of KVM processes.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Improve the encoding of the different pte types and the naming of the
page, segment table and region table bits. Due to the different pte
encoding the hugetlbfs primitives need to be adapted as well. To improve
compatability with common code make the huge ptes use the encoding of
normal ptes. The conversion between the pte and pmd encoding for a huge
pte is done with set_huge_pte_at and huge_ptep_get.
Overall the code is now easier to understand.
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Improve the code to upgrade the standard 2K page tables to 4K page tables
with PGSTEs to allow the operation to happen when the program is already
multi-threaded.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull powerpc updates from Ben Herrenschmidt:
"This is the powerpc changes for the 3.11 merge window. In addition to
the usual bug fixes and small updates, the main highlights are:
- Support for transparent huge pages by Aneesh Kumar for 64-bit
server processors. This allows the use of 16M pages as transparent
huge pages on kernels compiled with a 64K base page size.
- Base VFIO support for KVM on power by Alexey Kardashevskiy
- Wiring up of our nvram to the pstore infrastructure, including
putting compressed oopses in there by Aruna Balakrishnaiah
- Move, rework and improve our "EEH" (basically PCI error handling
and recovery) infrastructure. It is no longer specific to pseries
but is now usable by the new "powernv" platform as well (no
hypervisor) by Gavin Shan.
- I fixed some bugs in our math-emu instruction decoding and made it
usable to emulate some optional FP instructions on processors with
hard FP that lack them (such as fsqrt on Freescale embedded
processors).
- Support for Power8 "Event Based Branch" facility by Michael
Ellerman. This facility allows what is basically "userspace
interrupts" for performance monitor events.
- A bunch of Transactional Memory vs. Signals bug fixes and HW
breakpoint/watchpoint fixes by Michael Neuling.
And more ... I appologize in advance if I've failed to highlight
something that somebody deemed worth it."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
pstore: Add hsize argument in write_buf call of pstore_ftrace_call
powerpc/fsl: add MPIC timer wakeup support
powerpc/mpic: create mpic subsystem object
powerpc/mpic: add global timer support
powerpc/mpic: add irq_set_wake support
powerpc/85xx: enable coreint for all the 64bit boards
powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx
powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig
powerpc/mpic: Add get_version API both for internal and external use
powerpc: Handle both new style and old style reserve maps
powerpc/hw_brk: Fix off by one error when validating DAWR region end
powerpc/pseries: Support compression of oops text via pstore
powerpc/pseries: Re-organise the oops compression code
pstore: Pass header size in the pstore write callback
powerpc/powernv: Fix iommu initialization again
powerpc/pseries: Inform the hypervisor we are using EBB regs
powerpc/perf: Add power8 EBB support
powerpc/perf: Core EBB support for 64-bit book3s
powerpc/perf: Drop MMCRA from thread_struct
powerpc/perf: Don't enable if we have zero events
...
On the x86 side, there are some optimizations and documentation updates.
The big ARM/KVM change for 3.11, support for AArch64, will come through
Catalin Marinas's tree. s390 and PPC have misc cleanups and bugfixes.
There is a conflict due to "s390/pgtable: fix ipte notify bit" having
entered 3.10 through Martin Schwidefsky's s390 tree. This pull request
has additional changes on top, so this tree's version is the correct one.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJR0oU6AAoJEBvWZb6bTYbynnsP/RSUrrHrA8Wu1tqVfAKu+1y5
6OIihqZ9x11/YMaNofAfv86jqxFu0/j7CzMGphNdjzujqKI+Q1tGe7oiVCmKzoG+
UvSctWsz0lpllgBtnnrm5tcfmG6rrddhLtpA7m320+xCVx8KV5P4VfyHZEU+Ho8h
ziPmb2mAQ65gBNX6nLHEJ3ITTgad6gt4NNbrKIYpyXuWZQJypzaRqT/vpc4md+Ed
dCebMXsL1xgyb98EcnOdrWH1wV30MfucR7IpObOhXnnMKeeltqAQPvaOlKzZh4dK
+QfxJfdRZVS0cepcxzx1Q2X3dgjoKQsHq1nlIyz3qu1vhtfaqBlixLZk0SguZ/R9
1S1YqucZiLRO57RD4q0Ak5oxwobu18ZoqJZ6nledNdWwDe8bz/W2wGAeVty19ky0
qstBdM9jnwXrc0qrVgZp3+s5dsx3NAm/KKZBoq4sXiDLd/yBzdEdWIVkIrU3X9wU
3X26wOmBxtsB7so/JR7ciTsQHelmLicnVeXohAEP9CjIJffB81xVXnXs0P0SYuiQ
RzbSCwjPzET4JBOaHWT0Dhv0DTS/EaI97KzlN32US3Bn3WiLlS1oDCoPFoaLqd2K
LxQMsXS8anAWxFvexfSuUpbJGPnKSidSQoQmJeMGBa9QhmZCht3IL16/Fb641ToN
xBohzi49L9FDbpOnTYfz
=1zpG
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"On the x86 side, there are some optimizations and documentation
updates. The big ARM/KVM change for 3.11, support for AArch64, will
come through Catalin Marinas's tree. s390 and PPC have misc cleanups
and bugfixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (87 commits)
KVM: PPC: Ignore PIR writes
KVM: PPC: Book3S PR: Invalidate SLB entries properly
KVM: PPC: Book3S PR: Allow guest to use 1TB segments
KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match
KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry
KVM: PPC: Book3S PR: Fix proto-VSID calculations
KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL
KVM: Fix RTC interrupt coalescing tracking
kvm: Add a tracepoint write_tsc_offset
KVM: MMU: Inform users of mmio generation wraparound
KVM: MMU: document fast invalidate all mmio sptes
KVM: MMU: document fast invalidate all pages
KVM: MMU: document fast page fault
KVM: MMU: document mmio page fault
KVM: MMU: document write_flooding_count
KVM: MMU: document clear_spte_count
KVM: MMU: drop kvm_mmu_zap_mmio_sptes
KVM: MMU: init kvm generation close to mmio wrap-around value
KVM: MMU: add tracepoint for check_mmio_spte
KVM: MMU: fast invalidate all mmio sptes
...
This will be later used by powerpc THP support. In powerpc we want to use
pgtable for storing the hash index values. So instead of adding them to
mm_context list, we would like to store them in the second half of pmd
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Getting and Releasing the pgste lock has lock semantics. Make the
code an explicit barrier.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
In modify_prot_start we update the pgste value but never store it back
into the original location. Lets save the calculated result, since
modify_prot_commit will use the value of the pgste.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
When doing the transition invalid->valid in the host page table for
a guest, then the guest view of C/R is in the pgste. After validation
the view is pgste OR real key. We must zero out the real key C/R to
avoid guest over-indication for change (and reference).
Touching the real key is ok also for the host: The change bit is
tracked via write protection and the reference bit is also ok
because set_pte_at was called and the page will be touched anyway
soon. Furthermore architecture defines reference as "substantially
accurate", over- and underindication are ok.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
pte_present might return true on PAGE_TYPE_NONE, even if
the invalid bit is on. Modify the existing check of the
pgste functions to avoid crashes.
[ Martin Schwidefsky: added ptep_modify_prot_[start|commit] bits ]
Reported-by: Martin Schwidefky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
CC: stable@vger.kernel.org
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The guest prefix pages must be mapped writeable all the time
while SIE is running, otherwise the guest might see random
behaviour. (pinned at the pte level) Turns out that mlocking is
not enough, the page table entry (not the page) might change or
become r/o. This patch uses the gmap notifiers to kick guest
cpus out of SIE.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
The RCP byte is a part of the PGSTE value, the existing RCP_xxx names
are inaccurate. As the defines describe bits and pieces of the PGSTE,
the names should start with PGSTE_. The KVM_UR_BIT and KVM_UC_BIT are
part of the PGSTE as well, give them better names as well.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Dont use the same bit as user referenced.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Add a notifier for kvm to get control before a page table entry is
invalidated. The notifier is only called for ptes of an address space
with pgstes that have been explicitly marked to require notification.
Kvm will use this to get control before prefix pages of virtual CPU
are unmapped.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Commit abf09bed3c ("s390/mm: implement software dirty bits")
introduced another difference in the pte layout vs. the pmd layout on
s390, thoroughly breaking the s390 support for hugetlbfs. This requires
replacing some more pte_xxx functions in mm/hugetlbfs.c with a
huge_pte_xxx version.
This patch introduces those huge_pte_xxx functions and their generic
implementation in asm-generic/hugetlb.h, which will now be included on
all architectures supporting hugetlbfs apart from s390. This change
will be a no-op for those architectures.
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz> [for !s390 parts]
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull s390 update from Martin Schwidefsky:
"This is the first batch of s390 patches for the 3.10 merge window.
Included are some performance enhancements: storage key
initialization, zero page cache synonyms, system call micro
optimization and the speedup patches for dasdfmt. Sebastian managed
to get rid of the special casing for the console device in the cio
layer. And the usual bunch of bug fixes."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (59 commits)
s390/pci: use pci_scan_root_bus
s390/scm_blk: fix memleak in init function
s390/scm_blk: allow more cluster size values
s390/cio: fix irq statistics
s390/memory hotplug: prevent offline of active memory increments
s390: remove small stack config option
s390: system call path micro optimization
s390: lowcore stack pointer offsets
s390/uapi: change struct statfs[64] member types to unsigned values
s390/pci: return correct dma address for offset > PAGE_SIZE
s390/ptrace: remove empty ifdefs
s390/compat: remove ptrace compat definitions from uapi header file
s390/compat: fix compile error for !COMPAT
s390/compat: fix compat_sys_statfs() memory corruption
s390/zcore: Fix HSA copy length for last block
s390/mm,gmap: segment mapping race
s390/mm,gmap: implement gmap_translate()
s390/pci: remove disable_device implementation
s390/pci: disable per default
s390/pci: return error after failed pci ops
...
Implement gmap_translate() function which translates a guest absolute address
to a user space process address without establishing the guest page table
entries.
This is useful for kvm guest address translations where no memory access
is expected to happen soon (e.g. tprot exception handler).
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Commit b4cbb197c7 ("vm: add vm_iomap_memory() helper function") added
a helper function wrapper around io_remap_pfn_range(), and every other
architecture defined it in <asm/pgtable.h>.
The s390 choice of <asm/io.h> may make sense, but is not very convenient
for this case, and gratuitous differences like that cause unexpected errors like this:
mm/memory.c: In function 'vm_iomap_memory':
mm/memory.c:2439:2: error: implicit declaration of function 'io_remap_pfn_range' [-Werror=implicit-function-declaration]
Glory be the kbuild test robot who noticed this, bisected it, and
reported it to the guilty parties (ie me).
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All architectures need to provide a check_pgt_cache() function. The s390 one
got lost somewhere.
So reintroduce it to prevent future compile errors e.g. if Thomas Gleixner's
idle loop rework patches get merged.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
When translating user space addresses to kernel addresses the follow_table()
function had two bugs:
- PROT_NONE mappings could be read accessed via the kernel mapping. That is
e.g. putting a filename into a user page, then protecting the page with
PROT_NONE and afterwards issuing the "open" syscall with a pointer to
the filename would incorrectly succeed.
- when walking the page tables it used the pgd/pud/pmd/pte primitives which
with dynamic page tables give no indication which real level of page tables
is being walked (region2, region3, segment or page table). So in case of an
exception the translation exception code passed to __handle_fault() is not
necessarily correct.
This is not really an issue since __handle_fault() doesn't evaluate the code.
Only in case of e.g. a SIGBUS this code gets passed to user space. If user
space can do something sane with the value is a different question though.
To fix these issues don't use any Linux primitives. Only walk the page tables
like the hardware would do it, however we leave quite some checks away since
we know that we only have full size page tables and each index is within bounds.
In theory this should fix all issues...
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The s390 architecture is unique in respect to dirty page detection,
it uses the change bit in the per-page storage key to track page
modifications. All other architectures track dirty bits by means
of page table entries. This property of s390 has caused numerous
problems in the past, e.g. see git commit ef5d437f71
"mm: fix XFS oops due to dirty pages without buffers on s390".
To avoid future issues in regard to per-page dirty bits convert
s390 to a fault based software dirty bit detection mechanism. All
user page table entries which are marked as clean will be hardware
read-only, even if the pte is supposed to be writable. A write by
the user process will trigger a protection fault which will cause
the user pte to be marked as dirty and the hardware read-only bit
is removed.
With this change the dirty bit in the storage key is irrelevant
for Linux as a host, but the storage key is still required for
KVM guests. The effect is that page_test_and_clear_dirty and the
related code can be removed. The referenced bit in the storage
key is still used by the page_test_and_clear_young primitive to
provide page age information.
For page cache pages of mappings with mapping_cap_account_dirty
there will not be any change in behavior as the dirty bit tracking
already uses read-only ptes to control the amount of dirty pages.
Only for swap cache pages and pages of mappings without
mapping_cap_account_dirty there can be additional protection faults.
To avoid an excessive number of additional faults the mk_pte
primitive checks for PageDirty if the pgprot value allows for writes
and pre-dirties the pte. That avoids all additional faults for
tmpfs and shmem pages until these pages are added to the swap cache.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Only needed to make some drivers compile...
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On s390, an architecture-specific implementation of the function
pmdp_set_wrprotect() is missing and the generic version is currently
being used. The generic version does not flush the tlb as it would be
needed on s390 when modifying an active pmd, which can lead to subtle
tlb errors on s390 when using transparent hugepages.
This patch adds an s390-specific implementation of pmdp_set_wrprotect()
including the missing tlb flush.
Cc: stable@vger.kernel.org
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The pfn calculation in pmd_pfn() is broken for thp, because it uses
HPAGE_SHIFT instead of the normal PAGE_SHIFT. This is fixed by removing
the distinction between thp and normal pmds in that function, and always
using PAGE_SHIFT.
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Pull s390 update from Martin Schwidefsky:
"Add support to generate code for the latest machine zEC12, MOD and XOR
instruction support for the BPF jit compiler, the dasd safe offline
feature and the big one: the s390 architecture gets PCI support!!
Right before the world ends on the 21st ;-)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (41 commits)
s390/qdio: rename the misleading PCI flag of qdio devices
s390/pci: remove obsolete email addresses
s390/pci: speed up __iowrite64_copy by using pci store block insn
s390/pci: enable NEED_DMA_MAP_STATE
s390/pci: no msleep in potential IRQ context
s390/pci: fix potential NULL pointer dereference in dma_free_seg_table()
s390/pci: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
s390/bpf,jit: add support for XOR instruction
s390/bpf,jit: add support MOD instruction
s390/cio: fix pgid reserved check
vga: compile fix, disable vga for s390
s390/pci: add PCI Kconfig options
s390/pci: s390 specific PCI sysfs attributes
s390/pci: PCI hotplug support via SCLP
s390/pci: CHSC PCI support for error and availability events
s390/pci: DMA support
s390/pci: PCI adapter interrupts for MSI/MSI-X
s390/bitops: find leftmost bit instruction support
s390/pci: CLP interface
s390/pci: base support
...
We have two different implementation of is_zero_pfn() and my_zero_pfn()
helpers: for architectures with and without zero page coloring.
Let's consolidate them in <asm-generic/pgtable.h>.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just convert fault_init() to an early initcall. That's still early
enough since it only needs be called before user space processes get
executed. No reason to externalize it.
Also add the function to the init section and move the store_indication
variable to the read_mostly section.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use 2GB frames for indentity mapping if EDAT2 is
available to reduce TLB pressure.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Similar to pte_none() and pte_present(), the pmd functions should also
respect page protection of huge pages, especially PROT_NONE.
This patch also simplifies massage_pgprot_pmd() by adding new definitions
for huge page protection.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add a special module area on top of the vmalloc area, which may be only
used for modules and bpf jit generated code.
This makes sure that inter module branches will always happen without a
trampoline and in addition having all the code within a 2GB frame is
branch prediction unit friendly.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
pmd_huge() will always return 0 on !HUGETLBFS, however we use that helper
function when walking the kernel page tables to decide if we have a
1MB page frame or not.
Since we create 1MB frames for the kernel 1:1 mapping independently of
HUGETLBFS this can lead to incorrect storage accesses since the code
can assume that we have a pointer to a page table instead of a pointer
to a 1MB frame.
Fix this by adding a pmd_large() primitive like other architectures have
it already and remove all references to HUGETLBFS/HUGETLBPAGE from the
code that walks kernel page tables.
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The transparent huge page code passes a PMD pointer in as the third
argument of update_mmu_cache(), which expects a PTE pointer.
This never got noticed because X86 implements update_mmu_cache() as a
macro and thus we don't get any type checking, and X86 is the only
architecture which supports transparent huge pages currently.
Before other architectures can support transparent huge pages properly we
need to add a new interface which will take a PMD pointer as the third
argument rather than a PTE pointer.
[akpm@linux-foundation.org: implement update_mm_cache_pmd() for s390]
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is part of the architecture backend for thp on s390. It
provides the pagetable pre-allocation functions
pgtable_trans_huge_deposit() and pgtable_trans_huge_withdraw(). Unlike
other archs, s390 has no struct page * as pgtable_t, but rather a pointer
to the page table. So instead of saving the pagetable pre- allocation
list info inside the struct page, it is being saved within the pagetable
itself.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is part of the architecture backend for thp on s390. It
provides the functions related to thp splitting, including serialization
against gup. Unlike other archs, pmdp_splitting_flush() cannot use a tlb
flushing operation to serialize against gup on s390, because that wouldn't
be stopped by the disabled IRQs. So instead, smp_call_function() is
called with an empty function, which will have the expected effect.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the file name from the comment at top of many files. In most
cases the file name was wrong anyway, so it's rather pointless.
Also unify the IBM copyright statement. We did have a lot of sightly
different statements and wanted to change them one after another
whenever a file gets touched. However that never happened. Instead
people start to take the old/"wrong" statements to use as a template
for new files.
So unify all of them in one go.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Replace __s390x__ with CONFIG_64BIT in all places that are not exported
to userspace or guarded with #ifdef __KERNEL__.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The kernel address space of a 64 bit kernel currently uses a three level
page table and the vmemmap array has a fixed address and a fixed maximum
size. A three level page table is good enough for systems with less than
3.8TB of memory, for bigger systems four page table levels need to be
used. Each page table level costs a bit of performance, use 3 levels for
normal systems and 4 levels only for the really big systems.
To avoid bloating sparse.o too much set MAX_PHYSMEM_BITS to 46 for a
maximum of 64TB of memory.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This patch makes sure we don't underindicate _PAGE_CHANGED in case
we have a race between an operation that changes the page and this
code path that hits us between page_get_storage_key and
page_set_storage_key. Note that we still have a potential
underindication on _PAGE_REFERENCED in the unlikely event that
the page was changed but not referenced _and_ someone references
the page in the race window. That's not considered to be a problem.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The pgste_update_all / pgste_update_young and pgste_set_pte need to
check if the pte entry contains a valid page address before the storage
key can be accessed. In addition pgste_set_pte needs to set the access
key and fetch protection bit of the new pte entry, not the old entry.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Linux on System z uses a ballooner based on diagnose 0x10. (aka as
collaborative memory management). This patch implements diagnose
0x10 on the guest address space.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
gmap_fault needs to walk the guest page table. However, parts of
that may change if some other thread does munmap. In that case
gmap_unmap_notifier will also unmap the corresponding parts from
the guest page table. We need to take mmap_sem in order to serialize
these operations.
do_exception now calls __gmap_fault with mmap_sem held which does
not get exported to modules. The exported function, which is called
from KVM, now takes mmap_sem.
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
598841ca99 ([S390] use gmap address
spaces for kvm guest images) changed kvm to use a separate address
space for kvm guests. This address space was switched in __vcpu_run
In some cases (preemption, page fault) there is the possibility that
this address space switch is lost.
The typical symptom was a huge amount of validity intercepts or
random guest addressing exceptions.
Fix this by doing the switch in sie_loop and sie_exit and saving the
address space in the gmap structure itself. Also use the preempt
notifier.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Add code that allows KVM to control the virtual memory layout that
is seen by a guest. The guest address space uses a second page table
that shares the last level pte-tables with the process page table.
If a page is unmapped from the process page table it is automatically
unmapped from the guest page table as well.
The guest address space mapping starts out empty, KVM can map any
individual 1MB segments from the process virtual memory to any 1MB
aligned location in the guest virtual memory. If a target segment in
the process virtual memory does not exist or is unmapped while a
guest mapping exists the desired target address is stored as an
invalid segment table entry in the guest page table.
The population of the guest page table is fault driven.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>