2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

590 Commits

Author SHA1 Message Date
Ingo Molnar
090d717164 x86/boot/e820: Rename e820_mark_nosave_regions() to e820__register_nosave_regions()
This function is a minor misnomer: it is talking about 'marking' regions
as nosave - while the hibernation API is called register_nosave_region()
and the e820_mark_nosave_regions() is a wrapper around that functionality.

So name it to be in line with the API it is derived from.

( Rename e820_mark_nvs_memory() to e820__register_nvs_regions(), for similar
  reasons. )

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 22:55:26 +01:00
Ingo Molnar
1506c8dc94 x86/boot/e820: Rename e820_reserve_resources*() to e820__reserve_resources*()
Also do some minor cleanups.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 22:55:25 +01:00
Ingo Molnar
1a1270349a x86/boot/e820: Document e820__reserve_setup_data()
Also clean it up a bit.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 22:55:25 +01:00
Ingo Molnar
f9748fa045 x86/boot/e820: Simplify the e820__update_table() interface
The e820__update_table() parameters are pretty complex:

  arch/x86/include/asm/e820/api.h:extern int  e820__update_table(struct e820_entry *biosmap, int max_nr_map, u32 *pnr_map);

But 90% of the usage is trivial:

  arch/x86/kernel/e820.c:	if (e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries))
  arch/x86/kernel/e820.c:	e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
  arch/x86/kernel/e820.c:	e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
  arch/x86/kernel/e820.c:		if (e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries) < 0)
  arch/x86/kernel/e820.c:	e820__update_table(boot_params.e820_table, ARRAY_SIZE(boot_params.e820_table), &new_nr);
  arch/x86/kernel/early-quirks.c:	e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
  arch/x86/kernel/setup.c:	e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
  arch/x86/kernel/setup.c:		e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
  arch/x86/platform/efi/efi.c:	e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
  arch/x86/xen/setup.c:	e820__update_table(xen_e820_table.entries, ARRAY_SIZE(xen_e820_table.entries),
  arch/x86/xen/setup.c:	e820__update_table(e820_table->entries, ARRAY_SIZE(e820_table->entries), &e820_table->nr_entries);
  arch/x86/xen/setup.c:	e820__update_table(xen_e820_table.entries, ARRAY_SIZE(xen_e820_table.entries),

as it only uses an exiting struct e820_table's entries array, its size and
its current number of entries as input and output arguments.

Only one use is non-trivial:

  arch/x86/kernel/e820.c:	e820__update_table(boot_params.e820_table, ARRAY_SIZE(boot_params.e820_table), &new_nr);

... which call updates the E820 table in the zeropage in-situ, and the layout there does not
match that of 'struct e820_table' (in particular nr_entries is at a different offset,
hardcoded by the boot protocol).

Simplify all this by introducing a low level __e820__update_table() API that
the zeropage update call can use, and simplifying the main e820__update_table()
call signature down to:

	int e820__update_table(struct e820_table *table);

This visibly simplifies all the call sites:

  arch/x86/include/asm/e820/api.h:extern int  e820__update_table(struct e820_table *table);
  arch/x86/include/asm/e820/types.h: * call to e820__update_table() to remove duplicates.  The allowance
  arch/x86/kernel/e820.c: * The return value from e820__update_table() is zero if it
  arch/x86/kernel/e820.c:int __init e820__update_table(struct e820_table *table)
  arch/x86/kernel/e820.c:	if (e820__update_table(e820_table))
  arch/x86/kernel/e820.c:	e820__update_table(e820_table_firmware);
  arch/x86/kernel/e820.c:	e820__update_table(e820_table);
  arch/x86/kernel/e820.c:	e820__update_table(e820_table);
  arch/x86/kernel/e820.c:		if (e820__update_table(e820_table) < 0)
  arch/x86/kernel/early-quirks.c:	e820__update_table(e820_table);
  arch/x86/kernel/setup.c:	e820__update_table(e820_table);
  arch/x86/kernel/setup.c:		e820__update_table(e820_table);
  arch/x86/platform/efi/efi.c:	e820__update_table(e820_table);
  arch/x86/xen/setup.c:	e820__update_table(&xen_e820_table);
  arch/x86/xen/setup.c:	e820__update_table(e820_table);
  arch/x86/xen/setup.c:	e820__update_table(&xen_e820_table);

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 22:55:24 +01:00
Ingo Molnar
09821ff1d5 x86/boot/e820: Prefix the E820_* type names with "E820_TYPE_"
So there's a number of constants that start with "E820" but which
are not types - these create a confusing mixture when seen together
with 'enum e820_type' values:

	E820MAP
	E820NR
	E820_X_MAX
	E820MAX

To better differentiate the 'enum e820_type' values prefix them
with E820_TYPE_.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 22:55:22 +01:00
Ingo Molnar
be0c3f0fca x86/boot/e820: Rename e820_print_map() to e820__print_table()
All other table-level methods are already named 'table' in some way,
to change this one over to the (now consistent) nomenclature.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 14:42:32 +01:00
Ingo Molnar
ab6bc04cfd x86/boot/e820: Create coherent API function names for E820 range operations
We have these three related functions:

 extern void e820_add_region(u64 start, u64 size, int type);
 extern u64  e820_update_range(u64 start, u64 size, unsigned old_type, unsigned new_type);
 extern u64  e820_remove_range(u64 start, u64 size, unsigned old_type, int checktype);

But it's not clear from the naming that they are 3 operations based around the
same 'memory range' concept. Rename them to better signal this, and move
the prototypes next to each other:

 extern void e820__range_add   (u64 start, u64 size, int type);
 extern u64  e820__range_update(u64 start, u64 size, unsigned old_type, unsigned new_type);
 extern u64  e820__range_remove(u64 start, u64 size, unsigned old_type, int checktype);

Note that this improved organization of the functions shows another problem that was easy
to miss before: sometimes the E820 entry type is 'int', sometimes 'unsigned int' - but this
will be fixed in a separate patch.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 14:42:32 +01:00
Ingo Molnar
2df908baf5 x86/boot/e820: Rename e820_setup_gap() to e820__setup_pci_gap()
The e820_setup_gap() function name is unnecessarily silent about what
kind of gap it sets up. Make it clear that it's about the PCI gap.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 14:42:31 +01:00
Ingo Molnar
3bce64f019 x86/boot/e820: Rename e820_any_mapped()/e820_all_mapped() to e820__mapped_any()/e820__mapped_all()
The 'any' and 'all' are modified to the 'mapped' concept, so move them last in the name.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 14:42:31 +01:00
Ingo Molnar
f52355a99f x86/boot/e820: Rename sanitize_e820_table() to e820__update_table()
sanitize_e820_table() is a minor misnomer in that it suggests that
the E820 table requires sanitizing - which implies that it will only
do anything if the E820 table is irregular (not sane).

That is wrong, because sanitize_e820_table() also does a very regular
sorting of the E820 table, which is a necessity in the basic
append-only flow of E820 updates the kernel is allowed to perform to
it.

So rename it to e820__update_table() to include that purpose as well.

This also lines up all the table-update functions into a coherent
naming family:

  int  e820__update_table(struct e820_entry *biosmap, int max_nr_map, u32 *pnr_map);

  void e820__update_table_print(void);
  void e820__update_table_firmware(void);

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 14:42:31 +01:00
Ingo Molnar
5da217ca96 x86/boot/e820: Rename early_reserve_e820() to e820__memblock_alloc() and document it
early_reserve_e820() is an early hack for kexec that does a limited fixup of the
mptable and passes it to the kexec kernel as if it was the real thing.

For this it needs to allocate memory - but no memory allocator is available yet
beyond the memblock allocator, so early_reserve_e820() is really a wrapper
around memblock_alloc() plus a hack to update the e820_table_firmware entries.

The name 'reserve' is really a bit of a misnomer, as 'reserved' memory typically
means memory completely inaccessible to the kernel - while here what we want to do
is a special RAM allocation for our own purposes and insert that as RAM_RESERVED.

Rename the function to e820__memblock_alloc_reserved() to better signal this dual
purpose, plus document it better, which was omitted when it was merged. The barely
comprehensible and cryptic comment:

  /*
   * pre allocated 4k and reserved it in memblock and e820_table_firmware
   */
  u64 __init e820__memblock_alloc_reserved(u64 size, u64 align)

... does not count as documentation, replace it with:

  /*
   * Allocate the requested number of bytes with the requsted alignment
   * and return (the physical address) to the caller. Also register this
   * range in the 'firmware' E820 table.
   *
   * This allows kexec to fake a new mptable, as if it came from the real
   * system.
   */
  u64 __init e820__memblock_alloc_reserved(u64 size, u64 align)

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 14:42:30 +01:00
Ingo Molnar
9641bdafd8 x86/boot/e820: Clarify the role of finish_e820_parsing() and rename it to e820__finish_early_params()
finish_e820_parsing() is closely related to parse_early_params(), but the
name does not tell us this clearly, so rename it to e820__finish_early_params().

Also add a few comments to explain what the function does.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 14:42:29 +01:00
Ingo Molnar
da92139bff x86/boot/e820: Move e820_reserve_setup_data() to e820.c
The e820_reserve_setup_data() is local to arch/x86/kernel/setup.c,
but it is E820 functionality - so move it to e820.c to better
isolate E820 functionality.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 14:42:29 +01:00
Ingo Molnar
914053c08e x86/boot/e820: Rename parse_e820_ext() to e820__memory_setup_extended()
parse_e820_ext() is very similar to e820__memory_setup_default(), both are
taking bootloader provided data, add it to the E820 table and then
pass it sanitize_e820_table().

Rename it to e820__memory_setup_extended() to better signal their similar role.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 14:42:29 +01:00
Ingo Molnar
4918e2286d x86/boot/e820: Rename memblock_x86_fill() to e820__memblock_setup() and improve the explanations
So memblock_x86_fill() is another E820 code misnomer:

 - nothing in its name tells us that it's part of the E820 subsystem ...

 - The 'fill' wording is ambiguous and doesn't tell us whether it's a single
   entry or some process - while the _real_ purpose of the function is hidden,
   which is to do a complete setup of the (platform independent) memblock regions.

So rename it accordingly, to e820__memblock_setup().

Also translate this incomprehensible and misleading comment:

        /*
	 * EFI may have more than 128 entries
	 * We are safe to enable resizing, beause memblock_x86_fill()
	 * is rather later for x86
	 */
        memblock_allow_resize();

The worst aspect of this comment isn't even the sloppy typos, but that it
casually mentions a '128' number with no explanation, which makes one lead
to the assumption that this is related to the well-known limit of a maximum
of 128 E820 entries passed via legacy bootloaders.

But no, the _real_ meaning of 128 here is that of the memblock subsystem,
which too happens to have a 128 entries limit for very early memblock
regions (which is unrelated to E820), via INIT_MEMBLOCK_REGIONS ...

So change the comment to a more comprehensible version:

        /*
         * The bootstrap memblock region count maximum is 128 entries
         * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
         * than that - so allow memblock resizing.
         *
         * This is safe, because this call happens pretty late during x86 setup,
         * so we know about reserved memory regions already. (This is important
         * so that memblock resizing does no stomp over reserved areas.)
         */
        memblock_allow_resize();

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 14:42:27 +01:00
Ingo Molnar
544a0f47e7 x86/boot/e820: Rename e820_table_saved to e820_table_firmware and improve the description
So the 'e820_table_saved' is a bit of a misnomer that hides its real purpose.

At first sight the name suggests that it's some sort save/restore mechanism,
as this is how we typically name such facilities in the kernel.

But that is not so, e820_table_saved is the original firmware version of the
e820 table, not modified by the kernel. This table is displayed in the
/sys/firmware/memmap file, and it's also used by the hibernation code to
calculate a physical memory layout MD5 fingerprint checksum which is
invariant of the kernel.

So rename it to 'e820_table_firmware' and update all the comments to better
describe the main e820 data strutures.

Also rename:

  'initial_e820_table_saved'  =>  'e820_table_firmware_init'
  'e820_update_range_saved'   =>  'e820_update_range_firmware'

... to better match the new nomenclature.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 14:42:27 +01:00
Ingo Molnar
103e206309 x86/boot/e820: Rename default_machine_specific_memory_setup() to e820__memory_setup_default()
The default_machine_specific_memory_setup() is a mouthful and despite the
many words it doesn't actually tell us clearly what it does.

The function is the x86 legacy memory layout setup code, based on
E820-formatted memory layout information passed by the bootloader
via the boot_params.

Rename it to e820__memory_setup_default() to better signal its purpose.

Also rename the related higher level function to be consistent with
this new naming:

    setup_memory_map() => e820__memory_setup()

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 14:42:26 +01:00
Ingo Molnar
bf495573fa x86/boot/e820: Harmonize the 'struct e820_table' fields
So the e820_table->map and e820_table->nr_map names are a bit
confusing, because it's not clear what a 'map' really means
(it could be a bitmap, or some other data structure), nor is
it clear what nr_map means (is it a current index, or some
other count).

Rename the fields from:

 e820_table->map        =>     e820_table->entries
 e820_table->nr_map     =>     e820_table->nr_entries

which makes it abundantly clear that these are entries
of the table, and that the size of the table is ->nr_entries.

Propagate the changes to all affected files. Where necessary,
adjust local variable names to better reflect the new field names.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 09:33:16 +01:00
Ingo Molnar
61a5010163 x86/boot/e820: Rename everything to e820_table
No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 09:33:16 +01:00
Ingo Molnar
acd4c04872 x86/boot/e820: Rename 'e820_map' variables to 'e820_array'
In line with the rename to 'struct e820_array', harmonize the naming of common e820
table variable names as well:

 e820          =>  e820_array
 e820_saved    =>  e820_array_saved
 e820_map      =>  e820_array
 initial_e820  =>  e820_array_init

This makes the variable names more consistent  and easier to grep for.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 09:33:15 +01:00
Ingo Molnar
8ec67d97bf x86/boot/e820: Rename the basic e820 data types to 'struct e820_entry' and 'struct e820_array'
The 'e820entry' and 'e820map' names have various annoyances:

 - the missing underscore departs from the usual kernel style
   and makes the code look weird,

 - in the past I kept confusing the 'map' with the 'entry', because
   a 'map' is ambiguous in that regard,

 - it's not really clear from the 'e820map' that this is a regular
   C array.

Rename them to 'struct e820_entry' and 'struct e820_array' accordingly.

( Leave the legacy UAPI header alone but do the rename in the bootparam.h
  and e820/types.h file - outside tools relying on these defines should
  either adjust their code, or should use the legacy header, or should
  create their private copies for the definitions. )

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 09:33:14 +01:00
Ingo Molnar
66441bd3cf x86/boot/e820: Move asm/e820.h to asm/e820/api.h
In line with asm/e820/types.h, move the e820 API declarations to
asm/e820/api.h and update all usage sites.

This is just a mechanical, obviously correct move & replace patch,
there will be subsequent changes to clean up the code and to make
better use of the new header organization.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 09:31:13 +01:00
Xunlei Pang
a8d4c8246b x86/crash: Update the stale comment in reserve_crashkernel()
CRASH_KERNEL_ADDR_MAX has been missing for a long time,
update it with a more detailed explanation.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Young <dyoung@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert LeBlanc <robert@leblancnet.us>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kexec@lists.infradead.org
Link: http://lkml.kernel.org/r/1485154103-18426-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-23 08:57:55 +01:00
Reza Arbab
39fa104d5b mm: remove x86-only restriction of movable_node
In commit c5320926e3 ("mem-hotplug: introduce movable_node boot
option"), the memblock allocation direction is changed to bottom-up and
then back to top-down like this:

1. memblock_set_bottom_up(true), called by cmdline_parse_movable_node().
2. memblock_set_bottom_up(false), called by x86's numa_init().

Even though (1) occurs in generic mm code, it is wrapped by #ifdef
CONFIG_MOVABLE_NODE, which depends on X86_64.

This means that when we extend CONFIG_MOVABLE_NODE to non-x86 arches,
things will be unbalanced.  (1) will happen for them, but (2) will not.

This toggle was added in the first place because x86 has a delay between
adding memblocks and marking them as hotpluggable.  Since other arches
do this marking either immediately or not at all, they do not require
the bottom-up toggle.

So, resolve things by moving (1) from cmdline_parse_movable_node() to
x86's setup_arch(), immediately after the movable_node parameter has
been parsed.

Link: http://lkml.kernel.org/r/1479160961-25840-3-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alistair Popple <apopple@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Thomas Gleixner
1e90a13d0c x86/smpboot: Init apic mapping before usage
The recent changes, which forced the registration of the boot cpu on UP
systems, which do not have ACPI tables, have been fixed for systems w/o
local APIC, but left a wreckage for systems which have neither ACPI nor
mptables, but the CPU has an APIC, e.g. virtualbox.

The boot process crashes in prefill_possible_map() as it wants to register
the boot cpu, which needs to access the local apic, but the local APIC is
not yet mapped.

There is no reason why init_apic_mapping() can't be invoked before
prefill_possible_map(). So instead of playing another silly early mapping
game, as the ACPI/mptables code does, we just move init_apic_mapping()
before the call to prefill_possible_map().

In hindsight, I should have noticed that combination earlier.

Sorry for the churn (also in stable)!

Fixes: ff8560512b ("x86/boot/smp: Don't try to poke disabled/non-existent APIC")
Reported-and-debugged-by: Michal Necasek <michal.necasek@oracle.com>
Reported-and-tested-by: Wolfgang Bauer <wbauer@tmo.at>
Cc: prarit@redhat.com
Cc: ville.syrjala@linux.intel.com
Cc: michael.thayer@oracle.com
Cc: knut.osmundsen@oracle.com
Cc: frank.mehnert@oracle.com
Cc: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610282114380.5053@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-29 14:00:46 +02:00
Linus Torvalds
3ef0a61a46 Merge branch 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:
 "The changes in this cycle were:

   - Save e820 table RAM footprint on larger kernel configurations.
     (Denys Vlasenko)

   - pmem related fixes (Dan Williams)

   - theoretical e820 boundary condition fix (Wei Yang)"

* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Fix kdump, cleanup aborted E820_PRAM max_pfn manipulation
  x86/e820: Use much less memory for e820/e820_saved, save up to 120k
  x86/e820: Prepare e280 code for switch to dynamic storage
  x86/e820: Mark some static functions __init
  x86/e820: Fix very large 'size' handling boundary condition
2016-10-03 16:46:53 -07:00
Linus Torvalds
1a4a2bc460 Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull low-level x86 updates from Ingo Molnar:
 "In this cycle this topic tree has become one of those 'super topics'
  that accumulated a lot of changes:

   - Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on
     x86 - preceded by an array of changes. v4.8 saw preparatory changes
     in this area already - this is the rest of the work. Includes the
     thread stack caching performance optimization. (Andy Lutomirski)

   - switch_to() cleanups and all around enhancements. (Brian Gerst)

   - A large number of dumpstack infrastructure enhancements and an
     unwinder abstraction. The secret long term plan is safe(r) live
     patching plus maybe another attempt at debuginfo based unwinding -
     but all these current bits are standalone enhancements in a frame
     pointer based debug environment as well. (Josh Poimboeuf)

   - More __ro_after_init and const annotations. (Kees Cook)

   - Enable KASLR for the vmemmap memory region. (Thomas Garnier)"

[ The virtually mapped stack changes are pretty fundamental, and not
  x86-specific per se, even if they are only used on x86 right now. ]

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  x86/asm: Get rid of __read_cr4_safe()
  thread_info: Use unsigned long for flags
  x86/alternatives: Add stack frame dependency to alternative_call_2()
  x86/dumpstack: Fix show_stack() task pointer regression
  x86/dumpstack: Remove dump_trace() and related callbacks
  x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder
  oprofile/x86: Convert x86_backtrace() to use the new unwinder
  x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder
  perf/x86: Convert perf_callchain_kernel() to use the new unwinder
  x86/unwind: Add new unwind interface and implementations
  x86/dumpstack: Remove NULL task pointer convention
  fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
  sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
  lib/syscall: Pin the task stack in collect_syscall()
  x86/process: Pin the target stack in get_wchan()
  x86/dumpstack: Pin the target stack when dumping it
  kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
  sched/core: Add try_get_task_stack() and put_task_stack()
  x86/entry/64: Fix a minor comment rebase error
  iommu/amd: Don't put completion-wait semaphore on stack
  ...
2016-10-03 16:13:28 -07:00
Linus Torvalds
110a9e42b6 Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 apic updates from Ingo Molnar:
 "The main changes are:

   - Persistent CPU/node numbering across CPU hotplug/unplug events.
     This is a pretty involved series of changes that first fetches all
     the information during bootup and then uses it for the various
     hotplug/unplug methods. (Gu Zheng, Dou Liyang)

   - IO-APIC hot-add/remove fixes and enhancements. (Rui Wang)

   - ... various fixes, cleanups and enhancements"

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  x86/apic: Fix silent & fatal merge conflict in __generic_processor_info()
  acpi: Fix broken error check in map_processor()
  acpi: Validate processor id when mapping the processor
  acpi: Provide mechanism to validate processors in the ACPI tables
  x86/acpi: Set persistent cpuid <-> nodeid mapping when booting
  x86/acpi: Enable MADT APIs to return disabled apicids
  x86/acpi: Introduce persistent storage for cpuid <-> apicid mapping
  x86/acpi: Enable acpi to register all possible cpus at boot time
  x86/numa: Online memory-less nodes at boot time
  x86/apic: Get rid of apic_version[] array
  x86/apic: Order irq_enter/exit() calls correctly vs. ack_APIC_irq()
  x86/ioapic: Ignore root bridges without a companion ACPI device
  x86/apic: Update comment about disabling processor focus
  x86/smpboot: Check APIC ID before setting up default routing
  x86/ioapic: Fix IOAPIC failing to request resource
  x86/ioapic: Fix lost IOAPIC resource after hot-removal and hotadd
  x86/ioapic: Fix setup_res() failing to get resource
  x86/ioapic: Support hot-removal of IOAPICs present during boot
  x86/ioapic: Change prototype of acpi_ioapic_add()
  x86/apic, ACPI: Fix incorrect assignment when handling apic/x2apic entries
  ...
2016-10-03 15:36:06 -07:00
Linus Torvalds
de956b8f45 Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Ingo Molnar:
 "Main changes in this cycle were:

   - Refactor the EFI memory map code into architecture neutral files
     and allow drivers to permanently reserve EFI boot services regions
     on x86, as well as ARM/arm64. (Matt Fleming)

   - Add ARM support for the EFI ESRT driver. (Ard Biesheuvel)

   - Make the EFI runtime services and efivar API interruptible by
     swapping spinlocks for semaphores. (Sylvain Chouleur)

   - Provide the EFI identity mapping for kexec which allows kexec to
     work on SGI/UV platforms with requiring the "noefi" kernel command
     line parameter. (Alex Thorlton)

   - Add debugfs node to dump EFI page tables on arm64. (Ard Biesheuvel)

   - Merge the EFI test driver being carried out of tree until now in
     the FWTS project. (Ivan Hu)

   - Expand the list of flags for classifying EFI regions as "RAM" on
     arm64 so we align with the UEFI spec. (Ard Biesheuvel)

   - Optimise out the EFI mixed mode if it's unsupported (CONFIG_X86_32)
     or disabled (CONFIG_EFI_MIXED=n) and switch the early EFI boot
     services function table for direct calls, alleviating us from
     having to maintain the custom function table. (Lukas Wunner)

   - Miscellaneous cleanups and fixes"

* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  x86/efi: Round EFI memmap reservations to EFI_PAGE_SIZE
  x86/efi: Allow invocation of arbitrary boot services
  x86/efi: Optimize away setup_gop32/64 if unused
  x86/efi: Use kmalloc_array() in efi_call_phys_prolog()
  efi/arm64: Treat regions with WT/WC set but WB cleared as memory
  efi: Add efi_test driver for exporting UEFI runtime service interfaces
  x86/efi: Defer efi_esrt_init until after memblock_x86_fill
  efi/arm64: Add debugfs node to dump UEFI runtime page tables
  x86/efi: Remove unused find_bits() function
  fs/efivarfs: Fix double kfree() in error path
  x86/efi: Map in physical addresses in efi_map_region_fixed
  lib/ucs2_string: Speed up ucs2_utf8size()
  firmware-gsmi: Delete an unnecessary check before the function call "dma_pool_destroy"
  x86/efi: Initialize status to ensure garbage is not returned on small size
  efi: Replace runtime services spinlock with semaphore
  efi: Don't use spinlocks for efi vars
  efi: Use a file local lock for efivars
  efi/arm*: esrt: Add missing call to efi_esrt_init()
  efi/esrt: Use memremap not ioremap to access ESRT table in memory
  x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data
  ...
2016-10-03 11:33:18 -07:00
Andy Lutomirski
1ef55be16e x86/asm: Get rid of __read_cr4_safe()
We use __read_cr4() vs __read_cr4_safe() inconsistently.  On
CR4-less CPUs, all CR4 bits are effectively clear, so we can make
the code simpler and more robust by making __read_cr4() always fix
up faults on 32-bit kernels.

This may fix some bugs on old 486-like CPUs, but I don't have any
easy way to test that.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: david@saggiorato.net
Link: http://lkml.kernel.org/r/ea647033d357d9ce2ad2bbde5a631045f5052fb6.1475178370.git.luto@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-30 12:40:12 +02:00
Thomas Gleixner
d7e25c66c9 Merge branch 'x86/urgent' into x86/asm
Get the cr4 fixes so we can apply the final cleanup
2016-09-30 12:38:28 +02:00
Andy Lutomirski
192d1dccbf x86/boot: Fix another __read_cr4() case on 486
The condition for reading CR4 was wrong: there are some CPUs with
CPUID but not CR4.  Rather than trying to make the condition exact,
use __read_cr4_safe().

Fixes: 18bc7bd523 ("x86/boot: Synchronize trampoline_cr4_features and mmu_cr4_features directly")
Reported-by: david@saggiorato.net
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Link: http://lkml.kernel.org/r/8c453a61c4f44ab6ff43c29780ba04835234d2e5.1475178369.git.luto@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-30 12:37:40 +02:00
Denys Vlasenko
475339684e x86/e820: Prepare e280 code for switch to dynamic storage
This patch turns e820 and e820_saved into pointers to e820 tables,
of the same size as before.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20160917213927.1787-2-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-21 15:02:12 +02:00
Ricardo Neri
3dad6f7f69 x86/efi: Defer efi_esrt_init until after memblock_x86_fill
Commit 7b02d53e7852 ("efi: Allow drivers to reserve boot services forever")
introduced a new efi_mem_reserve to reserve the boot services memory
regions forever. This reservation involves allocating a new EFI memory
range descriptor. However, allocation can only succeed if there is memory
available for the allocation. Otherwise, error such as the following may
occur:

esrt: Reserving ESRT space from 0x000000003dd6a000 to 0x000000003dd6a010.
Kernel panic - not syncing: ERROR: Failed to allocate 0x9f0 bytes below \
 0x0.
CPU: 0 PID: 0 Comm: swapper Not tainted 4.7.0-rc5+ #503
 0000000000000000 ffffffff81e03ce0 ffffffff8131dae8 ffffffff81bb6c50
 ffffffff81e03d70 ffffffff81e03d60 ffffffff8111f4df 0000000000000018
 ffffffff81e03d70 ffffffff81e03d08 00000000000009f0 00000000000009f0
Call Trace:
 [<ffffffff8131dae8>] dump_stack+0x4d/0x65
 [<ffffffff8111f4df>] panic+0xc5/0x206
 [<ffffffff81f7c6d3>] memblock_alloc_base+0x29/0x2e
 [<ffffffff81f7c6e3>] memblock_alloc+0xb/0xd
 [<ffffffff81f6c86d>] efi_arch_mem_reserve+0xbc/0x134
 [<ffffffff81fa3280>] efi_mem_reserve+0x2c/0x31
 [<ffffffff81fa3280>] ? efi_mem_reserve+0x2c/0x31
 [<ffffffff81fa40d3>] efi_esrt_init+0x19e/0x1b4
 [<ffffffff81f6d2dd>] efi_init+0x398/0x44a
 [<ffffffff81f5c782>] setup_arch+0x415/0xc30
 [<ffffffff81f55af1>] start_kernel+0x5b/0x3ef
 [<ffffffff81f55434>] x86_64_start_reservations+0x2f/0x31
 [<ffffffff81f55520>] x86_64_start_kernel+0xea/0xed
---[ end Kernel panic - not syncing: ERROR: Failed to allocate 0x9f0
     bytes below 0x0.

An inspection of the memblock configuration reveals that there is no memory
available for the allocation:

MEMBLOCK configuration:
 memory size = 0x0 reserved size = 0x4f339c0
 memory.cnt  = 0x1
 memory[0x0]    [0x00000000000000-0xffffffffffffffff], 0x0 bytes on node 0\
                 flags: 0x0
 reserved.cnt  = 0x4
 reserved[0x0]  [0x0000000008c000-0x0000000008c9bf], 0x9c0 bytes flags: 0x0
 reserved[0x1]  [0x0000000009f000-0x000000000fffff], 0x61000 bytes\
                 flags: 0x0
 reserved[0x2]  [0x00000002800000-0x0000000394bfff], 0x114c000 bytes\
                 flags: 0x0
 reserved[0x3]  [0x000000304e4000-0x00000034269fff], 0x3d86000 bytes\
                 flags: 0x0

This situation can be avoided if we call efi_esrt_init after memblock has
memory regions for the allocation.

Also, the EFI ESRT driver makes use of early_memremap'pings. Therfore, we
do not want to defer efi_esrt_init for too long. We must call such function
while calls to early_memremap are still valid.

A good place to meet the two aforementioned conditions is right after
memblock_x86_fill, grouped with other EFI-related functions.

Reported-by: Scott Lawson <scott.lawson@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Peter Jones <pjones@redhat.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-09-09 16:08:52 +01:00
Matt Fleming
4971531af3 x86/efi: Test for EFI_MEMMAP functionality when iterating EFI memmap
Both efi_find_mirror() and efi_fake_memmap() really want to know
whether the EFI memory map is available, not just whether the machine
was booted using EFI. efi_fake_memmap() even has a check for
EFI_MEMMAP at the start of the function.

Since we've already got other code that has this dependency, merge
everything under one if() conditional, and remove the now superfluous
check from efi_fake_memmap().

Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-09-09 16:06:34 +01:00
Ingo Molnar
2b3061c77c Merge branch 'x86/mm' into x86/asm, to unify the two branches for simplicity
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:41:52 +02:00
Baoquan He
a91bf718db x86/mm/numa: Open code function early_get_boot_cpu_id()
Previously early_acpi_boot_init() was called in early_get_boot_cpu_id()
to get the value for boot_cpu_physical_apicid. Now early_acpi_boot_init()
has been taken out and moved to setup_arch(), the name of
early_get_boot_cpu_id() doesn't match its implementation anymore, and
only the getting boot-time SMP configuration code was left.

So in this patch we open code it.

Also move the smp_found_config check into default_get_smp_config to
simplify code, because both early_get_smp_config() and get_smp_config()
call x86_init.mpparse.get_smp_config().

Also remove the redundent CONFIG_X86_MPPARSE #ifdef check when we call
early_get_smp_config().

Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-acpi@vger.kernel.org
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/1470985033-22493-1-git-send-email-bhe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-15 08:51:54 +02:00
Andy Lutomirski
d0de0f685d x86/boot: Defer setup_real_mode() to early_initcall time
There's no need to run setup_real_mode() as early as we run it.
Defer it to the same early_initcall that sets up the page
permissions for the real mode code.

This should be a code size reduction.  More importantly, it give us
a longer window in which we can allocate the real mode trampoline.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <mario_limonciello@dell.com>
Cc: Matt Fleming <mfleming@suse.de>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/fd62f0da4f79357695e9bf3e365623736b05f119.1470821230.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11 11:15:00 +02:00
Andy Lutomirski
18bc7bd523 x86/boot: Synchronize trampoline_cr4_features and mmu_cr4_features directly
The initialization process for trampoline_cr4_features and
mmu_cr4_features was confusing.  The intent is for mmu_cr4_features
and *trampoline_cr4_features to stay in sync, but
trampoline_cr4_features is NULL until setup_real_mode() runs.  The
old code synchronized *trampoline_cr4_features *twice*, once in
setup_real_mode() and once in setup_arch().  It also initialized
mmu_cr4_features in setup_real_mode(), which causes the actual value
of mmu_cr4_features to potentially depend on when setup_real_mode()
is called.

With this patch, mmu_cr4_features is initialized directly in
setup_arch(), and *trampoline_cr4_features is synchronized to
mmu_cr4_features when the trampoline is set up.

After this patch, it should be safe to defer setup_real_mode().

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <mario_limonciello@dell.com>
Cc: Matt Fleming <mfleming@suse.de>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/d48a263f9912389b957dd495a7127b009259ffe0.1470821230.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11 11:15:00 +02:00
Andy Lutomirski
007b756053 x86/boot: Run reserve_bios_regions() after we initialize the memory map
reserve_bios_regions() is a quirk that reserves memory that we might
otherwise think is available.  There's no need to run it so early,
and running it before we have the memory map initialized with its
non-quirky inputs makes it hard to make reserve_bios_regions() more
intelligent.

Move it right after we populate the memblock state.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <mario_limonciello@dell.com>
Cc: Matt Fleming <mfleming@suse.de>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/59f58618911005c799c6c9979ce6ae4881d907c2.1470821230.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11 11:14:59 +02:00
Kees Cook
404f6aac9b x86: Apply more __ro_after_init and const
Guided by grsecurity's analogous __read_only markings in arch/x86,
this applies several uses of __ro_after_init to structures that are
only updated during __init, and const for some structures that are
never updated.  Additionally extends __init markings to some functions
that are only used during __init, and cleans up some missing C99 style
static initializers.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Brown <david.brown@linaro.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-hardening@lists.openwall.com
Link: http://lkml.kernel.org/r/20160808232906.GA29731@www.outflux.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:55:05 +02:00
Thomas Garnier
c7d2361f75 x86/mm/KASLR: Fix physical memory calculation on KASLR memory randomization
Initialize KASLR memory randomization after max_pfn is initialized. Also
ensure the size is rounded up. It could create problems on machines
with more than 1Tb of memory on certain random addresses.

Signed-off-by: Thomas Garnier <thgarnie@google.com>
Cc: Aleksey Makarov <aleksey.makarov@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: kernel-hardening@lists.openwall.com
Fixes: 021182e52f ("Enable KASLR for physical mapping memory regions")
Link: http://lkml.kernel.org/r/1470762665-88032-1-git-send-email-thgarnie@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:45:19 +02:00
Linus Torvalds
aeb35d6b74 Merge branch 'x86-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 header cleanups from Ingo Molnar:
 "This tree is a cleanup of the x86 tree reducing spurious uses of
  module.h - which should improve build performance a bit"

* 'x86-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, crypto: Restore MODULE_LICENSE() to glue_helper.c so it loads
  x86/apic: Remove duplicated include from probe_64.c
  x86/ce4100: Remove duplicated include from ce4100.c
  x86/headers: Include spinlock_types.h in x8664_ksyms_64.c for missing spinlock_t
  x86/platform: Delete extraneous MODULE_* tags fromm ts5500
  x86: Audit and remove any remaining unnecessary uses of module.h
  x86/kvm: Audit and remove any unnecessary uses of module.h
  x86/xen: Audit and remove any unnecessary uses of module.h
  x86/platform: Audit and remove any unnecessary uses of module.h
  x86/lib: Audit and remove any unnecessary uses of module.h
  x86/kernel: Audit and remove any unnecessary uses of module.h
  x86/mm: Audit and remove any unnecessary uses of module.h
  x86: Don't use module.h just for AUTHOR / LICENSE tags
2016-08-01 14:23:42 -04:00
Linus Torvalds
e663107fa1 ACPI material for v4.8-rc1
- Support for ACPI SSDT overlays allowing Secondary System
    Description Tables (SSDTs) to be loaded at any time from EFI
    variables or via configfs (Octavian Purdila, Mika Westerberg).
 
  - Support for the ACPI LPI (Low-Power Idle) feature introduced in
    ACPI 6.0 and allowing processor idle states to be represented in
    ACPI tables in a hierarchical way (with the help of Processor
    Container objects) and support for ACPI idle states management
    on ARM64, based on LPI (Sudeep Holla).
 
  - General improvements of ACPI support for NUMA and ARM64 support
    for ACPI-based NUMA (Hanjun Guo, David Daney, Robert Richter).
 
  - General improvements of the ACPI table upgrade mechanism and
    ARM64 support for that feature (Aleksey Makarov, Jon Masters).
 
  - Support for the Boot Error Record Table (BERT) in APEI and
    improvements of kernel messages printed by the error injection
    code (Huang Ying, Borislav Petkov).
 
  - New driver for the Intel Broxton WhiskeyCove PMIC operation
    region and support for the REGS operation region on Broxton,
    PMIC code cleanups (Bin Gao, Felipe Balbi, Paul Gortmaker).
 
  - New driver for the power participant device which is part of the
    Dynamic Power and Thermal Framework (DPTF) and DPTF-related code
    reorganization (Srinivas Pandruvada).
 
  - Support for the platform-initiated graceful shutdown feature
    introduced in ACPI 6.1 (Prashanth Prakash).
 
  - ACPI button driver update related to lid input events generated
    automatically on initialization and system resume that have been
    problematic for some time (Lv Zheng).
 
  - ACPI EC driver cleanups (Lv Zheng).
 
  - Documentation of the ACPICA release automation process and the
    in-kernel ACPI AML debugger (Lv Zheng).
 
  - New blacklist entry and two fixes for the ACPI backlight driver
    (Alex Hung, Arvind Yadav, Ralf Gerbig).
 
  - Cleanups of the ACPI pci_slot driver (Joe Perches, Paul Gortmaker).
 
  - ACPI CPPC code changes to make it more robust against possible
    defects in ACPI tables and new symbol definitions for PCC (Hoan
    Tran).
 
  - System reboot code modification to execute the ACPI _PTS (Prepare
    To Sleep) method in addition to _TTS (Ocean He).
 
  - ACPICA-related change to carry out lock ordering checks in ACPICA
    if ACPICA debug is enabled in the kernel (Lv Zheng).
 
  - Assorted minor fixes and cleanups (Andy Shevchenko, Baoquan He,
    Bhaktipriya Shridhar, Paul Gortmaker, Rafael Wysocki).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJXl8A7AAoJEILEb/54YlRxF0kQAI6mH0yan60Osu4598+VNvgv
 wxOWl1TEbKd+LaJkofRZ+FPzZkQf5c/h/8Oo8Q3LEpFhjkARhhX7ThDjS5v2Nx6v
 I/icQ64ynPUPrw6hGNVrmec9ofZjiAs3j3Rt2bEiae+YN6guvfhWE+kBCHo2G/nN
 o4BSaxYjkphUTDSi4/5BfaocV2sl3apvwjtAj8zgGn4RD81bFFLnblynHkqJVcoN
 HAfm7QTVjT01Zkv565OSZgK8CFcD8Ky2KKKBQvcIW8zQmD6IXaoTHSYSwL0SH+oK
 bxUZUmWVfFWw4kDTAY9mw0QwtWz9ODTWh/WMhs3itWRRN5qHfogs99rCVYFtFufQ
 ODVy4wpt4wmpzZVhyUDTTigAhznPAtCam6EpL1YeNbtyrRN4evvZVFHBZJnmhosX
 zI9iLF4eqdnJZKvh+L1VFU+py8aAZpz1ZEOatNMI+xdhArbGm7v89cldzaRkJhuW
 LZr+JqYQGaOZS5qSnymwJL1KfF66+2QGpzdvzJN5FNIDACoqanATbZ/Iie2ENcM+
 WwCEWrGJFDmM30raBNNcvx0yHFtVkcNbOymla4paVg7i29nu88Ynw4Z6seIIP11C
 DryzLFhw+3jdTg2zK/te/wkhciJ0F+iZjo6VXywSMnwatf36bpdp4r4JLUVfEo2t
 8DOGKyFMLYY1zOPMK9Th
 =YwbM
 -----END PGP SIGNATURE-----

Merge tag 'acpi-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI updates from Rafael Wysocki:
 "The new feaures here are the support for ACPI overlays (allowing ACPI
  tables to be loaded at any time from EFI variables or via configfs)
  and the LPI (Low-Power Idle) support.  Also notable is the ACPI-based
  NUMA support for ARM64.

  Apart from that we have two new drivers, for the DPTF (Dynamic Power
  and Thermal Framework) power participant device and for the Intel
  Broxton WhiskeyCove PMIC, some more PMIC-related changes, support for
  the Boot Error Record Table (BERT) in APEI and support for
  platform-initiated graceful shutdown.

  Plus two new pieces of documentation and usual assorted fixes and
  cleanups in quite a few places.

  Specifics:

   - Support for ACPI SSDT overlays allowing Secondary System
     Description Tables (SSDTs) to be loaded at any time from EFI
     variables or via configfs (Octavian Purdila, Mika Westerberg).

   - Support for the ACPI LPI (Low-Power Idle) feature introduced in
     ACPI 6.0 and allowing processor idle states to be represented in
     ACPI tables in a hierarchical way (with the help of Processor
     Container objects) and support for ACPI idle states management on
     ARM64, based on LPI (Sudeep Holla).

   - General improvements of ACPI support for NUMA and ARM64 support for
     ACPI-based NUMA (Hanjun Guo, David Daney, Robert Richter).

   - General improvements of the ACPI table upgrade mechanism and ARM64
     support for that feature (Aleksey Makarov, Jon Masters).

   - Support for the Boot Error Record Table (BERT) in APEI and
     improvements of kernel messages printed by the error injection code
     (Huang Ying, Borislav Petkov).

   - New driver for the Intel Broxton WhiskeyCove PMIC operation region
     and support for the REGS operation region on Broxton, PMIC code
     cleanups (Bin Gao, Felipe Balbi, Paul Gortmaker).

   - New driver for the power participant device which is part of the
     Dynamic Power and Thermal Framework (DPTF) and DPTF-related code
     reorganization (Srinivas Pandruvada).

   - Support for the platform-initiated graceful shutdown feature
     introduced in ACPI 6.1 (Prashanth Prakash).

   - ACPI button driver update related to lid input events generated
     automatically on initialization and system resume that have been
     problematic for some time (Lv Zheng).

   - ACPI EC driver cleanups (Lv Zheng).

   - Documentation of the ACPICA release automation process and the
     in-kernel ACPI AML debugger (Lv Zheng).

   - New blacklist entry and two fixes for the ACPI backlight driver
     (Alex Hung, Arvind Yadav, Ralf Gerbig).

   - Cleanups of the ACPI pci_slot driver (Joe Perches, Paul Gortmaker).

   - ACPI CPPC code changes to make it more robust against possible
     defects in ACPI tables and new symbol definitions for PCC (Hoan
     Tran).

   - System reboot code modification to execute the ACPI _PTS (Prepare
     To Sleep) method in addition to _TTS (Ocean He).

   - ACPICA-related change to carry out lock ordering checks in ACPICA
     if ACPICA debug is enabled in the kernel (Lv Zheng).

   - Assorted minor fixes and cleanups (Andy Shevchenko, Baoquan He,
     Bhaktipriya Shridhar, Paul Gortmaker, Rafael Wysocki)"

* tag 'acpi-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (71 commits)
  ACPI: enable ACPI_PROCESSOR_IDLE on ARM64
  arm64: add support for ACPI Low Power Idle(LPI)
  drivers: firmware: psci: initialise idle states using ACPI LPI
  cpuidle: introduce CPU_PM_CPU_IDLE_ENTER macro for ARM{32, 64}
  arm64: cpuidle: drop __init section marker to arm_cpuidle_init
  ACPI / processor_idle: Add support for Low Power Idle(LPI) states
  ACPI / processor_idle: introduce ACPI_PROCESSOR_CSTATE
  ACPI / DPTF: move int340x_thermal.c to the DPTF folder
  ACPI / DPTF: Add DPTF power participant driver
  ACPI / lpat: make it explicitly non-modular
  ACPI / dock: make dock explicitly non-modular
  ACPI / PCI: make pci_slot explicitly non-modular
  ACPI / PMIC: remove modular references from non-modular code
  ACPICA: Linux: Enable ACPI_MUTEX_DEBUG for Linux kernel
  ACPI: Rename configfs.c to acpi_configfs.c to prevent link error
  ACPI / debugger: Add AML debugger documentation
  ACPI: Add documentation describing ACPICA release automation
  ACPI: add support for loading SSDTs via configfs
  ACPI: add support for configfs
  efi / ACPI: load SSTDs from EFI variables
  ...
2016-07-26 17:56:45 -07:00
Paul Gortmaker
186f43608a x86/kernel: Audit and remove any unnecessary uses of module.h
Historically a lot of these existed because we did not have
a distinction between what was modular code and what was providing
support to modules via EXPORT_SYMBOL and friends.  That changed
when we forked out support for the latter into the export.h file.

This means we should be able to reduce the usage of module.h
in code that is obj-y Makefile or bool Kconfig.  The advantage
in doing so is that module.h itself sources about 15 other headers;
adding significantly to what we feed cpp, and it can obscure what
headers we are effectively using.

Since module.h was the source for init.h (for __init) and for
export.h (for EXPORT_SYMBOL) we consider each obj-y/bool instance
for the presence of either and replace as needed.  Build testing
revealed some implicit header usage that was fixed up accordingly.

Note that some bool/obj-y instances remain since module.h is
the header for some exception table entry stuff, and for things
like __init_or_module (code that is tossed when MODULES=n).

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160714001901.31603-4-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 15:06:41 +02:00
Thomas Garnier
0483e1fa6e x86/mm: Implement ASLR for kernel memory regions
Randomizes the virtual address space of kernel memory regions for
x86_64. This first patch adds the infrastructure and does not randomize
any region. The following patches will randomize the physical memory
mapping, vmalloc and vmemmap regions.

This security feature mitigates exploits relying on predictable kernel
addresses. These addresses can be used to disclose the kernel modules
base addresses or corrupt specific structures to elevate privileges
bypassing the current implementation of KASLR. This feature can be
enabled with the CONFIG_RANDOMIZE_MEMORY option.

The order of each memory region is not changed. The feature looks at the
available space for the regions based on different configuration options
and randomizes the base and space between each. The size of the physical
memory mapping is the available physical memory. No performance impact
was detected while testing the feature.

Entropy is generated using the KASLR early boot functions now shared in
the lib directory (originally written by Kees Cook). Randomization is
done on PGD & PUD page table levels to increase possible addresses. The
physical memory mapping code was adapted to support PUD level virtual
addresses. This implementation on the best configuration provides 30,000
possible virtual addresses in average for each memory region.  An
additional low memory page is used to ensure each CPU can start with a
PGD aligned virtual address (for realmode).

x86/dump_pagetable was updated to correctly display each region.

Updated documentation on x86_64 memory layout accordingly.

Performance data, after all patches in the series:

Kernbench shows almost no difference (-+ less than 1%):

Before:

Average Optimal load -j 12 Run (std deviation): Elapsed Time 102.63 (1.2695)
User Time 1034.89 (1.18115) System Time 87.056 (0.456416) Percent CPU 1092.9
(13.892) Context Switches 199805 (3455.33) Sleeps 97907.8 (900.636)

After:

Average Optimal load -j 12 Run (std deviation): Elapsed Time 102.489 (1.10636)
User Time 1034.86 (1.36053) System Time 87.764 (0.49345) Percent CPU 1095
(12.7715) Context Switches 199036 (4298.1) Sleeps 97681.6 (1031.11)

Hackbench shows 0% difference on average (hackbench 90 repeated 10 times):

attemp,before,after 1,0.076,0.069 2,0.072,0.069 3,0.066,0.066 4,0.066,0.068
5,0.066,0.067 6,0.066,0.069 7,0.067,0.066 8,0.063,0.067 9,0.067,0.065
10,0.068,0.071 average,0.0677,0.0677

Signed-off-by: Thomas Garnier <thgarnie@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Alexander Popov <alpopov@ptsecurity.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-doc@vger.kernel.org
Link: http://lkml.kernel.org/r/1466556426-32664-6-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-08 17:33:46 +02:00
Aleksey Makarov
da3d3f98d2 ACPI / tables: table upgrade: refactor function definitions
Refer initrd_start, initrd_end directly from drivers/acpi/tables.c.
This allows to use the table upgrade feature in architectures
other than x86.  Also this simplifies header files.

The patch renames acpi_table_initrd_init() to acpi_table_upgrade()
(what reflects the purpose of the function) and removes the unneeded
wraps early_acpi_table_init() and early_initrd_acpi_init().

Signed-off-by: Aleksey Makarov <aleksey.makarov@linaro.org>
Acked-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-06-22 01:16:14 +02:00
Lv Zheng
af06f8b7a1 ACPI / x86: Cleanup initrd related code
In arch/x86/kernel/setup.c, the #ifdef kept for CONFIG_ACPI actually is
related to the accessibility of initrd_start/initrd_end, so the stub should
be provided from this source file and should only be dependent on
CONFIG_BLK_DEV_INITRD.

Note that when ACPI=n and BLK_DEV_INITRD=y, early_initrd_acpi_init() is
still a stub because of the stub prepared for early_acpi_table_init().

Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-04-18 23:59:09 +02:00
Lv Zheng
5ae74f2cc2 ACPI / tables: Move table override mechanisms to tables.c
This patch moves acpi_os_table_override() and
acpi_os_physical_table_override() to tables.c.

Along with the mechanisms, acpi_initrd_initialize_tables() is also moved to
tables.c to form a static function. The following functions are renamed
according to this change:
 1. acpi_initrd_override() -> renamed to early_acpi_table_init(), which
    invokes acpi_table_initrd_init()
 2. acpi_os_physical_table_override() -> which invokes
    acpi_table_initrd_override()
 3. acpi_initialize_initrd_tables() -> renamed to acpi_table_initrd_scan()

Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-04-18 23:59:08 +02:00
Linus Torvalds
4046d6e81f Revert "x86: remove the kernel code/data/bss resources from /proc/iomem"
This reverts commit c4004b02f8.

Sadly, my hope that nobody would actually use the special kernel entries
in /proc/iomem were dashed by kexec.  Which reads /proc/iomem explicitly
to find the kernel base address.  Nasty.

Anyway, that means we can't do the sane and simple thing and just remove
the entries, and we'll instead have to mask them out based on permissions.

Reported-by: Zhengyu Zhang <zhezhang@redhat.com>
Reported-by: Dave Young <dyoung@redhat.com>
Reported-by: Freeman Zhang <freeman.zhang1992@gmail.com>
Reported-by: Emrah Demir <ed@abdsec.com>
Reported-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-14 12:55:32 -07:00
Linus Torvalds
c4004b02f8 x86: remove the kernel code/data/bss resources from /proc/iomem
Let's see if anybody even notices.  I doubt anybody uses this, and it
does expose addresses that should be randomized, so let's just remove
the code.  It's old and traditional, and it used to be cute, but we
should have removed this long ago.

If it turns out anybody notices and this breaks something, we'll have to
revert this, and maybe we'll end up using other approaches instead
(using %pK or similar).  But removing unnecessary code is always the
preferred option.

Noted-by: Emrah Demir <ed@abdsec.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-06 13:45:07 -07:00
Linus Torvalds
643ad15d47 Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 protection key support from Ingo Molnar:
 "This tree adds support for a new memory protection hardware feature
  that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

  There's a background article at LWN.net:

      https://lwn.net/Articles/643797/

  The gist is that protection keys allow the encoding of
  user-controllable permission masks in the pte.  So instead of having a
  fixed protection mask in the pte (which needs a system call to change
  and works on a per page basis), the user can map a (handful of)
  protection mask variants and can change the masks runtime relatively
  cheaply, without having to change every single page in the affected
  virtual memory range.

  This allows the dynamic switching of the protection bits of large
  amounts of virtual memory, via user-space instructions.  It also
  allows more precise control of MMU permission bits: for example the
  executable bit is separate from the read bit (see more about that
  below).

  This tree adds the MM infrastructure and low level x86 glue needed for
  that, plus it adds a high level API to make use of protection keys -
  if a user-space application calls:

        mmap(..., PROT_EXEC);

  or

        mprotect(ptr, sz, PROT_EXEC);

  (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
  this special case, and will set a special protection key on this
  memory range.  It also sets the appropriate bits in the Protection
  Keys User Rights (PKRU) register so that the memory becomes unreadable
  and unwritable.

  So using protection keys the kernel is able to implement 'true'
  PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
  PROT_READ as well.  Unreadable executable mappings have security
  advantages: they cannot be read via information leaks to figure out
  ASLR details, nor can they be scanned for ROP gadgets - and they
  cannot be used by exploits for data purposes either.

  We know about no user-space code that relies on pure PROT_EXEC
  mappings today, but binary loaders could start making use of this new
  feature to map binaries and libraries in a more secure fashion.

  There is other pending pkeys work that offers more high level system
  call APIs to manage protection keys - but those are not part of this
  pull request.

  Right now there's a Kconfig that controls this feature
  (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
  (like most x86 CPU feature enablement code that has no runtime
  overhead), but it's not user-configurable at the moment.  If there's
  any serious problem with this then we can make it configurable and/or
  flip the default"

* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
  mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
  x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
  mm/core, x86/mm/pkeys: Add execute-only protection keys support
  x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
  x86/mm/pkeys: Allow kernel to modify user pkey rights register
  x86/fpu: Allow setting of XSAVE state
  x86/mm: Factor out LDT init from context init
  mm/core, x86/mm/pkeys: Add arch_validate_pkey()
  mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
  x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
  x86/mm/pkeys: Add Kconfig prompt to existing config option
  x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
  x86/mm/pkeys: Dump PKRU with other kernel registers
  mm/core, x86/mm/pkeys: Differentiate instruction fetches
  x86/mm/pkeys: Optimize fault handling in access_error()
  mm/core: Do not enforce PKEY permissions on remote mm access
  um, pkeys: Add UML arch_*_access_permitted() methods
  mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
  x86/mm/gup: Simplify get_user_pages() PTE bit handling
  ...
2016-03-20 19:08:56 -07:00
Dave Hansen
c1192f8428 x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
The protection key can now be just as important as read/write
permissions on a VMA.  We need some debug mechanism to help
figure out if it is in play.  smaps seems like a logical
place to expose it.

arch/x86/kernel/setup.c is a bit of a weirdo place to put
this code, but it already had seq_file.h and there was not
a much better existing place to put it.

We also use no #ifdef.  If protection keys is .config'd out we
will effectively get the same function as if we used the weak
generic function.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Mark Williamson <mwilliamson@undo-software.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210227.4F8EB3F8@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-18 19:46:29 +01:00
Toshi Kani
f33b14a4b9 x86/e820: Set System RAM type and descriptor
Change e820_reserve_resources() to set 'flags' and 'desc' from
e820 types.

Set E820_RESERVED_KERN and E820_RAM's (System RAM) io resource
type to IORESOURCE_SYSTEM_RAM.

Do the same for "Kernel data", "Kernel code", and "Kernel bss",
which are child nodes of System RAM.

I/O resource descriptor is set to 'desc' for entries that are
(and will be) target ranges of walk_iomem_res() and
region_intersects().

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: WANG Chao <chaowang@redhat.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-5-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-30 09:49:57 +01:00
Linus Torvalds
0ffedcda63 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "The main changes in this cycle were:

   - make the debugfs 'kernel_page_tables' file read-only, as it only
     has read ops.  (Borislav Petkov)

   - micro-optimize clflush_cache_range() (Chris Wilson)

   - swiotlb enhancements, which fixes certain KVM emulated devices
     (Igor Mammedov)

   - fix an LDT related debug message (Jan Beulich)

   - modularize CONFIG_X86_PTDUMP (Kees Cook)

   - tone down an overly alarming warning (Laura Abbott)

   - Mark variable __initdata (Rasmus Villemoes)

   - PAT additions (Toshi Kani)"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Micro-optimise clflush_cache_range()
  x86/mm/pat: Change free_memtype() to support shrinking case
  x86/mm/pat: Add untrack_pfn_moved for mremap
  x86/mm: Drop WARN from multi-BAR check
  x86/LDT: Print the real LDT base address
  x86/mm/64: Enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN
  x86/mm: Introduce max_possible_pfn
  x86/mm/ptdump: Make (debugfs)/kernel_page_tables read-only
  x86/mm/mtrr: Mark the 'range_new' static variable in mtrr_calc_range_state() as __initdata
  x86/mm: Turn CONFIG_X86_PTDUMP into a module
2016-01-11 17:16:01 -08:00
Igor Mammedov
8dd3303001 x86/mm: Introduce max_possible_pfn
max_possible_pfn will be used for tracking max possible
PFN for memory that isn't present in E820 table and
could be hotplugged later.

By default max_possible_pfn is initialized with max_pfn,
but later it could be updated with highest PFN of
hotpluggable memory ranges declared in ACPI SRAT table
if any present.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akataria@vmware.com
Cc: fujita.tomonori@lab.ntt.co.jp
Cc: konrad.wilk@oracle.com
Cc: pbonzini@redhat.com
Cc: revers@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1449234426-273049-2-git-send-email-imammedo@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-06 12:46:31 +01:00
Borislav Petkov
2d5be37d68 x86/microcode: Initialize the driver late when facilities are up
Running microcode_init() from setup_arch() is a bad idea because
not even kmalloc() is ready at that point and the loader does
all kinds of allocations and init/registration with various
subsystems.

Make it a late initcall when required facilities are initialized
so that the microcode driver initialization can succeed too.

Reported-and-tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151120112400.GC4028@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 10:39:49 +01:00
Krzysztof Mazur
68accac392 x86/setup: Fix low identity map for >= 2GB kernel range
The commit f5f3497cad extended the low identity mapping. However, if
the kernel uses more than 2 GB (VMSPLIT_2G_OPT or VMSPLIT_1G memory
split), the normal memory mapping is overwritten by the low identity
mapping causing a crash. To avoid overwritting, limit the low identity
map to cover only memory before kernel range (PAGE_OFFSET).

Fixes: f5f3497cad "x86/setup: Extend low identity map to cover whole kernel range
Signed-off-by: Krzysztof Mazur <krzysiek@podlesie.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Laszlo Ersek <lersek@redhat.com>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1446815916-22105-1-git-send-email-krzysiek@podlesie.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-11-07 10:39:40 +01:00
Linus Torvalds
b831ef2cad Merge branch 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS changes from Ingo Molnar:
 "The main system reliability related changes were from x86, but also
  some generic RAS changes:

   - AMD MCE error injection subsystem enhancements.  (Aravind
     Gopalakrishnan)

   - Fix MCE and CPU hotplug interaction bug.  (Ashok Raj)

   - kcrash bootup robustness fix.  (Baoquan He)

   - kcrash cleanups.  (Borislav Petkov)

   - x86 microcode driver rework: simplify it by unmodularizing it and
     other cleanups.  (Borislav Petkov)"

* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86/mce: Add a default case to the switch in __mcheck_cpu_ancient_init()
  x86/mce: Add a Scalable MCA vendor flags bit
  MAINTAINERS: Unify the microcode driver section
  x86/microcode/intel: Move #ifdef DEBUG inside the function
  x86/microcode/amd: Remove maintainers from comments
  x86/microcode: Remove modularization leftovers
  x86/microcode: Merge the early microcode loader
  x86/microcode: Unmodularize the microcode driver
  x86/mce: Fix thermal throttling reporting after kexec
  kexec/crash: Say which char is the unrecognized
  x86/setup/crash: Check memblock_reserve() retval
  x86/setup/crash: Cleanup some more
  x86/setup/crash: Remove alignment variable
  x86/setup: Cleanup crashkernel reservation functions
  x86/amd_nb, EDAC: Rename amd_get_node_id()
  x86/setup: Do not reserve crashkernel high memory if low reservation failed
  x86/microcode/amd: Do not overwrite final patch levels
  x86/microcode/amd: Extract current patch level read to a function
  x86/ras/mce_amd_inj: Inject bank 4 errors on the NBC
  x86/ras/mce_amd_inj: Trigger deferred and thresholding errors interrupts
  ...
2015-11-03 17:51:33 -08:00
Linus Torvalds
f5a8160c1e Merge branch 'core-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI changes from Ingo Molnar:
 "The main changes in this cycle were:

   - further EFI code generalization to make it more workable for ARM64
   - various extensions, such as 64-bit framebuffer address support,
     UEFI v2.5 EFI_PROPERTIES_TABLE support
   - code modularization simplifications and cleanups
   - new debugging parameters
   - various fixes and smaller additions"

* 'core-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  efi: Fix warning of int-to-pointer-cast on x86 32-bit builds
  efi: Use correct type for struct efi_memory_map::phys_map
  x86/efi: Fix kernel panic when CONFIG_DEBUG_VIRTUAL is enabled
  efi: Add "efi_fake_mem" boot option
  x86/efi: Rename print_efi_memmap() to efi_print_memmap()
  efi: Auto-load the efi-pstore module
  efi: Introduce EFI_NX_PE_DATA bit and set it from properties table
  efi: Add support for UEFIv2.5 Properties table
  efi: Add EFI_MEMORY_MORE_RELIABLE support to efi_md_typeattr_format()
  efifb: Add support for 64-bit frame buffer addresses
  efi/arm64: Clean up efi_get_fdt_params() interface
  arm64: Use core efi=debug instead of uefi_debug command line parameter
  efi/x86: Move efi=debug option parsing to core
  drivers/firmware: Make efi/esrt.c driver explicitly non-modular
  efi: Use the generic efi.memmap instead of 'memmap'
  acpi/apei: Use appropriate pgprot_t to map GHES memory
  arm64, acpi/apei: Implement arch_apei_get_mem_attributes()
  arm64/mm: Add PROT_DEVICE_nGnRnE and PROT_NORMAL_WT
  acpi, x86: Implement arch_apei_get_mem_attributes()
  efi, x86: Rearrange efi_mem_attributes()
  ...
2015-11-03 15:05:52 -08:00
Borislav Petkov
9a2bc335f1 x86/microcode: Unmodularize the microcode driver
Make CONFIG_MICROCODE a bool. It was practically a bool already anyway,
since early loader was forcing it to =y.

Regardless, there's no real reason to have something be a module which
gets built-in on the majority of installations out there. And its not
like there's noticeable change in functionality - we still can load late
microcode - just the module glue disappears.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1445334889-300-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:22:11 +02:00
Borislav Petkov
6f3760570e x86/setup/crash: Check memblock_reserve() retval
memblock_reserve() can fail but the crashkernel reservation code
doesn't check that and this can lead the user into believing
that the crashkernel region was actually reserved. Make sure we
check that return value and we exit early with a failure message
in the error case.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: WANG Chao <chaowang@redhat.com>
Cc: jerry_hoemann@hp.com
Link: http://lkml.kernel.org/r/1445246268-26285-7-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:10:56 +02:00
Borislav Petkov
f56d55781c x86/setup/crash: Cleanup some more
* Remove unused auto_set variable
* Cleanup local function variable declarations
* Reformat printk string and use pr_info()

No functionality change.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: WANG Chao <chaowang@redhat.com>
Cc: jerry_hoemann@hp.com
Link: http://lkml.kernel.org/r/1445246268-26285-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:10:56 +02:00
Borislav Petkov
606134f77c x86/setup/crash: Remove alignment variable
Use a macro instead. No functionality change.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: WANG Chao <chaowang@redhat.com>
Cc: jerry_hoemann@hp.com
Link: http://lkml.kernel.org/r/1445246268-26285-5-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:10:56 +02:00
Borislav Petkov
97eac21bab x86/setup: Cleanup crashkernel reservation functions
* Shorten variable names
* Realign code, space out for better readability

No code changed:

  # arch/x86/kernel/setup.o:

   text    data     bss     dec     hex filename
   4543    3096   69904   77543   12ee7 setup.o.before
   4543    3096   69904   77543   12ee7 setup.o.after

md5:
   8a1b7c6738a553ca207b56bd84a8f359  setup.o.before.asm
   8a1b7c6738a553ca207b56bd84a8f359  setup.o.after.asm

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: WANG Chao <chaowang@redhat.com>
Cc: jerry_hoemann@hp.com
Link: http://lkml.kernel.org/r/1445246268-26285-4-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:10:56 +02:00
Baoquan He
eb6db83d10 x86/setup: Do not reserve crashkernel high memory if low reservation failed
People reported that when allocating crashkernel memory using
the ",high" and ",low" syntax, there were cases where the
reservation of the high portion succeeds but the reservation of
the low portion fails.

Then kexec can load the kdump kernel successfully, but booting
the kdump kernel fails as there's no low memory.

The low memory allocation for the kdump kernel can fail on large
systems for a couple of reasons. For example, the manually
specified crashkernel low memory can be too large and thus no
adequate memblock region would be found.

Therefore, we try to reserve low memory for the crash kernel
*after* the high memory portion has been allocated. If that
fails, we free crashkernel high memory too and return. The user
can then take measures accordingly.

Tested-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Baoquan He <bhe@redhat.com>
[ Massage text. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Young <dyoung@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: WANG Chao <chaowang@redhat.com>
Cc: jerry_hoemann@hp.com
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/1445246268-26285-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:10:55 +02:00
Paolo Bonzini
f5f3497cad x86/setup: Extend low identity map to cover whole kernel range
On 32-bit systems, the initial_page_table is reused by
efi_call_phys_prolog as an identity map to call
SetVirtualAddressMap.  efi_call_phys_prolog takes care of
converting the current CPU's GDT to a physical address too.

For PAE kernels the identity mapping is achieved by aliasing the
first PDPE for the kernel memory mapping into the first PDPE
of initial_page_table.  This makes the EFI stub's trick "just work".

However, for non-PAE kernels there is no guarantee that the identity
mapping in the initial_page_table extends as far as the GDT; in this
case, accesses to the GDT will cause a page fault (which quickly becomes
a triple fault).  Fix this by copying the kernel mappings from
swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
identity mapping.

For some reason, this is only reproducible with QEMU's dynamic translation
mode, and not for example with KVM.  However, even under KVM one can clearly
see that the page table is bogus:

    $ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
    $ gdb
    (gdb) target remote localhost:1234
    (gdb) hb *0x02858f6f
    Hardware assisted breakpoint 1 at 0x2858f6f
    (gdb) c
    Continuing.

    Breakpoint 1, 0x02858f6f in ?? ()
    (gdb) monitor info registers
    ...
    GDT=     0724e000 000000ff
    IDT=     fffbb000 000007ff
    CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=00000690
    ...

The page directory is sane:

    (gdb) x/4wx 0x32b7000
    0x32b7000:	0x03398063	0x03399063	0x0339a063	0x0339b063
    (gdb) x/4wx 0x3398000
    0x3398000:	0x00000163	0x00001163	0x00002163	0x00003163
    (gdb) x/4wx 0x3399000
    0x3399000:	0x00400003	0x00401003	0x00402003	0x00403003

but our particular page directory entry is empty:

    (gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
    0x32b7070:	0x00000000

[ It appears that you can skate past this issue if you don't receive
  any interrupts while the bogus GDT pointer is loaded, or if you avoid
  reloading the segment registers in general.

  Andy Lutomirski provides some additional insight:

   "AFAICT it's entirely permissible for the GDTR and/or LDT
    descriptor to point to unmapped memory.  Any attempt to use them
    (segment loads, interrupts, IRET, etc) will try to access that memory
    as if the access came from CPL 0 and, if the access fails, will
    generate a valid page fault with CR2 pointing into the GDT or
    LDT."

  Up until commit 23a0d4e8fa ("efi: Disable interrupts around EFI
  calls, not in the epilog/prolog calls") interrupts were disabled
  around the prolog and epilog calls, and the functional GDT was
  re-installed before interrupts were re-enabled.

  Which explains why no one has hit this issue until now. ]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Laszlo Ersek <lersek@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
[ Updated changelog. ]
2015-10-16 10:52:29 +01:00
Ingo Molnar
790a2ee242 * Make the EFI System Resource Table (ESRT) driver explicitly
non-modular by ripping out the module_* code since Kconfig doesn't
    allow it to be built as a module anyway - Paul Gortmaker
 
  * Make the x86 efi=debug kernel parameter, which enables EFI debug
    code and output, generic and usable by arm64 - Leif Lindholm
 
  * Add support to the x86 EFI boot stub for 64-bit Graphics Output
    Protocol frame buffer addresses - Matt Fleming
 
  * Detect when the UEFI v2.5 EFI_PROPERTIES_TABLE feature is enabled
    in the firmware and set an efi.flags bit so the kernel knows when
    it can apply more strict runtime mapping attributes - Ard Biesheuvel
 
  * Auto-load the efi-pstore module on EFI systems, just like we
    currently do for the efivars module - Ben Hutchings
 
  * Add "efi_fake_mem" kernel parameter which allows the system's EFI
    memory map to be updated with additional attributes for specific
    memory ranges. This is useful for testing the kernel code that handles
    the EFI_MEMORY_MORE_RELIABLE memmap bit even if your firmware
    doesn't include support - Taku Izumi
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWG7OwAAoJEC84WcCNIz1VEEEP/0SsdrwJ66B4MfP5YNjqHYWm
 +OTHR6Ovv2i10kc+NjOV/GN8sWPndnkLfIfJ4EqJ9BoQ9PDEYZilV2aleSQ4DrPm
 H7uGwBXQkfd76tZKX9pMToK76mkhg6M7M2LR3Suv3OGfOEzuozAOt3Ez37lpksTN
 2ByhHr/oGbhu99jC2ki5+k0ySH8PMqDBRxqrPbBzTD+FfB7bM11vAJbSNbSMQ21R
 ZwX0acZBLqb9J2Vf7tDsW+fCfz0TFo8JHW8jdLRFm/y2dpquzxswkkBpODgA8+VM
 0F5UbiUdkaIRug75I6N/OJ8+yLwdzuxm7ul+tbS3JrXGLAlK3850+dP2Pr5zQ2Ce
 zaYGRUy+tD5xMXqOKgzpu+Ia8XnDRLhOlHabiRd5fG6ZC9nR8E9uK52g79voSN07
 pADAJnVB03CGV/HdduDOI4C4UykUKubuArbQVkqWJcecV1Jic/tYI0gjeACmU1VF
 v8FzXpBUe3U3A0jauOz8PBz8M+k5qky/GbIrnEvXreBtKdt999LN9fykTN7rBOpo
 dk/6vTR1Jyv3aYc9EXHmRluktI6KmfWCqmRBOIgQveX1VhdRM+1w2LKC0+8co3dF
 v/DBh19KDyfPI8eOvxKykhn164UeAt03EXqDa46wFGr2nVOm/JiShL/d+QuyYU4G
 8xb/rET4JrhCG4gFMUZ7
 =1Oee
 -----END PGP SIGNATURE-----

Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into core/efi

Pull v4.4 EFI updates from Matt Fleming:

  - Make the EFI System Resource Table (ESRT) driver explicitly
    non-modular by ripping out the module_* code since Kconfig doesn't
    allow it to be built as a module anyway. (Paul Gortmaker)

  - Make the x86 efi=debug kernel parameter, which enables EFI debug
    code and output, generic and usable by arm64. (Leif Lindholm)

  - Add support to the x86 EFI boot stub for 64-bit Graphics Output
    Protocol frame buffer addresses. (Matt Fleming)

  - Detect when the UEFI v2.5 EFI_PROPERTIES_TABLE feature is enabled
    in the firmware and set an efi.flags bit so the kernel knows when
    it can apply more strict runtime mapping attributes - Ard Biesheuvel

  - Auto-load the efi-pstore module on EFI systems, just like we
    currently do for the efivars module. (Ben Hutchings)

  - Add "efi_fake_mem" kernel parameter which allows the system's EFI
    memory map to be updated with additional attributes for specific
    memory ranges. This is useful for testing the kernel code that handles
    the EFI_MEMORY_MORE_RELIABLE memmap bit even if your firmware
    doesn't include support. (Taku Izumi)

Note: there is a semantic conflict between the following two commits:

  8a53554e12 ("x86/efi: Fix multiple GOP device support")
  ae2ee627dc ("efifb: Add support for 64-bit frame buffer addresses")

I fixed up the interaction in the merge commit, changing the type of
current_fb_base from u32 to u64.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-14 16:51:34 +02:00
Taku Izumi
0f96a99dab efi: Add "efi_fake_mem" boot option
This patch introduces new boot option named "efi_fake_mem".
By specifying this parameter, you can add arbitrary attribute
to specific memory range.
This is useful for debugging of Address Range Mirroring feature.

For example, if "efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000"
is specified, the original (firmware provided) EFI memmap will be
updated so that the specified memory regions have
EFI_MEMORY_MORE_RELIABLE attribute (0x10000):

 <original>
   efi: mem36: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000100000000-0x00000020a0000000) (129536MB)

 <updated>
   efi: mem36: [Conventional Memory|  |MR|  |  |  |   |WB|WT|WC|UC] range=[0x0000000100000000-0x0000000180000000) (2048MB)
   efi: mem37: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000180000000-0x00000010a0000000) (61952MB)
   efi: mem38: [Conventional Memory|  |MR|  |  |  |   |WB|WT|WC|UC] range=[0x00000010a0000000-0x0000001120000000) (2048MB)
   efi: mem39: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000001120000000-0x00000020a0000000) (63488MB)

And you will find that the following message is output:

   efi: Memory: 4096M/131455M mirrored memory

Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2015-10-12 14:20:09 +01:00
Dave Young
2965faa5e0 kexec: split kexec_load syscall from kexec core code
There are two kexec load syscalls, kexec_load another and kexec_file_load.
 kexec_file_load has been splited as kernel/kexec_file.c.  In this patch I
split kexec_load syscall code to kernel/kexec.c.

And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
use kexec_file_load only, or vice verse.

The original requirement is from Ted Ts'o, he want kexec kernel signature
being checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But kexec-tools use
kexec_load syscall can bypass the checking.

Vivek Goyal proposed to create a common kconfig option so user can compile
in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
KEXEC_CORE so that old config files still work.

Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
KEXEC_CORE in arch Kconfig.  Also updated general kernel code with to
kexec_load syscall.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Mark Salter
5dd2c4bded x86: use generic early mem copy
The early_ioremap library now has a generic copy_from_early_mem()
function.  Use the generic copy function for x86 relocate_initrd().

[akpm@linux-foundation.org: remove MAX_MAP_CHUNK define, per Yinghai Lu]
Signed-off-by: Mark Salter <msalter@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Paolo Pisati
949163015c x86/boot: Obsolete the MCA sys_desc_table
The kernel does not support the MCA bus anymroe, so mark sys_desc_table
as obsolete: remove any reference from the code together with the remaining
of MCA logic.

bloat-o-meter output:

  add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-55 (-55)
  function                                     old     new   delta
  i386_start_kernel                            128     119      -9
  setup_arch                                  1421    1375     -46

Signed-off-by: Paolo Pisati <p.pisati@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1437409430-8491-1-git-send-email-p.pisati@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-21 10:55:11 +02:00
Linus Torvalds
b1be9ead13 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Two FPU rewrite related fixes.  This addresses all known x86
  regressions at this stage.  Also some other misc fixes"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Fix boot crash in the early FPU code
  x86/asm/entry/64: Update path names
  x86/fpu: Fix FPU related boot regression when CPUID masking BIOS feature is enabled
  x86/boot/setup: Clean up the e820_reserve_setup_data() code
  x86/kaslr: Fix typo in the KASLR_FLAG documentation
2015-07-04 08:58:50 -07:00
Ingo Molnar
dc5fb575df Merge branch 'x86/boot' into x86/urgent
Merge branch that got ready.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-30 07:57:04 +02:00
Tony Luck
b05b9f5f9d x86, mirror: x86 enabling - find mirrored memory ranges
UEFI GetMemoryMap() uses a new attribute bit to mark mirrored memory
address ranges.  See UEFI 2.5 spec pages 157-158:

  http://www.uefi.org/sites/default/files/resources/UEFI%202_5.pdf

On EFI enabled systems scan the memory map and tell memblock about any
mirrored ranges.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Xiexiuqi <xiexiuqi@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:45 -07:00
Linus Torvalds
0faef837e4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching fixes from Jiri Kosina:

 - symbol lookup locking fix, from Miroslav Benes

 - error handling improvements in case of failure of the module coming
   notifier, from Minfei Huang

 - we were too pessimistic when kASLR has been enabled on x86 and were
   dropping address hints on the floor unnecessarily in such case.  Fix
   from Jiri Kosina

 - a few other small fixes and cleanups

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: add module locking around kallsyms calls
  livepatch: annotate klp_init() with __init
  livepatch: introduce patch/func-walking helpers
  livepatch: make kobject in klp_object statically allocated
  livepatch: Prevent patch inconsistencies if the coming module notifier fails
  livepatch: match return value to function signature
  x86: kaslr: fix build due to missing ALIGN definition
  livepatch: x86: make kASLR logic more accurate
  x86: introduce kaslr_offset()
2015-06-23 14:07:26 -07:00
Linus Torvalds
d70b3ef54c Merge branch 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Ingo Molnar:
 "There were so many changes in the x86/asm, x86/apic and x86/mm topics
  in this cycle that the topical separation of -tip broke down somewhat -
  so the result is a more traditional architecture pull request,
  collected into the 'x86/core' topic.

  The topics were still maintained separately as far as possible, so
  bisectability and conceptual separation should still be pretty good -
  but there were a handful of merge points to avoid excessive
  dependencies (and conflicts) that would have been poorly tested in the
  end.

  The next cycle will hopefully be much more quiet (or at least will
  have fewer dependencies).

  The main changes in this cycle were:

   * x86/apic changes, with related IRQ core changes: (Jiang Liu, Thomas
     Gleixner)

     - This is the second and most intrusive part of changes to the x86
       interrupt handling - full conversion to hierarchical interrupt
       domains:

          [IOAPIC domain]   -----
                                 |
          [MSI domain]      --------[Remapping domain] ----- [ Vector domain ]
                                 |   (optional)          |
          [HPET MSI domain] -----                        |
                                                         |
          [DMAR domain]     -----------------------------
                                                         |
          [Legacy domain]   -----------------------------

       This now reflects the actual hardware and allowed us to distangle
       the domain specific code from the underlying parent domain, which
       can be optional in the case of interrupt remapping.  It's a clear
       separation of functionality and removes quite some duct tape
       constructs which plugged the remap code between ioapic/msi/hpet
       and the vector management.

     - Intel IOMMU IRQ remapping enhancements, to allow direct interrupt
       injection into guests (Feng Wu)

   * x86/asm changes:

     - Tons of cleanups and small speedups, micro-optimizations.  This
       is in preparation to move a good chunk of the low level entry
       code from assembly to C code (Denys Vlasenko, Andy Lutomirski,
       Brian Gerst)

     - Moved all system entry related code to a new home under
       arch/x86/entry/ (Ingo Molnar)

     - Removal of the fragile and ugly CFI dwarf debuginfo annotations.
       Conversion to C will reintroduce many of them - but meanwhile
       they are only getting in the way, and the upstream kernel does
       not rely on them (Ingo Molnar)

     - NOP handling refinements. (Borislav Petkov)

   * x86/mm changes:

     - Big PAT and MTRR rework: making the code more robust and
       preparing to phase out exposing direct MTRR interfaces to drivers -
       in favor of using PAT driven interfaces (Toshi Kani, Luis R
       Rodriguez, Borislav Petkov)

     - New ioremap_wt()/set_memory_wt() interfaces to support
       Write-Through cached memory mappings.  This is especially
       important for good performance on NVDIMM hardware (Toshi Kani)

   * x86/ras changes:

     - Add support for deferred errors on AMD (Aravind Gopalakrishnan)

       This is an important RAS feature which adds hardware support for
       poisoned data.  That means roughly that the hardware marks data
       which it has detected as corrupted but wasn't able to correct, as
       poisoned data and raises an APIC interrupt to signal that in the
       form of a deferred error.  It is the OS's responsibility then to
       take proper recovery action and thus prolonge system lifetime as
       far as possible.

     - Add support for Intel "Local MCE"s: upcoming CPUs will support
       CPU-local MCE interrupts, as opposed to the traditional system-
       wide broadcasted MCE interrupts (Ashok Raj)

     - Misc cleanups (Borislav Petkov)

   * x86/platform changes:

     - Intel Atom SoC updates

  ... and lots of other cleanups, fixlets and other changes - see the
  shortlog and the Git log for details"

* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (222 commits)
  x86/hpet: Use proper hpet device number for MSI allocation
  x86/hpet: Check for irq==0 when allocating hpet MSI interrupts
  x86/mm/pat, drivers/infiniband/ipath: Use arch_phys_wc_add() and require PAT disabled
  x86/mm/pat, drivers/media/ivtv: Use arch_phys_wc_add() and require PAT disabled
  x86/platform/intel/baytrail: Add comments about why we disabled HPET on Baytrail
  genirq: Prevent crash in irq_move_irq()
  genirq: Enhance irq_data_to_desc() to support hierarchy irqdomain
  iommu, x86: Properly handle posted interrupts for IOMMU hotplug
  iommu, x86: Provide irq_remapping_cap() interface
  iommu, x86: Setup Posted-Interrupts capability for Intel iommu
  iommu, x86: Add cap_pi_support() to detect VT-d PI capability
  iommu, x86: Avoid migrating VT-d posted interrupts
  iommu, x86: Save the mode (posted or remapped) of an IRTE
  iommu, x86: Implement irq_set_vcpu_affinity for intel_ir_chip
  iommu: dmar: Provide helper to copy shared irte fields
  iommu: dmar: Extend struct irte for VT-d Posted-Interrupts
  iommu: Add new member capability to struct irq_remap_ops
  x86/asm/entry/64: Disentangle error_entry/exit gsbase/ebx/usermode code
  x86/asm/entry/32: Shorten __audit_syscall_entry() args preparation
  x86/asm/entry/32: Explain reloading of registers after __audit_syscall_entry()
  ...
2015-06-22 17:59:09 -07:00
Joerg Roedel
94fb933418 x86/crash: Allocate enough low memory when crashkernel=high
When the crash kernel is loaded above 4GiB in memory, the
first kernel allocates only 72MiB of low-memory for the DMA
requirements of the second kernel. On systems with many
devices this is not enough and causes device driver
initialization errors and failed crash dumps. Testing by
SUSE and Redhat has shown that 256MiB is a good default
value for now and the discussion has lead to this value as
well. So set this default value to 256MiB to make sure there
is enough memory available for DMA.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
[ Reflow comment. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jörg Rödel <joro@8bytes.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: kexec@lists.infradead.org
Link: http://lkml.kernel.org/r/1433500202-25531-4-git-send-email-joro@8bytes.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-11 08:28:39 +02:00
Wei Yang
f2af7d25b4 x86/boot/setup: Clean up the e820_reserve_setup_data() code
Deobfuscate the 'found' logic, it can be replaced with a simple:

	if (!pa_data)
		return;

and 'found' can be eliminated.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1433398729-8314-1-git-send-email-weiyang@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-05 13:53:22 +02:00
Jiri Kosina
4545c89880 x86: introduce kaslr_offset()
Offset that has been chosen for kaslr during kernel decompression can be
easily computed as a difference between _text and __START_KERNEL. We are
already making use of this in dump_kernel_offset() notifier and in
arch_crash_save_vmcoreinfo().

Introduce kaslr_offset() that makes this computation instead of hard-coding
it, so that other kernel code (such as live patching) can make use of it.
Also convert existing users to make use of it.

This patch is equivalent transofrmation without any effects on the resulting
code:

	$ diff -u vmlinux.old.asm vmlinux.new.asm
	--- vmlinux.old.asm     2015-04-28 17:55:19.520983368 +0200
	+++ vmlinux.new.asm     2015-04-28 17:55:24.141206072 +0200
	@@ -1,5 +1,5 @@

	-vmlinux.old:     file format elf64-x86-64
	+vmlinux.new:     file format elf64-x86-64

	Disassembly of section .text:
	$

Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-04-29 16:51:33 +02:00
Thomas Gleixner
ca1b88622e x86: Remove more unmodified io_apic_ops
io_apic_ops.init() is either NULL, if IO-APIC support is disabled at
compile time or native_io_apic_init_mappings(). No point to have that
as we can achieve the same thing with an empty inline.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-24 15:36:54 +02:00
Linus Torvalds
6cf78d4b37 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm changes from Ingo Molnar:
 "The main changes in this cycle were:

   - reduce the x86/32 PAE per task PGD allocation overhead from 4K to
     0.032k (Fenghua Yu)

   - early_ioremap/memunmap() usage cleanups (Juergen Gross)

   - gbpages support cleanups (Luis R Rodriguez)

   - improve AMD Bulldozer (family 0x15) ASLR I$ aliasing workaround to
     increase randomization by 3 bits (per bootup) (Hector
     Marco-Gisbert)

   - misc fixlets"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Improve AMD Bulldozer ASLR workaround
  x86/mm/pat: Initialize __cachemode2pte_tbl[] and __pte2cachemode_tbl[] in a bit more readable fashion
  init.h: Clean up the __setup()/early_param() macros
  x86/mm: Simplify probe_page_size_mask()
  x86/mm: Further simplify 1 GB kernel linear mappings handling
  x86/mm: Use early_param_on_off() for direct_gbpages
  init.h: Add early_param_on_off()
  x86/mm: Simplify enabling direct_gbpages
  x86/mm: Use IS_ENABLED() for direct_gbpages
  x86/mm: Unexport set_memory_ro() and set_memory_rw()
  x86/mm, efi: Use early_ioremap() in arch/x86/platform/efi/efi-bgrt.c
  x86/mm: Use early_memunmap() instead of early_iounmap()
  x86/mm/pat: Ensure different messages in STRICT_DEVMEM and PAT cases
  x86/mm: Reduce PAE-mode per task pgd allocation overhead from 4K to 32 bytes
2015-04-13 13:31:32 -07:00
Borislav Petkov
78cac48c04 x86/mm/KASLR: Propagate KASLR status to kernel proper
Commit:

  e2b32e6785 ("x86, kaslr: randomize module base load address")

made module base address randomization unconditional and didn't regard
disabled KKASLR due to CONFIG_HIBERNATION and command line option
"nokaslr". For more info see (now reverted) commit:

  f47233c2d3 ("x86/mm/ASLR: Propagate base load address calculation")

In order to propagate KASLR status to kernel proper, we need a single bit
in boot_params.hdr.loadflags and we've chosen bit 1 thus leaving the
top-down allocated bits for bits supposed to be used by the bootloader.

Originally-From: Jiri Kosina <jkosina@suse.cz>
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03 15:26:15 +02:00
Borislav Petkov
69797dafe3 Revert "x86/mm/ASLR: Propagate base load address calculation"
This reverts commit:

  f47233c2d3 ("x86/mm/ASLR: Propagate base load address calculation")

The main reason for the revert is that the new boot flag does not work
at all currently, and in order to make this work, we need non-trivial
changes to the x86 boot code which we didn't manage to get done in
time for merging.

And even if we did, they would've been too risky so instead of
rushing things and break booting 4.1 on boxes left and right, we
will be very strict and conservative and will take our time with
this to fix and test it properly.

Reported-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: H. Peter Anvin <hpa@linux.intel.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Junjie Mao <eternal.n08@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/20150316100628.GD22995@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-16 11:18:21 +01:00
Juergen Gross
8d4a40bc06 x86/mm: Use early_memunmap() instead of early_iounmap()
Memory mapped via early_memremap() should be unmapped with
early_memunmap() instead of early_iounmap().

Signed-off-by: Juergen Gross <jgross@suse.com>
Cc: matt.fleming@intel.com
Link: http://lkml.kernel.org/r/1424769211-11378-2-git-send-email-jgross@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-24 15:58:06 +01:00
Linus Torvalds
5fbe4c224c Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
 "This contains:

   - EFI fixes
   - a boot printout fix
   - ASLR/kASLR fixes
   - intel microcode driver fixes
   - other misc fixes

  Most of the linecount comes from an EFI revert"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/ASLR: Avoid PAGE_SIZE redefinition for UML subarch
  x86/microcode/intel: Handle truncated microcode images more robustly
  x86/microcode/intel: Guard against stack overflow in the loader
  x86, mm/ASLR: Fix stack randomization on 64-bit systems
  x86/mm/init: Fix incorrect page size in init_memory_mapping() printks
  x86/mm/ASLR: Propagate base load address calculation
  Documentation/x86: Fix path in zero-page.txt
  x86/apic: Fix the devicetree build in certain configs
  Revert "efi/libstub: Call get_memory_map() to obtain map and desc sizes"
  x86/efi: Avoid triple faults during EFI mixed mode calls
2015-02-21 10:41:29 -08:00
Ingo Molnar
a267b0a349 Merge branch 'tip-x86-kaslr' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/urgent
Pull ASLR and kASLR fixes from Borislav Petkov:

  - Add a global flag announcing KASLR state so that relevant code can do
    informed decisions based on its setting. (Jiri Kosina)

  - Fix a stack randomization entropy decrease bug. (Hector Marco-Gisbert)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-19 12:31:34 +01:00
Jiri Kosina
f47233c2d3 x86/mm/ASLR: Propagate base load address calculation
Commit:

  e2b32e6785 ("x86, kaslr: randomize module base load address")

makes the base address for module to be unconditionally randomized in
case when CONFIG_RANDOMIZE_BASE is defined and "nokaslr" option isn't
present on the commandline.

This is not consistent with how choose_kernel_location() decides whether
it will randomize kernel load base.

Namely, CONFIG_HIBERNATION disables kASLR (unless "kaslr" option is
explicitly specified on kernel commandline), which makes the state space
larger than what module loader is looking at. IOW CONFIG_HIBERNATION &&
CONFIG_RANDOMIZE_BASE is a valid config option, kASLR wouldn't be applied
by default in that case, but module loader is not aware of that.

Instead of fixing the logic in module.c, this patch takes more generic
aproach. It introduces a new bootparam setup data_type SETUP_KASLR and
uses that to pass the information whether kaslr has been applied during
kernel decompression, and sets a global 'kaslr_enabled' variable
accordingly, so that any kernel code (module loading, livepatching, ...)
can make decisions based on its value.

x86 module loader is converted to make use of this flag.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Link: https://lkml.kernel.org/r/alpine.LNX.2.00.1502101411280.10719@pobox.suse.cz
[ Always dump correct kaslr status when panicking ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19 11:38:54 +01:00
Linus Torvalds
37507717de Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf updates from Ingo Molnar:
 "This series tightens up RDPMC permissions: currently even highly
  sandboxed x86 execution environments (such as seccomp) have permission
  to execute RDPMC, which may leak various perf events / PMU state such
  as timing information and other CPU execution details.

  This 'all is allowed' RDPMC mode is still preserved as the
  (non-default) /sys/devices/cpu/rdpmc=2 setting.  The new default is
  that RDPMC access is only allowed if a perf event is mmap-ed (which is
  needed to correctly interpret RDPMC counter values in any case).

  As a side effect of these changes CR4 handling is cleaned up in the
  x86 code and a shadow copy of the CR4 value is added.

  The extra CR4 manipulation adds ~ <50ns to the context switch cost
  between rdpmc-capable and rdpmc-non-capable mms"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks
  perf/x86: Only allow rdpmc if a perf_event is mapped
  perf: Pass the event to arch_perf_update_userpage()
  perf: Add pmu callbacks to track event mapping and unmapping
  x86: Add a comment clarifying LDT context switching
  x86: Store a per-cpu shadow copy of CR4
  x86: Clean up cr4 manipulation
2015-02-16 14:58:12 -08:00
Andrey Ryabinin
ef7f0d6a6c x86_64: add KASan support
This patch adds arch specific code for kernel address sanitizer.

16TB of virtual addressed used for shadow memory.  It's located in range
[ffffec0000000000 - fffffc0000000000] between vmemmap and %esp fixup
stacks.

At early stage we map whole shadow region with zero page.  Latter, after
pages mapped to direct mapping address range we unmap zero pages from
corresponding shadow (see kasan_map_shadow()) and allocate and map a real
shadow memory reusing vmemmap_populate() function.

Also replace __pa with __pa_nodebug before shadow initialized.  __pa with
CONFIG_DEBUG_VIRTUAL=y make external function call (__phys_addr)
__phys_addr is instrumented, so __asan_load could be called before shadow
area initialized.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
Cc: Yuri Gribov <tetra2005@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 21:21:41 -08:00
Andy Lutomirski
1e02ce4ccc x86: Store a per-cpu shadow copy of CR4
Context switches and TLB flushes can change individual bits of CR4.
CR4 reads take several cycles, so store a shadow copy of CR4 in a
per-cpu variable.

To avoid wasting a cache line, I added the CR4 shadow to
cpu_tlbstate, which is already touched in switch_mm.  The heaviest
users of the cr4 shadow will be switch_mm and __switch_to_xtra, and
__switch_to_xtra is called shortly after switch_mm during context
switch, so the cacheline is likely to be hot.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 12:10:42 +01:00
WANG Chao
7389882c81 x86, setup: Let early_memremap() handle page alignment
early_memremap() takes care of page alignment and map size, so we can
just remap the required data size and get rid of the adjustments in
the setup code.

[tglx: Massaged changelog ]

Signed-off-by: WANG Chao <chaowang@redhat.com>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Bryan O'Donoghue <pure.logic@nexus-software.ie>
Link: http://lkml.kernel.org/r/1420628150-16872-1-git-send-email-chaowang@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-01-23 16:14:26 +01:00
Linus Torvalds
3100e448e7 Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vdso updates from Ingo Molnar:
 "Various vDSO updates from Andy Lutomirski, mostly cleanups and
  reorganization to improve maintainability, but also some
  micro-optimizations and robustization changes"

* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86_64/vsyscall: Restore orig_ax after vsyscall seccomp
  x86_64: Add a comment explaining the TASK_SIZE_MAX guard page
  x86_64,vsyscall: Make vsyscall emulation configurable
  x86_64, vsyscall: Rewrite comment and clean up headers in vsyscall code
  x86_64, vsyscall: Turn vsyscalls all the way off when vsyscall==none
  x86,vdso: Use LSL unconditionally for vgetcpu
  x86: vdso: Fix build with older gcc
  x86_64/vdso: Clean up vgetcpu init and merge the vdso initcalls
  x86_64/vdso: Remove jiffies from the vvar page
  x86/vdso: Make the PER_CPU segment 32 bits
  x86/vdso: Make the PER_CPU segment start out accessed
  x86/vdso: Change the PER_CPU segment to use struct desc_struct
  x86_64/vdso: Move getcpu code from vsyscall_64.c to vdso/vma.c
  x86_64/vsyscall: Move all of the gate_area code to vsyscall_64.c
2014-12-10 14:24:20 -08:00
Dave Hansen
fe3d197f84 x86, mpx: On-demand kernel allocation of bounds tables
This is really the meat of the MPX patch set.  If there is one patch to
review in the entire series, this is the one.  There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner.  (small FAQ below).

Long Description:

This patch adds two prctl() commands to provide enable or disable the
management of bounds tables in kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers.  The prctl() is an
explicit signal from userspace.

PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
require kernel's help in managing bounds tables.

PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't
want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.

PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (->bd_addr) in  the 'mm_struct'.  PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address.  Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.

Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive.  Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.

==== Why does the hardware even have these Bounds Tables? ====

MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".

They are similar conceptually to a page fault and will be raised by
the MPX hardware during both bounds violations or when the tables
are not present. This patch handles those #BR exceptions for
not-present tables by carving the space out of the normal processes
address space (essentially calling the new mmap() interface indroduced
earlier in this patch set.) and then pointing the bounds-directory
over to it.

The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.

==== Why not do this in userspace? ====

This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real-world, but here they are.

Q: Can virtual space simply be reserved for the bounds tables so
   that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
   area needs 4*X GB of virtual space, plus 2GB for the bounds
   directory. If we were to preallocate them for the 128TB of
   user virtual address space, we would need to reserve 512TB+2GB,
   which is larger than the entire virtual address space today.
   This means they can not be reserved ahead of time. Also, a
   single process's pre-popualated bounds directory consumes 2GB
   of virtual *AND* physical memory. IOW, it's completely
   infeasible to prepopulate bounds directories.

Q: Can we preallocate bounds table space at the same time memory
   is allocated which might contain pointers that might eventually
   need bounds tables?
A: This would work if we could hook the site of each and every
   memory allocation syscall. This can be done for small,
   constrained applications. But, it isn't practical at a larger
   scale since a given app has no way of controlling how all the
   parts of the app might allocate memory (think libraries). The
   kernel is really the only place to intercept these calls.

Q: Could a bounds fault be handed to userspace and the tables
   allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
   handler functions and even if mmap() would work it still
   requires locking or nasty tricks to keep track of the
   allocation state there.

Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.

Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-18 00:58:53 +01:00
Andy Lutomirski
1ad83c858c x86_64,vsyscall: Make vsyscall emulation configurable
This adds CONFIG_X86_VSYSCALL_EMULATION, guarded by CONFIG_EXPERT.
Turning it off completely disables vsyscall emulation, saving ~3.5k
for vsyscall_64.c, 4k for vsyscall_emu_64.S (the fake vsyscall
page), some tiny amount of core mm code that supports a gate area,
and possibly 4k for a wasted pagetable.  The latter is because the
vsyscall addresses are misaligned and fit poorly in the fixmap.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/406db88b8dd5f0cbbf38216d11be34bbb43c7eae.1414618407.git.luto@amacapital.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-03 21:44:57 +01:00
Weijie Yang
3c325f8233 x86, cma: Reserve DMA contiguous area after initmem_init()
Fengguang Wu reported a boot crash on the x86 platform
via the 0-day Linux Kernel Performance Test:

  cma: dma_contiguous_reserve: reserving 31 MiB for global area
  BUG: Int 6: CR2   (null)
  [<41850786>] dump_stack+0x16/0x18
  [<41d2b1db>] early_idt_handler+0x6b/0x6b
  [<41072227>] ? __phys_addr+0x2e/0xca
  [<41d4ee4d>] cma_declare_contiguous+0x3c/0x2d7
  [<41d6d359>] dma_contiguous_reserve_area+0x27/0x47
  [<41d6d4d1>] dma_contiguous_reserve+0x158/0x163
  [<41d33e0f>] setup_arch+0x79b/0xc68
  [<41d2b7cf>] start_kernel+0x9c/0x456
  [<41d2b2ca>] i386_start_kernel+0x79/0x7d

(See details at: https://lkml.org/lkml/2014/10/8/708)

It is because dma_contiguous_reserve() is called before
initmem_init() in x86, the variable high_memory is not
initialized but accessed by __pa(high_memory) in
dma_contiguous_reserve().

This patch moves dma_contiguous_reserve() after initmem_init()
so that high_memory is initialized before accessed.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: iamjoonsoo.kim@lge.com
Cc: 'Linux-MM' <linux-mm@kvack.org>
Cc: 'Weijie Yang' <weijie.yang.kh@gmail.com>
Link: http://lkml.kernel.org/r/000101cfef69%2431e528a0%2495af79e0%24%25yang@samsung.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 07:36:50 +01:00
Bryan O'Donoghue
2075244f9b x86: Quark: Comment setup_arch() to document TLB/PGE bug
Quark SoC X1000 advertises Page Global Enable for it's
Translation Lookaside Buffer via cpuid. The silicon does not
in fact support PGE and hence will not flush the TLB when CR4.PGE
is rewritten. The Quark documentation makes clear the necessity to
instead rewrite CR3 in order to flush any TLB entries, irrespective
of the state of CR4.PGE or an individual PTE.PGE

See Intel Quark Core DevMan_001.pdf section 6.4.11

In setup.c setup_arch() the code will load_cr3() and then do a
__flush_tlb_all().

On Quark the entire TLB will be flushed at the load_cr3().
The __flush_tlb_all() have no effect and can be safely ignored.

Later on in the boot process we switch off the flag for cpu_has_pge()
which means that subsequent calls to __flush_tlb_all() will
call __flush_tlb() not __flush_tlb_global() flushing the TLB in the
correct way via load_cr3() not CR4.PGE rewrite

This patch documents the behaviour of flushing the TLB for Quark in
setup_arch()

Comment text suggested by Thomas Gleixner

Signed-off-by: Bryan O'Donoghue <pure.logic@nexus-software.ie>
Cc: davej@redhat.com
Cc: hmh@hmh.eng.br
Link: http://lkml.kernel.org/r/1412641189-12415-2-git-send-email-pure.logic@nexus-software.ie
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-10-08 10:07:46 +02:00
Daniel Kiper
9402973d49 arch/x86: Replace plain strings with constants
We've got constants, so let's use them instead of hard-coded values.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-07-18 21:23:59 +01:00
Akinobu Mita
5ea3b1b2f8 cma: add placement specifier for "cma=" kernel parameter
Currently, "cma=" kernel parameter is used to specify the size of CMA,
but we can't specify where it is located.  We want to locate CMA below
4GB for devices only supporting 32-bit addressing on 64-bit systems
without iommu.

This enables to specify the placement of CMA by extending "cma=" kernel
parameter.

Examples:
 1. locate 64MB CMA below 4GB by "cma=64M@0-4G"
 2. locate 64MB CMA exact at 512MB by "cma=64M@512M"

Note that the DMA contiguous memory allocator on x86 assumes that
page_address() works for the pages to allocate.  So this change requires
to limit end address of contiguous memory area upto max_pfn_mapped to
prevent from locating it on highmem area by the argument of
dma_contiguous_reserve().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:53:57 -07:00
Linus Torvalds
467cbd207a Merge branch 'x86-nuke-platforms-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 old platform removal from Peter Anvin:
 "This patchset removes support for several completely obsolete
  platforms, where the maintainers either have completely vanished or
  acked the removal.  For some of them it is questionable if there even
  exists functional specimens of the hardware"

Geert Uytterhoeven apparently thought this was a April Fool's pull request ;)

* 'x86-nuke-platforms-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, platforms: Remove NUMAQ
  x86, platforms: Remove SGI Visual Workstation
  x86, apic: Remove support for IBM Summit/EXA chipset
  x86, apic: Remove support for ia32-based Unisys ES7000
2014-04-02 13:15:58 -07:00
Matt Fleming
4fd69331ad Merge remote-tracking branch 'tip/x86/urgent' into efi-for-mingo
Conflicts:
	arch/x86/include/asm/efi.h
2014-03-05 17:31:41 +00:00
Borislav Petkov
a5d90c923b x86/efi: Quirk out SGI UV
Alex reported hitting the following BUG after the EFI 1:1 virtual
mapping work was merged,

 kernel BUG at arch/x86/mm/init_64.c:351!
 invalid opcode: 0000 [#1] SMP
 Call Trace:
  [<ffffffff818aa71d>] init_extra_mapping_uc+0x13/0x15
  [<ffffffff818a5e20>] uv_system_init+0x22b/0x124b
  [<ffffffff8108b886>] ? clockevents_register_device+0x138/0x13d
  [<ffffffff81028dbb>] ? setup_APIC_timer+0xc5/0xc7
  [<ffffffff8108b620>] ? clockevent_delta2ns+0xb/0xd
  [<ffffffff818a3a92>] ? setup_boot_APIC_clock+0x4a8/0x4b7
  [<ffffffff8153d955>] ? printk+0x72/0x74
  [<ffffffff818a1757>] native_smp_prepare_cpus+0x389/0x3d6
  [<ffffffff818957bc>] kernel_init_freeable+0xb7/0x1fb
  [<ffffffff81535530>] ? rest_init+0x74/0x74
  [<ffffffff81535539>] kernel_init+0x9/0xff
  [<ffffffff81541dfc>] ret_from_fork+0x7c/0xb0
  [<ffffffff81535530>] ? rest_init+0x74/0x74

Getting this thing to work with the new mapping scheme would need more
work, so automatically switch to the old memmap layout for SGI UV.

Acked-by: Russ Anderson <rja@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-03-04 23:43:33 +00:00
Matt Fleming
3e90959921 efi: Move facility flags to struct efi
As we grow support for more EFI architectures they're going to want the
ability to query which EFI features are available on the running system.
Instead of storing this information in an architecture-specific place,
stick it in the global 'struct efi', which is already the central
location for EFI state.

While we're at it, let's change the return value of efi_enabled() to be
bool and replace all references to 'facility' with 'feature', which is
the usual word used to describe the attributes of the running system.

Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-03-04 16:16:16 +00:00
H. Peter Anvin
c5f9ee3d66 x86, platforms: Remove SGI Visual Workstation
The SGI Visual Workstation seems to be dead; remove support so we
don't have to continue maintaining it.

Cc: Andrey Panin <pazke@donpac.ru>
Cc: Michael Reed <mdr@sgi.com>
Link: http://lkml.kernel.org/r/530CFD6C.7040705@zytor.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-27 08:07:39 -08:00
Yinghai Lu
4ce7a8697c x86: revert wrong memblock current limit setting
Dave reported big numa system booting is broken.

It turns out that commit 5b6e529521 ("x86: memblock: set current limit
to max low memory address") sets the limit to low wrongly.

max_low_pfn_mapped is different from max_pfn_mapped.
max_low_pfn_mapped is always under 4G.

That will memblock_alloc_nid all go under 4G.

Revert 5b6e529521 to fix a no-boot regression which was triggered by
457ff1de2d ("lib/swiotlb.c: use memblock apis for early memory
allocations").

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27 21:02:38 -08:00
Santosh Shilimkar
5b6e529521 x86: memblock: set current limit to max low memory address
The memblock current limit value is used to limit early boot memory
allocations below max low memory address by default, as the kernel can
access only to the low memory.

Hence, set memblock current limit value to the max mapped low memory
address instead of max mapped memory address.

Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Paul Walmsley <paul@pwsan.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Tony Lindgren <tony@atomide.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:46 -08:00
Linus Torvalds
f4bcd8ccdd Merge branch 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 kernel address space randomization support from Peter Anvin:
 "This enables kernel address space randomization for x86"

* 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, kaslr: Clarify RANDOMIZE_BASE_MAX_OFFSET
  x86, kaslr: Remove unused including <linux/version.h>
  x86, kaslr: Use char array to gain sizeof sanity
  x86, kaslr: Add a circular multiply for better bit diffusion
  x86, kaslr: Mix entropy sources together as needed
  x86/relocs: Add percpu fixup for GNU ld 2.23
  x86, boot: Rename get_flags() and check_flags() to *_cpuflags()
  x86, kaslr: Raise the maximum virtual address to -1 GiB on x86_64
  x86, kaslr: Report kernel offset on panic
  x86, kaslr: Select random position from e820 maps
  x86, kaslr: Provide randomness functions
  x86, kaslr: Return location from decompress_kernel
  x86, boot: Move CPU flags out of cpucheck
  x86, relocs: Add more per-cpu gold special cases
2014-01-20 14:45:50 -08:00
Linus Torvalds
2bb2c5e235 Merge branch 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loader updates from Ingo Molnar:
 "There are two main changes in this tree:

   - AMD microcode early loading fixes
   - some microcode loader source files reorganization"

* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, microcode: Move to a proper location
  x86, microcode, AMD: Fix early ucode loading
  x86, microcode: Share native MSR accessing variants
  x86, ramdisk: Export relocated ramdisk VA
2014-01-20 12:07:54 -08:00
Borislav Petkov
5aa3d718f2 x86, ramdisk: Export relocated ramdisk VA
The ramdisk can possibly get relocated if the whole image is not mapped.
And since we're going over it in the microcode loader and fishing out
the relevant microcode patches, we want access it at its new location.
Thus, export it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
2014-01-13 19:56:25 +01:00
Dave Young
77ea8c9489 x86: Reserve setup_data ranges late after parsing memmap cmdline
Currently e820_reserve_setup_data() is called before parsing early
params, it works in normal case. But for memmap=exactmap, the final
memory ranges are created after parsing memmap= cmdline params, so the
previous e820_reserve_setup_data() has no effect. For example,
setup_data ranges will still be marked as normal system ram, thus when
later sysfs driver ioremap them kernel will warn about mapping normal
ram.

This patch fix it by moving the e820_reserve_setup_data() callback after
parsing early params so they can be set as reserved ranges and later
ioremap will be fine with it.

Signed-off-by: Dave Young <dyoung@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-12-29 13:09:07 +00:00
Dave Young
1fec053369 x86/efi: Pass necessary EFI data for kexec via setup_data
Add a new setup_data type SETUP_EFI for kexec use.  Passing the saved
fw_vendor, runtime, config tables and EFI runtime mappings.

When entering virtual mode, directly mapping the EFI runtime regions
which we passed in previously. And skip the step to call
SetVirtualAddressMap().

Specially for HP z420 workstation we need save the smbios physical
address.  The kernel boot sequence proceeds in the following order.
Step 2 requires efi.smbios to be the physical address.  However, I found
that on HP z420 EFI system table has a virtual address of SMBIOS in step
1.  Hence, we need set it back to the physical address with the smbios
in efi_setup_data.  (When it is still the physical address, it simply
sets the same value.)

1. efi_init() - Set efi.smbios from EFI system table
2. dmi_scan_machine() - Temporary map efi.smbios to access SMBIOS table
3. efi_enter_virtual_mode() - Map EFI ranges

Tested on ovmf+qemu, lenovo thinkpad, a dell laptop and an
HP z420 workstation.

Signed-off-by: Dave Young <dyoung@redhat.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-12-29 13:09:05 +00:00
Tang Chen
fa591c4ae7 x86, acpi, crash, kdump: do reserve_crashkernel() after SRAT is parsed.
Memory reserved for crashkernel could be large.  So we should not allocate
this memory bottom up from the end of kernel image.

When SRAT is parsed, we will be able to know which memory is hotpluggable,
and we can avoid allocating this memory for the kernel.  So reorder
reserve_crashkernel() after SRAT is parsed.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:08 +09:00
Chen, Gong
dd6dad4288 DMI: Parse memory device (type 17) in SMBIOS
This patch adds a new interface to decode memory device (type 17)
to help error reporting on DIMMs.

Original-author: Tony Luck <tony.luck@intel.com>
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-10-23 10:10:12 -07:00
Kees Cook
f32360ef66 x86, kaslr: Report kernel offset on panic
When the system panics, include the kernel offset in the report to assist
in debugging.

Signed-off-by: Kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/r/1381450698-28710-6-git-send-email-keescook@chromium.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-10-13 03:12:24 -07:00
Linus Torvalds
cb3e4330e6 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm changes from Ingo Molnar:
 "Misc smaller fixes:

   - a parse_setup_data() boot crash fix

   - a memblock and an __early_ioremap cleanup

   - turn the always-on CONFIG_ARCH_MEMORY_PROBE=y into a configurable
     option and turn it off - it's an unrobust debug facility, it
     shouldn't be enabled by default"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: avoid remapping data in parse_setup_data()
  x86: Use memblock_set_current_limit() to set limit for memblock.
  mm: Remove unused variable idx0 in __early_ioremap()
  mm/hotplug, x86: Disable ARCH_MEMORY_PROBE by default
2013-09-04 09:39:26 -07:00
Linn Crosetto
30e46b574a x86: avoid remapping data in parse_setup_data()
Type SETUP_PCI, added by setup_efi_pci(), may advertise a ROM size
larger than early_memremap() is able to handle, which is currently
limited to 256kB. If this occurs it leads to a NULL dereference in
parse_setup_data().

To avoid this, remap the setup_data header and allow parsing functions
for individual types to handle their own data remapping.

Signed-off-by: Linn Crosetto <linn@hp.com>
Link: http://lkml.kernel.org/r/1376430401-67445-1-git-send-email-linn@hp.com
Acked-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-13 23:29:19 -07:00
Tang Chen
2449f343e4 x86: Use memblock_set_current_limit() to set limit for memblock.
In setup_arch() of x86, it set memblock.current_limit directly.
We should use memblock_set_current_limit(). If the implementation
is changed, it is easy to maintain.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Link: http://lkml.kernel.org/r/1376451844-15682-1-git-send-email-tangchen@cn.fujitsu.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-13 21:27:02 -07:00
Andi Kleen
277d5b40b7 x86, asmlinkage: Make several variables used from assembler/linker script visible
Plus one function, load_gs_index().

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1375740170-7446-10-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-06 14:20:13 -07:00
Paul Gortmaker
148f9bb877 x86: delete __cpuinit usage from all x86 files
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications.  For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out.  Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.

Note that some harmless section mismatch warnings may result, since
notify_cpu_starting() and cpu_up() are arch independent (kernel/cpu.c)
are flagged as __cpuinit  -- so if we remove the __cpuinit from
arch specific callers, we will also get section mismatch warnings.
As an intermediate step, we intend to turn the linux/init.h cpuinit
content into no-ops as early as possible, since that will get rid
of these warnings.  In any case, they are temporary and harmless.

This removes all the arch/x86 uses of the __cpuinit macros from
all C files.  x86 only had the one __CPUINIT used in assembly files,
and it wasn't paired off with a .previous or a __FINIT, so we can
delete it directly w/o any corresponding additional change there.

[1] https://lkml.org/lkml/2013/5/20/589

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2013-07-14 19:36:56 -04:00
Jiang Liu
46a841329a mm/x86: prepare for removing num_physpages and simplify mem_init()
Prepare for removing num_physpages and simplify mem_init().

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:38 -07:00
Tejun Heo
98e5e1bf72 dump_stack: implement arch-specific hardware description in task dumps
x86 and ia64 can acquire extra hardware identification information
from DMI and print it along with task dumps; however, the usage isn't
consistent.

* x86 show_regs() collects vendor, product and board strings and print
  them out with PID, comm and utsname.  Some of the information is
  printed again later in the same dump.

* warn_slowpath_common() explicitly accesses the DMI board and prints
  it out with "Hardware name:" label.  This applies to both x86 and
  ia64 but is irrelevant on all other archs.

* ia64 doesn't show DMI information on other non-WARN dumps.

This patch introduces arch-specific hardware description used by
dump_stack().  It can be set by calling dump_stack_set_arch_desc()
during boot and, if exists, printed out in a separate line with
"Hardware name:" label.

dmi_set_dump_stack_arch_desc() is added which sets arch-specific
description from DMI data.  It uses dmi_ids_string[] which is set from
dmi_present() used for DMI debug message.  It is superset of the
information x86 show_regs() is using.  The function is called from x86
and ia64 boot code right after dmi_scan_machine().

This makes the explicit DMI handling in warn_slowpath_common()
unnecessary.  Removed.

show_regs() isn't yet converted to use generic debug information
printing and this patch doesn't remove the duplicate DMI handling in
x86 show_regs().  The next patch will unify show_regs() handling and
remove the duplication.

An example WARN dump follows.

 WARNING: at kernel/workqueue.c:4841 init_workqueues+0x35/0x505()
 Modules linked in:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #3
 Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
  0000000000000009 ffff88007c861e08 ffffffff81c614dc ffff88007c861e48
  ffffffff8108f500 ffffffff82228240 0000000000000040 ffffffff8234a08e
  0000000000000000 0000000000000000 0000000000000000 ffff88007c861e58
 Call Trace:
  [<ffffffff81c614dc>] dump_stack+0x19/0x1b
  [<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
  [<ffffffff8108f54a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8234a0c3>] init_workqueues+0x35/0x505
  ...

v2: Use the same string as the debug message from dmi_present() which
    also contains BIOS information.  Move hardware name into its own
    line as warn_slowpath_common() did.  This change was suggested by
    Bjorn Helgaas.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Linus Torvalds
74c7d2f520 Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform changes from Ingo Molnar:
 "Small fixes and cleanups all over the map"

* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/setup: Drop unneeded include <asm/dmi.h>
  x86/olpc/xo1/sci: Don't call input_free_device() after input_unregister_device()
  x86/platform/intel/mrst: Remove cast for kmalloc() return value
  x86/platform/uv: Replace kmalloc() & memset with kzalloc()
2013-04-30 08:42:11 -07:00
Linus Torvalds
df8edfa9af Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid changes from Ingo Molnar:
 "The biggest change is x86 CPU bug handling refactoring and cleanups,
  by Borislav Petkov"

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, CPU, AMD: Drop useless label
  x86, AMD: Correct {rd,wr}msr_amd_safe warnings
  x86: Fold-in trivial check_config function
  x86, cpu: Convert AMD Erratum 400
  x86, cpu: Convert AMD Erratum 383
  x86, cpu: Convert Cyrix coma bug detection
  x86, cpu: Convert FDIV bug detection
  x86, cpu: Convert F00F bug detection
  x86, cpu: Expand cpufeature facility to include cpu bugs
2013-04-30 08:34:38 -07:00
Jean Delvare
06d219dc22 x86/setup: Drop unneeded include <asm/dmi.h>
arch/x86/kernel/setup.c includes <asm/dmi.h> but it doesn't look
like it needs it, <linux/dmi.h> is sufficient.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Link: http://lkml.kernel.org/r/1366881845.4186.65.camel@chaos.site
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
2013-04-25 11:32:51 +02:00
Yinghai Lu
adbc742bf7 x86, kdump: Change crashkernel_high/low= to crashkernel=,high/low
Per hpa, use crashkernel=X,high crashkernel=Y,low instead of
crashkernel_hign=X crashkernel_low=Y. As that could be extensible.

-v2: according to Vivek, change delimiter to ;
-v3: let hign and low only handle simple form and it conforms to
	description in kernel-parameters.txt
     still keep crashkernel=X override any crashkernel=X,high
        crashkernel=Y,low
-v4: update get_last_crashkernel returning and add more strict
     checking in parse_crashkernel_simple() found by HATAYAMA.
-v5: Change delimiter back to , according to HPA.
     also separate parse_suffix from parse_simper according to vivek.
	so we can avoid @pos in that path.
-v6: Tight the checking about crashkernel=X,highblahblah,high
     found by HTYAYAMA.

Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-5-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-17 12:35:33 -07:00
Yinghai Lu
55a20ee780 x86, kdump: Retore crashkernel= to allocate under 896M
Vivek found old kexec-tools does not work new kernel anymore.

So change back crashkernel= back to old behavoir, and add crashkernel_high=
to let user decide if buffer could be above 4G, and also new kexec-tools will
be needed.

-v2: let crashkernel=X override crashkernel_high=
    update description about _high will be ignored by crashkernel=X
-v3: update description about kernel-parameters.txt according to Vivek.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-4-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-17 12:35:33 -07:00
Yinghai Lu
c729de8fce x86, kdump: Set crashkernel_low automatically
Chao said that kdump does does work well on his system on 3.8
without extra parameter, even iommu does not work with kdump.
And now have to append crashkernel_low=Y in first kernel to make
kdump work.

We have now modified crashkernel=X to allocate memory beyong 4G (if
available) and do not allocate low range for crashkernel if the user
does not specify that with crashkernel_low=Y.  This causes regression
if iommu is not enabled.  Without iommu, swiotlb needs to be setup in
first 4G and there is no low memory available to second kernel.

Set crashkernel_low automatically if the user does not specify that.

For system that does support IOMMU with kdump properly, user could
specify crashkernel_low=0 to save that 72M low ram.

-v3: add swiotlb_size() according to Konrad.
-v4: add comments what 8M is for according to hpa.
     also update more crashkernel_low= in kernel-parameters.txt
-v5: update changelog according to Vivek.
-v6: Change description about swiotlb referring according to HATAYAMA.

Reported-by: WANG Chao <chaowang@redhat.com>
Tested-by: WANG Chao <chaowang@redhat.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-2-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-17 12:35:32 -07:00
Borislav Petkov
93a829e8e2 x86, cpu: Convert FDIV bug detection
... to the new facility. Add a reference to the wikipedia article
explaining the FDIV test we're doing here.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1363788448-31325-4-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2013-04-02 10:12:53 -07:00
Krzysztof Mazur
015221fefb x86: Fix 32-bit *_cpu_data initializers
The commit 27be457000
('x86 idle: remove 32-bit-only "no-hlt" parameter, hlt_works_ok
flag') removed the hlt_works_ok flag from struct cpuinfo_x86, but
boot_cpu_data and new_cpu_data initializers were not changed
causing setting f00f_bug flag, instead of fdiv_bug.

If CONFIG_X86_F00F_BUG is not set the f00f_bug flag is never
cleared.

To avoid such problems in future C99-style initialization is now
used.

Signed-off-by: Krzysztof Mazur <krzysiek@podlesie.net>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: len.brown@intel.com
Link: http://lkml.kernel.org/r/1362266082-2227-1-git-send-email-krzysiek@podlesie.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2013-03-06 20:15:50 -08:00
Yinghai Lu
20e6926dcb x86, ACPI, mm: Revert movablemem_map support
Tim found:

  WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
  Hardware name: S2600CP
  sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
  smpboot: Booting Node   1, Processors  #1
  Modules linked in:
  Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
  Call Trace:
    set_cpu_sibling_map+0x279/0x449
    start_secondary+0x11d/0x1e5

Don Morris reproduced on a HP z620 workstation, and bisected it to
commit e8d1955258 ("acpi, memory-hotplug: parse SRAT before memblock
is ready")

It turns out movable_map has some problems, and it breaks several things

1. numa_init is called several times, NOT just for srat. so those
	nodes_clear(numa_nodes_parsed)
	memset(&numa_meminfo, 0, sizeof(numa_meminfo))
   can not be just removed.  Need to consider sequence is: numaq, srat, amd, dummy.
   and make fall back path working.

2. simply split acpi_numa_init to early_parse_srat.
   a. that early_parse_srat is NOT called for ia64, so you break ia64.
   b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
	     set_apicid_to_node(i, NUMA_NO_NODE)
     still left in numa_init. So it will just clear result from early_parse_srat.
     it should be moved before that....
   c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
       early before override from INITRD is settled.

3. that patch TITLE is total misleading, there is NO x86 in the title,
   but it changes critical x86 code. It caused x86 guys did not
   pay attention to find the problem early. Those patches really should
   be routed via tip/x86/mm.

4. after that commit, following range can not use movable ram:
  a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
  b. initrd... it will be freed after booting, so it could be on movable...
  c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G
	anymore.
  d. init_mem_mapping: can not put page table high anymore.
  e. initmem_init: vmemmap can not be high local node anymore. That is
     not good.

If node is hotplugable, the mem related range like page table and
vmemmap could be on the that node without problem and should be on that
node.

We have workaround patch that could fix some problems, but some can not
be fixed.

So just remove that offending commit and related ones including:

 f7210e6c4a ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
    protect movablecore_map in memblock_overlaps_region().")

 01a178a94e ("acpi, memory-hotplug: support getting hotplug info from
    SRAT")

 27168d38fa ("acpi, memory-hotplug: extend movablemem_map ranges to
    the end of node")

 e8d1955258 ("acpi, memory-hotplug: parse SRAT before memblock is
    ready")

 fb06bc8e5f ("page_alloc: bootmem limit with movablecore_map")

 42f47e27e7 ("page_alloc: make movablemem_map have higher priority")

 6981ec3114 ("page_alloc: introduce zone_movable_limit[] to keep
    movable limit for nodes")

 34b71f1e04 ("page_alloc: add movable_memmap kernel parameter")

 4d59a75125 ("x86: get pg_data_t's memory from other node")

Later we should have patches that will make sure kernel put page table
and vmemmap on local node ram instead of push them down to node0.  Also
need to find way to put other kernel used ram to local node ram.

Reported-by: Tim Gardner <tim.gardner@canonical.com>
Reported-by: Don Morris <don.morris@hp.com>
Bisected-by: Don Morris <don.morris@hp.com>
Tested-by: Don Morris <don.morris@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-02 09:34:39 -08:00
Linus Torvalds
e3c4877de8 Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/EFI changes from Peter Anvin:

 - Improve the initrd handling in the EFI boot stub by allowing forward
   slashes in the pathname - from Chun-Yi Lee.

 - Cleanup code duplication in the EFI mixed kernel/firmware code - from
   Satoru Takeuchi.

 - efivarfs bug fixes for more strict filename validation, with lots of
   input from Al Viro.

* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, efi: remove duplicate code in setup_arch() by using, efi_is_native()
  efivarfs: guid part of filenames are case-insensitive
  efivarfs: Validate filenames much more aggressively
  efivarfs: Use sizeof() instead of magic number
  x86, efi: Allow slash in file path of initrd
2013-02-27 16:17:42 -08:00
Tang Chen
e8d1955258 acpi, memory-hotplug: parse SRAT before memblock is ready
On linux, the pages used by kernel could not be migrated.  As a result,
if a memory range is used by kernel, it cannot be hot-removed.  So if we
want to hot-remove memory, we should prevent kernel from using it.

The way now used to prevent this is specify a memory range by
movablemem_map boot option and set it as ZONE_MOVABLE.

But when the system is booting, memblock will allocate memory, and
reserve the memory for kernel.  And before we parse SRAT, and know the
node memory ranges, memblock is working.  And it may allocate memory in
ranges to be set as ZONE_MOVABLE.  This memory can be used by kernel,
and never be freed.

So, let's parse SRAT before memblock is called first.  And it is early
enough.

The first call of memblock_find_in_range_node() is in:

  setup_arch()
    |-->setup_real_mode()

so, this patch add a function early_parse_srat() to parse SRAT, and call
it before setup_real_mode() is called.

NOTE:

1) early_parse_srat() is called before numa_init(), and has initialized
   numa_meminfo.  So DO NOT clear numa_nodes_parsed in numa_init() and DO
   NOT zero numa_meminfo in numa_init(), otherwise we will lose memory
   numa info.

2) I don't know why using count of memory affinities parsed from SRAT
   as a return value in original acpi_numa_init().  So I add a static
   variable srat_mem_cnt to remember this count and use it as the return
   value of the new acpi_numa_init()

[mhocko@suse.cz: parse SRAT before memblock is ready fix]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:14 -08:00
Linus Torvalds
2ef14f465b Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm changes from Peter Anvin:
 "This is a huge set of several partly interrelated (and concurrently
  developed) changes, which is why the branch history is messier than
  one would like.

  The *really* big items are two humonguous patchsets mostly developed
  by Yinghai Lu at my request, which completely revamps the way we
  create initial page tables.  In particular, rather than estimating how
  much memory we will need for page tables and then build them into that
  memory -- a calculation that has shown to be incredibly fragile -- we
  now build them (on 64 bits) with the aid of a "pseudo-linear mode" --
  a #PF handler which creates temporary page tables on demand.

  This has several advantages:

  1. It makes it much easier to support things that need access to data
     very early (a followon patchset uses this to load microcode way
     early in the kernel startup).

  2. It allows the kernel and all the kernel data objects to be invoked
     from above the 4 GB limit.  This allows kdump to work on very large
     systems.

  3. It greatly reduces the difference between Xen and native (Xen's
     equivalent of the #PF handler are the temporary page tables created
     by the domain builder), eliminating a bunch of fragile hooks.

  The patch series also gets us a bit closer to W^X.

  Additional work in this pull is the 64-bit get_user() work which you
  were also involved with, and a bunch of cleanups/speedups to
  __phys_addr()/__pa()."

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (105 commits)
  x86, mm: Move reserving low memory later in initialization
  x86, doc: Clarify the use of asm("%edx") in uaccess.h
  x86, mm: Redesign get_user with a __builtin_choose_expr hack
  x86: Be consistent with data size in getuser.S
  x86, mm: Use a bitfield to mask nuisance get_user() warnings
  x86/kvm: Fix compile warning in kvm_register_steal_time()
  x86-32: Add support for 64bit get_user()
  x86-32, mm: Remove reference to alloc_remap()
  x86-32, mm: Remove reference to resume_map_numa_kva()
  x86-32, mm: Rip out x86_32 NUMA remapping code
  x86/numa: Use __pa_nodebug() instead
  x86: Don't panic if can not alloc buffer for swiotlb
  mm: Add alloc_bootmem_low_pages_nopanic()
  x86, 64bit, mm: hibernate use generic mapping_init
  x86, 64bit, mm: Mark data/bss/brk to nx
  x86: Merge early kernel reserve for 32bit and 64bit
  x86: Add Crash kernel low reservation
  x86, kdump: Remove crashkernel range find limit for 64bit
  memblock: Add memblock_mem_size()
  x86, boot: Not need to check setup_header version for setup_data
  ...
2013-02-21 18:06:55 -08:00
H. Peter Anvin
0da3e7f526 Merge branch 'x86/mm2' into x86/mm
x86/mm2 is testing out fine, but has developed conflicts with x86/mm
due to patches in adjacent code.  Merge them so we can drop x86/mm2
and have a unified branch.

Resolved Conflicts:
	arch/x86/kernel/setup.c
2013-02-15 09:25:08 -08:00
H. Peter Anvin
95c9608478 x86, mm: Move reserving low memory later in initialization
Move the reservation of low memory, except for the 4K which actually
does belong to the BIOS, later in the initialization; in particular,
after we have already reserved the trampoline.

The current code locates the trampoline as high as possible, so by
deferring the allocation we will still be able to reserve as much
memory as is possible.  This allows us to run with reservelow=640k
without getting a crash on system startup.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/n/tip-0y9dqmmsousf69wutxwl3kkf@git.kernel.org
2013-02-14 15:21:25 -08:00
Satoru Takeuchi
6b59e366e0 x86, efi: remove duplicate code in setup_arch() by using, efi_is_native()
The check, "IS_ENABLED(CONFIG_X86_64) != efi_enabled(EFI_64BIT)",
in setup_arch() can be replaced by efi_is_enabled(). This change
remove duplicate code and improve readability.

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Olof Johansson <olof@lixom.net>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-02-14 10:36:18 +00:00
H. Peter Anvin
68d00bbebb Merge remote-tracking branch 'origin/x86/mm' into x86/mm2
Explicitly merging these two branches due to nontrivial conflicts and
to allow further work.

Resolved Conflicts:
	arch/x86/kernel/head32.c
	arch/x86/kernel/head64.c
	arch/x86/mm/init_64.c
	arch/x86/realmode/init.c

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-02-01 02:28:36 -08:00
Matt Fleming
83e6818974 efi: Make 'efi_enabled' a function to query EFI facilities
Originally 'efi_enabled' indicated whether a kernel was booted from
EFI firmware. Over time its semantics have changed, and it now
indicates whether or not we are booted on an EFI machine with
bit-native firmware, e.g. 64-bit kernel with 64-bit firmware.

The immediate motivation for this patch is the bug report at,

    https://bugs.launchpad.net/ubuntu-cdimage/+bug/1040557

which details how running a platform driver on an EFI machine that is
designed to run under BIOS can cause the machine to become
bricked. Also, the following report,

    https://bugzilla.kernel.org/show_bug.cgi?id=47121

details how running said driver can also cause Machine Check
Exceptions. Drivers need a new means of detecting whether they're
running on an EFI machine, as sadly the expression,

    if (!efi_enabled)

hasn't been a sufficient condition for quite some time.

Users actually want to query 'efi_enabled' for different reasons -
what they really want access to is the list of available EFI
facilities.

For instance, the x86 reboot code needs to know whether it can invoke
the ResetSystem() function provided by the EFI runtime services, while
the ACPI OSL code wants to know whether the EFI config tables were
mapped successfully. There are also checks in some of the platform
driver code to simply see if they're running on an EFI machine (which
would make it a bad idea to do BIOS-y things).

This patch is a prereq for the samsung-laptop fix patch.

Cc: David Airlie <airlied@linux.ie>
Cc: Corentin Chary <corentincj@iksaif.net>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Peter Jones <pjones@redhat.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Steve Langasek <steve.langasek@canonical.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-30 11:51:59 -08:00
Yinghai Lu
6c902b656c x86: Merge early kernel reserve for 32bit and 64bit
They are the same, and we could move them out from head32/64.c to setup.c.

We are using memblock, and it could handle overlapping properly, so
we don't need to reserve some at first to hold the location, and just
need to make sure we reserve them before we are using memblock to find
free mem to use.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-32-git-send-email-yinghai@kernel.org
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 19:32:58 -08:00
Yinghai Lu
0212f91596 x86: Add Crash kernel low reservation
During kdump kernel's booting stage, it need to find low ram for
swiotlb buffer when system does not support intel iommu/dmar remapping.

kexed-tools is appending memmap=exactmap and range from /proc/iomem
with "Crash kernel", and that range is above 4G for 64bit after boot
protocol 2.12.

We need to add another range in /proc/iomem like "Crash kernel low",
so kexec-tools could find that info and append to kdump kernel
command line.

Try to reserve some under 4G if the normal "Crash kernel" is above 4G.

User could specify the size with crashkernel_low=XX[KMG].

-v2: fix warning that is found by Fengguang's test robot.
-v3: move out get_mem_size change to another patch, to solve compiling
     warning that is found by Borislav Petkov <bp@alien8.de>
-v4: user must specify crashkernel_low if system does not support
     intel or amd iommu.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-31-git-send-email-yinghai@kernel.org
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Rob Landley <rob@landley.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 19:32:58 -08:00
Yinghai Lu
7d41a8a4a2 x86, kdump: Remove crashkernel range find limit for 64bit
Now kexeced kernel/ramdisk could be above 4g, so remove 896 limit for
64bit.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-30-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 19:32:58 -08:00
Yinghai Lu
595ad9af85 memblock: Add memblock_mem_size()
Use it to get mem size under the limit_pfn.
to replace local version in x86 reserved_initrd.

-v2: remove not needed cast that is pointed out by HPA.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-29-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 19:32:57 -08:00
Yinghai Lu
d1af6d045f x86, boot: Not need to check setup_header version for setup_data
That is for bootloaders.

setup_data is in setup_header, and bootloader is copying that from bzImage.
So for old bootloader should keep that as 0 already.

old kexec-tools till now for elf image set setup_data to 0, so it is ok.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-28-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 19:32:57 -08:00
Yinghai Lu
ee92d81502 x86, boot: Support loading bzImage, boot_params and ramdisk above 4G
xloadflags bit 1 indicates that we can load the kernel and all data
structures above 4G; it is set if kernel is relocatable and 64bit.

bootloader will check if xloadflags bit 1 is set to decide if
it could load ramdisk and kernel high above 4G.

bootloader will fill value to ext_ramdisk_image/size for high 32bits
when it load ramdisk above 4G.
kernel use get_ramdisk_image/size to use ext_ramdisk_image/size to get
right positon for ramdisk.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Rob Landley <rob@landley.net>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: Gokul Caushik <caushik1@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joe Millenbach <jmillenbach@gmail.com>
Link: http://lkml.kernel.org/r/1359058816-7615-26-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 19:32:33 -08:00
Yinghai Lu
a8a51a88d5 x86: Add get_ramdisk_image/size()
There are several places to find ramdisk information early for reserving
and relocating.

Use accessor functions to make code more readable and consistent.

Later will add ext_ramdisk_image/size in those functions to support
loading ramdisk above 4g.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-16-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 15:21:05 -08:00
Yinghai Lu
1b8c78be01 x86: Merge early_reserve_initrd for 32bit and 64bit
They are the same, could move them out from head32/64.c to setup.c.

We are using memblock, and it could handle overlapping properly, so
we don't need to reserve some at first to hold the location, and just
need to make sure we reserve them before we are using memblock to find
free mem to use.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-15-git-send-email-yinghai@kernel.org
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 15:20:41 -08:00
Yinghai Lu
100542306f x86, 64bit: Don't set max_pfn_mapped wrong value early on native path
We are not having max_pfn_mapped set correctly until init_memory_mapping.
So don't print its initial value for 64bit

Also need to use KERNEL_IMAGE_SIZE directly for highmap cleanup.

-v2: update comments about max_pfn_mapped according to Stefano Stabellini.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-14-git-send-email-yinghai@kernel.org
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 15:20:16 -08:00
H. Peter Anvin
8170e6bed4 x86, 64bit: Use a #PF handler to materialize early mappings on demand
Linear mode (CR0.PG = 0) is mutually exclusive with 64-bit mode; all
64-bit code has to use page tables.  This makes it awkward before we
have first set up properly all-covering page tables to access objects
that are outside the static kernel range.

So far we have dealt with that simply by mapping a fixed amount of
low memory, but that fails in at least two upcoming use cases:

1. We will support load and run kernel, struct boot_params, ramdisk,
   command line, etc. above the 4 GiB mark.
2. need to access ramdisk early to get microcode to update that as
   early possible.

We could use early_iomap to access them too, but it will make code to
messy and hard to be unified with 32 bit.

Hence, set up a #PF table and use a fixed number of buffers to set up
page tables on demand.  If the buffers fill up then we simply flush
them and start over.  These buffers are all in __initdata, so it does
not increase RAM usage at runtime.

Thus, with the help of the #PF handler, we can set the final kernel
mapping from blank, and switch to init_level4_pgt later.

During the switchover in head_64.S, before #PF handler is available,
we use three pages to handle kernel crossing 1G, 512G boundaries with
sharing page by playing games with page aliasing: the same page is
mapped twice in the higher-level tables with appropriate wraparound.
The kernel region itself will be properly mapped; other mappings may
be spurious.

early_make_pgtable is using kernel high mapping address to access pages
to set page table.

-v4: Add phys_base offset to make kexec happy, and add
	init_mapping_kernel()   - Yinghai
-v5: fix compiling with xen, and add back ident level3 and level2 for xen
     also move back init_level4_pgt from BSS to DATA again.
     because we have to clear it anyway.  - Yinghai
-v6: switch to init_level4_pgt in init_mem_mapping. - Yinghai
-v7: remove not needed clear_page for init_level4_page
     it is with fill 512,8,0 already in head_64.S  - Yinghai
-v8: we need to keep that handler alive until init_mem_mapping and don't
     let early_trap_init to trash that early #PF handler.
     So split early_trap_pf_init out and move it down. - Yinghai
-v9: switchover only cover kernel space instead of 1G so could avoid
     touch possible mem holes. - Yinghai
-v11: change far jmp back to far return to initial_code, that is needed
     to fix failure that is reported by Konrad on AMD systems.  - Yinghai

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-12-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 15:20:06 -08:00
Yinghai Lu
4f7b92263a x86, realmode: Separate real_mode reserve and setup
After we switch to use #PF handler help to set page table, init_level4_pgt
will only have entries set after init_mem_mapping().
We need to move copying init_level4_pgt to trampoline_pgd after that.

So split reserve and setup, and move the setup after init_mem_mapping()

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-11-git-send-email-yinghai@kernel.org
Cc: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 15:13:24 -08:00
Yinghai Lu
b422a30917 x86: Factor out e820_add_kernel_range()
Separate out the reservation of the kernel static memory areas into a
separate function.

Also add support for case when memmap=xxM$yyM is used without exactmap.
Need to remove reserved range at first before we add E820_RAM
range, otherwise added E820_RAM range will be ignored.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-5-git-send-email-yinghai@kernel.org
Cc: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 15:12:24 -08:00