Kernel stacks on x86-64 bit
---------------------------

Most of the text from Keith Owens, hacked by AK

x86_64 page size (PAGE_SIZE) is 4K.

Like all other architectures, x86_64 has a kernel stack for every
active thread.  These thread stacks are THREAD_SIZE (4*PAGE_SIZE) big.
These stacks contain useful data as long as a thread is alive or a
zombie. While the thread is in user space the kernel stack is empty
except for the thread_info structure at the bottom.

In addition to the per thread stacks, there are specialized stacks
associated with each CPU.  These stacks are only used while the kernel
is in control on that CPU; when a CPU returns to user space the
specialized stacks contain no useful data.  The main CPU stacks are:

* Interrupt stack.  IRQ_STACK_SIZE

  Used for external hardware interrupts.  If this is the first external
  hardware interrupt (i.e. not a nested hardware interrupt) then the
  kernel switches from the current task to the interrupt stack.  Like
  the split thread and interrupt stacks on i386, this gives more room
  for kernel interrupt processing without having to increase the size
  of every per thread stack.

  The interrupt stack is also used when processing a softirq.

Switching to the kernel interrupt stack is done by software based on a
per CPU interrupt nest counter. This is needed because x86-64 "IST"
hardware stacks cannot nest without races.

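The software switch described above can be sketched roughly as follows.
This is a simplified user-space model, not the kernel's entry code: the
struct, the field names and the two functions are hypothetical, and the
counter is assumed to start at 0 when no hardware interrupt is active.

```c
#include <assert.h>

enum stack_kind { TASK_STACK, IRQ_STACK };

/* Hypothetical per-CPU state; names are illustrative only. */
struct cpu_state {
	int irq_nest;            /* 0 means no hardware interrupt is active */
	enum stack_kind current; /* stack this CPU is currently running on */
};

/* Entry: only the first (non-nested) interrupt switches stacks. */
void irq_enter_sketch(struct cpu_state *cpu)
{
	if (cpu->irq_nest++ == 0)
		cpu->current = IRQ_STACK;
}

/* Exit: switch back only when the outermost interrupt finishes. */
void irq_exit_sketch(struct cpu_state *cpu)
{
	assert(cpu->irq_nest > 0);
	if (--cpu->irq_nest == 0)
		cpu->current = TASK_STACK;
}
```

A nested interrupt finds the counter already non-zero and keeps running
on the interrupt stack it arrived on, so the stack is never reset
underneath an outer handler.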
x86_64 also has a feature which is not available on i386, the ability
to automatically switch to a new stack for designated events such as
double fault or NMI, which makes it easier to handle these unusual
events on x86_64.  This feature is called the Interrupt Stack Table
(IST).  There can be up to 7 IST entries per CPU. The IST code is an
index into the Task State Segment (TSS). The IST entries in the TSS
point to dedicated stacks; each stack can be a different size.

An IST is selected by a non-zero value in the IST field of an
interrupt-gate descriptor.  When an interrupt occurs and the hardware
loads such a descriptor, the hardware automatically sets the new stack
pointer based on the IST value, then invokes the interrupt handler.  If
the interrupt came from user mode, then the interrupt handler prologue
will switch back to the per-thread stack.  If software wants to allow
nested IST interrupts then the handler must adjust the IST values on
entry to and exit from the interrupt handler.  (This is occasionally
done, e.g. for debug exceptions.)

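To make the selection rule concrete, here is a small C model of the two
hardware structures involved.  The field layout follows the 64-bit
interrupt-gate and TSS formats in the architecture manuals, but the
struct and function names are invented for this sketch and
__attribute__((packed)) assumes GCC or Clang.  It models only the stack
selection, not the rest of interrupt delivery.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit interrupt-gate descriptor (16 bytes). */
struct gate_desc_sketch {
	uint16_t offset_low;
	uint16_t segment;
	uint8_t  ist;           /* bits 0-2: IST index, bits 3-7 reserved */
	uint8_t  type_attr;     /* gate type, DPL and present bit */
	uint16_t offset_middle;
	uint32_t offset_high;
	uint32_t reserved;
} __attribute__((packed));

/* 64-bit Task State Segment (104 bytes). */
struct tss_sketch {
	uint32_t reserved0;
	uint64_t rsp[3];        /* RSP0..RSP2 */
	uint64_t reserved1;
	uint64_t ist[7];        /* IST1..IST7 */
	uint64_t reserved2;
	uint16_t reserved3;
	uint16_t io_map_base;
} __attribute__((packed));

_Static_assert(sizeof(struct gate_desc_sketch) == 16, "gate is 16 bytes");
_Static_assert(sizeof(struct tss_sketch) == 104, "TSS is 104 bytes");

/*
 * Return the stack pointer taken from the IST, or 0 when the gate's
 * IST field is zero and the normal stack-switch rules apply instead.
 */
uint64_t ist_stack_for_gate(const struct tss_sketch *tss,
			    const struct gate_desc_sketch *gate)
{
	unsigned int idx = gate->ist & 0x7;

	assert(tss && gate);
	if (idx == 0)
		return 0;
	return tss->ist[idx - 1]; /* IST codes are 1-based */
}
```

Because the IST code is 1-based and 0 means "no IST", a 3-bit field
yields exactly the 7 usable entries mentioned above.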
Events with different IST codes (i.e. with different stacks) can be
nested.  For example, a debug interrupt can safely be interrupted by an
NMI.  arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack
pointers on entry to and exit from all IST events, in theory allowing
IST events with the same code to be nested.  However, in most cases the
stack size allocated to an IST assumes no nesting for the same code.
If that assumption is ever broken then the stacks will become corrupt.

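The entry/exit adjustment can be pictured with the following sketch.
It is a user-space model under stated assumptions, not the code in
entry.S: the slot struct, the function names and SKETCH_FRAME_SIZE (the
room reserved per nesting level) are all hypothetical.

```c
#include <stdint.h>

#define SKETCH_FRAME_SIZE 512u /* hypothetical room per nesting level */

/* Models one IST entry in the TSS. */
struct ist_slot {
	uint64_t top;
};

/*
 * Entry: the hardware loads the stack pointer from the IST entry; the
 * handler then lowers the entry so that a nested event with the same
 * IST code starts below the frame that is still in use.
 */
uint64_t ist_event_entry(struct ist_slot *slot)
{
	uint64_t sp = slot->top;

	slot->top -= SKETCH_FRAME_SIZE;
	return sp;
}

/* Exit: restore the IST entry for the next event. */
void ist_event_exit(struct ist_slot *slot)
{
	slot->top += SKETCH_FRAME_SIZE;
}
```

Without the adjustment, a second event with the same code would be
given the same starting stack pointer and overwrite the first frame.
Nothing in this sketch checks the bottom of the stack, which mirrors
the caveat above: the adjustment only helps if the stack was sized for
the nesting depth that actually occurs.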
The currently assigned IST stacks are:

* DOUBLEFAULT_STACK.  EXCEPTION_STKSZ (PAGE_SIZE).

  Used for interrupt 8 - Double Fault Exception (#DF).

  Invoked when handling one exception causes another exception. Happens
  when the kernel is very confused (e.g. kernel stack pointer corrupt).
  Using a separate stack allows the kernel to recover from it well enough
  in many cases to still output an oops.

* NMI_STACK.  EXCEPTION_STKSZ (PAGE_SIZE).

  Used for non-maskable interrupts (NMI).

  NMI can be delivered at any time, including when the kernel is in the
  middle of switching stacks.  Using IST for NMI events avoids making
  assumptions about the previous state of the kernel stack.

* DEBUG_STACK.  DEBUG_STKSZ

  Used for hardware debug interrupts (interrupt 1) and for software
  debug interrupts (INT3).

  When debugging a kernel, debug interrupts (both hardware and
  software) can occur at any time.  Using IST for these interrupts
  avoids making assumptions about the previous state of the kernel
  stack.

* MCE_STACK.  EXCEPTION_STKSZ (PAGE_SIZE).

  Used for interrupt 18 - Machine Check Exception (#MC).

  MCE can be delivered at any time, including when the kernel is in the
  middle of switching stacks.  Using IST for MCE events avoids making
  assumptions about the previous state of the kernel stack.

For more details see the Intel IA32 or AMD AMD64 architecture manuals.


Printing backtraces on x86
--------------------------

The question about the '?' preceding function names in an x86 stacktrace
keeps popping up, so here is an in-depth explanation. It helps if the
reader stares at print_context_stack() and the whole machinery in and
around arch/x86/kernel/dumpstack.c.

Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>:

We always scan the full kernel stack for return addresses stored on
the kernel stack(s) [*], from stack top to stack bottom, and print out
anything that 'looks like' a kernel text address.

If it fits into the frame pointer chain, we print it without a question
mark, knowing that it's part of the real backtrace.

If the address does not fit into our expected frame pointer chain we
still print it, but we print a '?'. It can mean two things:

 - either the address is not part of the call chain: it's just stale
   values on the kernel stack, from earlier function calls. This is
   the common case.

 - or it is part of the call chain, but the frame pointer was not set
   up properly within the function, so we don't recognize it.

This way we will always print out the real call chain (plus a few more
entries), regardless of whether the frame pointer was set up correctly
or not - but in most cases we'll get the call chain right as well. The
entries printed are strictly in stack order, so you can deduce more
information from that as well.

The most important property of this method is that we _never_ lose
information: we always strive to print _all_ addresses on the stack(s)
that look like kernel text addresses, so if debug information is wrong,
we still print out the real call chain as well - just with more question
marks than ideal.

[*] For things like IRQ and IST stacks, we also scan those stacks, in
    the right order, and try to cross from one stack into another
    reconstructing the call chain. This works most of the time.
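
The scan described above can be modelled in a few lines of C.  This is
a toy version for illustration, not print_context_stack() itself: the
stack is an array of words with index 0 at the stack top, frame
pointers are stored as array indices rather than addresses, and the
text range, struct and function names are all made up for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define TEXT_START 0x1000u /* hypothetical kernel text range */
#define TEXT_END   0x2000u

struct bt_entry {
	uintptr_t addr;
	int reliable; /* 1: on the frame pointer chain, 0: printed with '?' */
};

int looks_like_text(uintptr_t v)
{
	return v >= TEXT_START && v < TEXT_END;
}

/*
 * stack[bp] holds the index of the next frame's saved frame pointer
 * and stack[bp + 1] holds that frame's return address.  A value of
 * nwords or more ends the chain.  Returns the number of entries found.
 */
size_t scan_stack(const uintptr_t *stack, size_t nwords, size_t bp,
		  struct bt_entry *out, size_t max_out)
{
	size_t n = 0;

	for (size_t i = 0; i < nwords && n < max_out; i++) {
		if (!looks_like_text(stack[i]))
			continue;

		out[n].addr = stack[i];
		if (i > 0 && bp == i - 1) {
			/* Sits right after a saved frame pointer. */
			out[n].reliable = 1;
			bp = (size_t)stack[bp]; /* follow the chain */
		} else {
			/* Stale value or broken frame: keep it, mark '?'. */
			out[n].reliable = 0;
		}
		n++;
	}
	return n;
}
```

Every word that looks like a text address is reported, so a broken
frame pointer chain only turns entries into '?' entries; it never
drops them.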