mirror of
				git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
				synced 2025-09-04 20:19:47 +08:00 
			
		
		
		
	 1975dbc276
			
		
	
	
		
	
	
	
	
		
			
			Fix a few small mistakes in the static key documentation and delete an unneeded sentence. Suggested-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150914171105.511e1e21@lwn.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
		
			
				
	
	
		
			292 lines
		
	
	
		
			12 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
		
	
	
	
	
	
| 			Static Keys
 | |
| 			-----------
 | |
| 
 | |
| DEPRECATED API:
 | |
| 
 | |
| The use of 'struct static_key' directly is now DEPRECATED. In addition,
 | |
| static_key_{true,false}() is also DEPRECATED, i.e. DO NOT use the following:
 | |
| 
 | |
| struct static_key false = STATIC_KEY_INIT_FALSE;
 | |
| struct static_key true = STATIC_KEY_INIT_TRUE;
 | |
| static_key_true()
 | |
| static_key_false()
 | |
| 
 | |
| The updated API replacements are:
 | |
| 
 | |
| DEFINE_STATIC_KEY_TRUE(key);
 | |
| DEFINE_STATIC_KEY_FALSE(key);
 | |
| static_branch_likely()
 | |
| static_branch_unlikely()
 | |
| 
 | |
| 0) Abstract
 | |
| 
 | |
| Static keys allow the inclusion of seldom used features in
 | |
| performance-sensitive fast-path kernel code, via a GCC feature and a code
 | |
| patching technique. A quick example:
 | |
| 
 | |
| 	DEFINE_STATIC_KEY_FALSE(key);
 | |
| 
 | |
| 	...
 | |
| 
 | |
|         if (static_branch_unlikely(&key))
 | |
|                 do unlikely code
 | |
|         else
 | |
|                 do likely code
 | |
| 
 | |
| 	...
 | |
| 	static_branch_enable(&key);
 | |
| 	...
 | |
| 	static_branch_disable(&key);
 | |
| 	...
 | |
| 
 | |
| The static_branch_unlikely() branch will be generated into the code with as little
 | |
| impact to the likely code path as possible.
 | |
| 
 | |
| 
 | |
| 1) Motivation
 | |
| 
 | |
| 
 | |
| Currently, tracepoints are implemented using a conditional branch. The
 | |
| conditional check requires checking a global variable for each tracepoint.
 | |
| Although the overhead of this check is small, it increases when the memory
 | |
| cache comes under pressure (memory cache lines for these global variables may
 | |
| be shared with other memory accesses). As we increase the number of tracepoints
 | |
| in the kernel this overhead may become more of an issue. In addition,
 | |
| tracepoints are often dormant (disabled) and provide no direct kernel
 | |
| functionality. Thus, it is highly desirable to reduce their impact as much as
 | |
| possible. Although tracepoints are the original motivation for this work, other
 | |
| kernel code paths should be able to make use of the static keys facility.
 | |
| 
 | |
| 
 | |
| 2) Solution
 | |
| 
 | |
| 
 | |
| gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label:
 | |
| 
 | |
| http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html
 | |
| 
 | |
| Using the 'asm goto', we can create branches that are either taken or not taken
 | |
| by default, without the need to check memory. Then, at run-time, we can patch
 | |
| the branch site to change the branch direction.
 | |
| 
 | |
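As a rough user-space illustration of the 'asm goto' construct itself, the
sketch below declares a branch site whose asm body is empty, so control always
falls through. It is not the kernel's arch_static_branch(): the real site also
emits a patchable no-op and records its address in a jump table. The names
branch_site() and check_default_direction() are hypothetical, and the sketch
assumes GCC 4.5 or later (or a compatible compiler).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical branch site. The asm body is empty, so nothing ever jumps to
 * l_yes at run-time; the label is only listed as a possible target. In the
 * kernel, the body would hold the no-op that later gets patched into a jump.
 */
static inline bool branch_site(void)
{
	__asm__ goto ("" : : : : l_yes);
	return false;		/* straight-line (default) direction */
l_yes:
	return true;		/* out-of-line direction */
}

/* With nothing patched in, the site takes the fall-through direction. */
static inline void check_default_direction(void)
{
	assert(!branch_site());
}
```

Note that no memory is read to decide the direction; the decision is encoded
in the instruction stream alone.
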
| For example, if we have a simple branch that is disabled by default:
 | |
| 
 | |
| 	if (static_branch_unlikely(&key))
 | |
| 		printk("I am the true branch\n");
 | |
| 
 | |
| Thus, by default the 'printk' will not be emitted. And the code generated will
 | |
| consist of a single atomic 'no-op' instruction (5 bytes on x86), in the
 | |
| straight-line code path. When the branch is 'flipped', we will patch the
 | |
| 'no-op' in the straight-line codepath with a 'jump' instruction to the
 | |
| out-of-line true branch. Thus, changing branch direction is expensive but
 | |
| branch selection is basically 'free'. That is the basic tradeoff of this
 | |
| optimization.
 | |
| 
 | |
| This low-level patching mechanism is called 'jump label patching', and it gives
 | |
| the basis for the static keys facility.
 | |
| 
 | |
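The tradeoff described above can be modelled in plain C. The sketch below is
only a user-space model under our own assumptions, not kernel code: each
branch site is represented by an enum instead of real instruction bytes, and
all names (model_key, model_add_site(), model_flip(), model_branch_taken())
are hypothetical. Flipping the key walks every site that uses it, while taking
the branch only looks at the site itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What sits in the straight-line code at a branch site. */
enum site_insn { SITE_NOP, SITE_JMP };

#define MAX_SITES 4

/* Model of a key: it only remembers which sites depend on it. */
struct model_key {
	enum site_insn *sites[MAX_SITES];
	size_t nr_sites;
};

/* Register a site with the key; it starts as a no-op (branch not taken). */
static inline void model_add_site(struct model_key *key, enum site_insn *site)
{
	assert(key->nr_sites < MAX_SITES);
	*site = SITE_NOP;
	key->sites[key->nr_sites++] = site;
}

/* Slow path: changing direction rewrites every site that uses the key. */
static inline void model_flip(struct model_key *key, bool enable)
{
	for (size_t i = 0; i < key->nr_sites; i++)
		*key->sites[i] = enable ? SITE_JMP : SITE_NOP;
}

/* Fast path: only the site is inspected, the key is never read. */
static inline bool model_branch_taken(const enum site_insn *site)
{
	return *site == SITE_JMP;
}
```
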
| 3) Static key label API, usage and examples:
 | |
| 
 | |
| 
 | |
| In order to make use of this optimization you must first define a key:
 | |
| 
 | |
| 	DEFINE_STATIC_KEY_TRUE(key);
 | |
| 
 | |
| or:
 | |
| 
 | |
| 	DEFINE_STATIC_KEY_FALSE(key);
 | |
| 
 | |
| 
 | |
| The key must be global, that is, it can't be allocated on the stack or dynamically
 | |
| allocated at run-time.
 | |
| 
 | |
| The key is then used in code as:
 | |
| 
 | |
|         if (static_branch_unlikely(&key))
 | |
|                 do unlikely code
 | |
|         else
 | |
|                 do likely code
 | |
| 
 | |
| Or:
 | |
| 
 | |
|         if (static_branch_likely(&key))
 | |
|                 do likely code
 | |
|         else
 | |
|                 do unlikely code
 | |
| 
 | |
| Keys defined via DEFINE_STATIC_KEY_TRUE() or DEFINE_STATIC_KEY_FALSE() may
 | |
| be used in either static_branch_likely() or static_branch_unlikely()
 | |
| statements.
 | |
| 
 | |
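Since either kind of key may be combined with either kind of branch, there are
four cases. As we understand the scheme, a site holds a no-op when the key's
state agrees with the hint (likely with a true key, unlikely with a false key)
and a jump otherwise. The sketch below is a user-space model of that rule with
hypothetical names (site_needs_jump(), branch_value(), check_table()); it is
not taken from the kernel sources.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * True when the straight-line code at the site must be a jump, false when a
 * no-op is enough. The hinted ("likely") code is laid out in-line, so a jump
 * is needed exactly when the key's state disagrees with the hint.
 */
static inline bool site_needs_jump(bool likely_site, bool key_enabled)
{
	return likely_site != key_enabled;
}

/* The condition itself always evaluates to the key's state. */
static inline bool branch_value(bool key_enabled)
{
	return key_enabled;
}

/* The four combinations of hint and key state. */
static inline void check_table(void)
{
	assert(!site_needs_jump(true, true));	/* likely,   true key:  no-op */
	assert(site_needs_jump(true, false));	/* likely,   false key: jump  */
	assert(site_needs_jump(false, true));	/* unlikely, true key:  jump  */
	assert(!site_needs_jump(false, false));	/* unlikely, false key: no-op */
}
```
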
| Branch(es) can be set true via:
 | |
| 
 | |
| static_branch_enable(&key);
 | |
| 
 | |
| or false via:
 | |
| 
 | |
| static_branch_disable(&key);
 | |
| 
 | |
| The branch(es) can also be switched via reference counts:
 | |
| 
 | |
| 	static_branch_inc(&key);
 | |
| 	...
 | |
| 	static_branch_dec(&key);
 | |
| 
 | |
| Thus, 'static_branch_inc()' means 'make the branch true', and
 | |
| 'static_branch_dec()' means 'make the branch false' with appropriate
 | |
| reference counting. For example, if the key is initialized true, a
 | |
| static_branch_dec() will switch the branch to false, and a subsequent
 | |
| static_branch_inc() will change the branch back to true. Likewise, if the
 | |
| key is initialized false, a static_branch_inc() will change the branch to
 | |
| true, and then a static_branch_dec() will again make the branch false.
 | |
| 
 | |
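The reference-counting behaviour can be sketched as a small user-space model.
This is an illustration under our own assumptions, not the kernel
implementation: the type ref_key and the helpers ref_key_enabled(),
ref_key_inc() and ref_key_dec() are hypothetical stand-ins for a key and for
static_branch_inc()/static_branch_dec(), and the model assumes a true key
starts with a count of one and a false key with a count of zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of a key: the branch is true while the count is above zero. */
struct ref_key {
	int enabled;
};

#define REF_KEY_INIT_TRUE	{ .enabled = 1 }
#define REF_KEY_INIT_FALSE	{ .enabled = 0 }

static inline bool ref_key_enabled(const struct ref_key *key)
{
	return key->enabled > 0;
}

/* Stand-in for static_branch_inc(): take a reference. */
static inline void ref_key_inc(struct ref_key *key)
{
	key->enabled++;
}

/* Stand-in for static_branch_dec(): drop a reference. */
static inline void ref_key_dec(struct ref_key *key)
{
	assert(key->enabled > 0);	/* unbalanced dec is a caller bug */
	key->enabled--;
}
```

With several users, the branch only becomes false again once the last
reference is dropped: two increments followed by one decrement leave the
model key enabled.
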
| 
 | |
| 4) Architecture level code patching interface, 'jump labels'
 | |
| 
 | |
| 
 | |
| There are a few functions and macros that architectures must implement in order
 | |
| to take advantage of this optimization. If there is no architecture support, we
 | |
| simply fall back to a traditional load, test, and jump sequence.
 | |
| 
 | |
| * select HAVE_ARCH_JUMP_LABEL, see: arch/x86/Kconfig
 | |
| 
 | |
| * #define JUMP_LABEL_NOP_SIZE, see: arch/x86/include/asm/jump_label.h
 | |
| 
 | |
| * __always_inline bool arch_static_branch(struct static_key *key, bool branch), see:
 | |
| 					arch/x86/include/asm/jump_label.h
 | |
| 
 | |
| * __always_inline bool arch_static_branch_jump(struct static_key *key, bool branch),
 | |
| 					see: arch/x86/include/asm/jump_label.h
 | |
| 
 | |
| * void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type),
 | |
| 					see: arch/x86/kernel/jump_label.c
 | |
| 
 | |
| * __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type),
 | |
| 					see: arch/x86/kernel/jump_label.c
 | |
| 
 | |
| 
 | |
| * struct jump_entry, see: arch/x86/include/asm/jump_label.h
 | |
| 
 | |
| 
 | |
| 5) Static keys / jump label analysis, results (x86_64):
 | |
| 
 | |
| 
 | |
| As an example, let's add the following branch to 'getppid()', such that the
 | |
| system call now looks like:
 | |
| 
 | |
| SYSCALL_DEFINE0(getppid)
 | |
| {
 | |
|         int pid;
 | |
| 
 | |
| +       if (static_branch_unlikely(&key))
 | |
| +               printk("I am the true branch\n");
 | |
| 
 | |
|         rcu_read_lock();
 | |
|         pid = task_tgid_vnr(rcu_dereference(current->real_parent));
 | |
|         rcu_read_unlock();
 | |
| 
 | |
|         return pid;
 | |
| }
 | |
| 
 | |
| The resulting instructions with jump labels generated by GCC are:
 | |
| 
 | |
| ffffffff81044290 <sys_getppid>:
 | |
| ffffffff81044290:       55                      push   %rbp
 | |
| ffffffff81044291:       48 89 e5                mov    %rsp,%rbp
 | |
| ffffffff81044294:       e9 00 00 00 00          jmpq   ffffffff81044299 <sys_getppid+0x9>
 | |
| ffffffff81044299:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
 | |
| ffffffff810442a0:       00 00
 | |
| ffffffff810442a2:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
 | |
| ffffffff810442a9:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
 | |
| ffffffff810442b0:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
 | |
| ffffffff810442b7:       e8 f4 d9 00 00          callq  ffffffff81051cb0 <pid_vnr>
 | |
| ffffffff810442bc:       5d                      pop    %rbp
 | |
| ffffffff810442bd:       48 98                   cltq
 | |
| ffffffff810442bf:       c3                      retq
 | |
| ffffffff810442c0:       48 c7 c7 e3 54 98 81    mov    $0xffffffff819854e3,%rdi
 | |
| ffffffff810442c7:       31 c0                   xor    %eax,%eax
 | |
| ffffffff810442c9:       e8 71 13 6d 00          callq  ffffffff8171563f <printk>
 | |
| ffffffff810442ce:       eb c9                   jmp    ffffffff81044299 <sys_getppid+0x9>
 | |
| 
 | |
| Without the jump label optimization it looks like:
 | |
| 
 | |
| ffffffff810441f0 <sys_getppid>:
 | |
| ffffffff810441f0:       8b 05 8a 52 d8 00       mov    0xd8528a(%rip),%eax        # ffffffff81dc9480 <key>
 | |
| ffffffff810441f6:       55                      push   %rbp
 | |
| ffffffff810441f7:       48 89 e5                mov    %rsp,%rbp
 | |
| ffffffff810441fa:       85 c0                   test   %eax,%eax
 | |
| ffffffff810441fc:       75 27                   jne    ffffffff81044225 <sys_getppid+0x35>
 | |
| ffffffff810441fe:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
 | |
| ffffffff81044205:       00 00
 | |
| ffffffff81044207:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
 | |
| ffffffff8104420e:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
 | |
| ffffffff81044215:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
 | |
| ffffffff8104421c:       e8 2f da 00 00          callq  ffffffff81051c50 <pid_vnr>
 | |
| ffffffff81044221:       5d                      pop    %rbp
 | |
| ffffffff81044222:       48 98                   cltq
 | |
| ffffffff81044224:       c3                      retq
 | |
| ffffffff81044225:       48 c7 c7 13 53 98 81    mov    $0xffffffff81985313,%rdi
 | |
| ffffffff8104422c:       31 c0                   xor    %eax,%eax
 | |
| ffffffff8104422e:       e8 60 0f 6d 00          callq  ffffffff81715193 <printk>
 | |
| ffffffff81044233:       eb c9                   jmp    ffffffff810441fe <sys_getppid+0xe>
 | |
| ffffffff81044235:       66 66 2e 0f 1f 84 00    data32 nopw %cs:0x0(%rax,%rax,1)
 | |
| ffffffff8104423c:       00 00 00 00
 | |
| 
 | |
| Thus, the disabled jump label case adds a 'mov', 'test' and 'jne' instruction,
 | |
| whereas the jump label case has just a 'no-op' or 'jmp 0'. (The 'jmp 0' is patched
 | |
| to a 5-byte atomic no-op instruction at boot-time.) Thus, the disabled jump
 | |
| label case adds:
 | |
| 
 | |
| 6 (mov) + 2 (test) + 2 (jne) - 5 (5-byte jmp 0) = 5 additional bytes.
 | |
| 
 | |
| If we then include the padding bytes, the jump label code saves 16 total bytes
 | |
| of instruction memory for this small function. In this case the non-jump label
 | |
| function is 80 bytes long. Thus, we have saved 20% of the instruction
 | |
| footprint. We can in fact improve this even further, since the 5-byte no-op
 | |
| really can be a 2-byte no-op since we can reach the branch with a 2-byte jmp.
 | |
| However, we have not yet implemented optimal no-op sizes (they are currently
 | |
| hard-coded).
 | |
| 
 | |
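The size figures can be re-derived from the addresses in the two listings. The
short program fragment below does that arithmetic; the helper names
(span_bytes(), jump_label_bytes(), plain_bytes(), site_extra_bytes(),
percent_saved()) are ours, and the only inputs are the start and end addresses
shown in the disassembly above.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of the half-open address range [start, end). */
static inline uint64_t span_bytes(uint64_t start, uint64_t end)
{
	assert(end >= start);
	return end - start;
}

/* Jump label build: starts at ...4290, ends with a 2-byte jmp at ...42ce. */
static inline uint64_t jump_label_bytes(void)
{
	return span_bytes(0xffffffff81044290ULL, 0xffffffff810442ceULL + 2);
}

/* Non-jump-label build: starts at ...41f0, last 4 pad bytes at ...423c. */
static inline uint64_t plain_bytes(void)
{
	return span_bytes(0xffffffff810441f0ULL, 0xffffffff8104423cULL + 4);
}

/* Extra bytes at the branch site: mov + test + jne versus one 5-byte jmp. */
static inline unsigned int site_extra_bytes(void)
{
	return 6 + 2 + 2 - 5;
}

/* Saving relative to the non-jump-label function, in whole percent. */
static inline unsigned int percent_saved(void)
{
	uint64_t saved = plain_bytes() - jump_label_bytes();

	return (unsigned int)(100 * saved / plain_bytes());
}
```

This gives 64 bytes for the jump label version against 80 bytes without it,
i.e. the 16 bytes and 20% quoted above.
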
| Since there are a number of static key API uses in the scheduler paths,
 | |
| 'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
 | |
| performance improvement. Testing done on 3.3.0-rc2:
 | |
| 
 | |
| jump label disabled:
 | |
| 
 | |
|  Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
 | |
| 
 | |
|         855.700314 task-clock                #    0.534 CPUs utilized            ( +-  0.11% )
 | |
|            200,003 context-switches          #    0.234 M/sec                    ( +-  0.00% )
 | |
|                  0 CPU-migrations            #    0.000 M/sec                    ( +- 39.58% )
 | |
|                487 page-faults               #    0.001 M/sec                    ( +-  0.02% )
 | |
|      1,474,374,262 cycles                    #    1.723 GHz                      ( +-  0.17% )
 | |
|    <not supported> stalled-cycles-frontend
 | |
|    <not supported> stalled-cycles-backend
 | |
|      1,178,049,567 instructions              #    0.80  insns per cycle          ( +-  0.06% )
 | |
|        208,368,926 branches                  #  243.507 M/sec                    ( +-  0.06% )
 | |
|          5,569,188 branch-misses             #    2.67% of all branches          ( +-  0.54% )
 | |
| 
 | |
|        1.601607384 seconds time elapsed                                          ( +-  0.07% )
 | |
| 
 | |
| jump label enabled:
 | |
| 
 | |
|  Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
 | |
| 
 | |
|         841.043185 task-clock                #    0.533 CPUs utilized            ( +-  0.12% )
 | |
|            200,004 context-switches          #    0.238 M/sec                    ( +-  0.00% )
 | |
|                  0 CPU-migrations            #    0.000 M/sec                    ( +- 40.87% )
 | |
|                487 page-faults               #    0.001 M/sec                    ( +-  0.05% )
 | |
|      1,432,559,428 cycles                    #    1.703 GHz                      ( +-  0.18% )
 | |
|    <not supported> stalled-cycles-frontend
 | |
|    <not supported> stalled-cycles-backend
 | |
|      1,175,363,994 instructions              #    0.82  insns per cycle          ( +-  0.04% )
 | |
|        206,859,359 branches                  #  245.956 M/sec                    ( +-  0.04% )
 | |
|          4,884,119 branch-misses             #    2.36% of all branches          ( +-  0.85% )
 | |
| 
 | |
|        1.579384366 seconds time elapsed
 | |
| 
 | |
| The percentage of saved branches is 0.7%, and we've saved 12% on
 | |
| 'branch-misses'. This is where we would expect to get the most savings, since
 | |
| this optimization is about reducing the number of branches. In addition, we've
 | |
| saved 0.2% on instructions, 2.8% on cycles and 1.4% on elapsed time.
 |
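
The percentages quoted above follow directly from the two sets of counters.
As a small check, the helper below (saved_percent(), a name of our own
choosing) computes the relative saving between a "jump label disabled" value
and the matching "jump label enabled" value.

```c
#include <assert.h>

/* Relative saving, in percent, going from 'before' to 'after'. */
static inline double saved_percent(double before, double after)
{
	assert(before > 0.0);
	return 100.0 * (before - after) / before;
}
```

For example, saved_percent(5569188, 4884119) is the branch-misses saving and
saved_percent(1474374262, 1432559428) is the cycles saving, using the counter
values from the two runs shown above.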