mirror of
				git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
				synced 2025-09-04 20:19:47 +08:00 
			
		
		
		
	 f2094107ac
			
		
	
	
		f2094107ac
		
	
	
	
	
		
			
			doc.2017.04.12a: Documentation updates fixes.2017.04.19a: Miscellaneous fixes srcu.2017.04.21a: Parallelize SRCU callback handling
		
			
				
	
	
		
			1436 lines
		
	
	
		
			57 KiB
		
	
	
	
		
			HTML
		
	
	
	
	
	
			
		
		
	
	
			1436 lines
		
	
	
		
			57 KiB
		
	
	
	
		
			HTML
		
	
	
	
	
	
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 | |
|         "http://www.w3.org/TR/html4/loose.dtd">
 | |
|         <html>
 | |
|         <head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title>
 | |
|         <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
 | |
| 
 | |
|            <p>December 18, 2016</p>
 | |
|            <p>This article was contributed by Paul E. McKenney</p>
 | |
| 
 | |
| <h3>Introduction</h3>
 | |
| 
 | |
| This document describes RCU's major data structures and their relationship
 | |
| to each other.
 | |
| 
 | |
| <ol>
 | |
| <li>	<a href="#Data-Structure Relationships">
 | |
| 	Data-Structure Relationships</a>
 | |
| <li>	<a href="#The rcu_state Structure">
 | |
| 	The <tt>rcu_state</tt> Structure</a>
 | |
| <li>	<a href="#The rcu_node Structure">
 | |
| 	The <tt>rcu_node</tt> Structure</a>
 | |
| <li>	<a href="#The rcu_segcblist Structure">
 | |
| 	The <tt>rcu_segcblist</tt> Structure</a>
 | |
| <li>	<a href="#The rcu_data Structure">
 | |
| 	The <tt>rcu_data</tt> Structure</a>
 | |
| <li>	<a href="#The rcu_dynticks Structure">
 | |
| 	The <tt>rcu_dynticks</tt> Structure</a>
 | |
| <li>	<a href="#The rcu_head Structure">
 | |
| 	The <tt>rcu_head</tt> Structure</a>
 | |
| <li>	<a href="#RCU-Specific Fields in the task_struct Structure">
 | |
| 	RCU-Specific Fields in the <tt>task_struct</tt> Structure</a>
 | |
| <li>	<a href="#Accessor Functions">
 | |
| 	Accessor Functions</a>
 | |
| </ol>
 | |
| 
 | |
| <h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3>
 | |
| 
 | |
| <p>RCU is for all intents and purposes a large state machine, and its
 | |
| data structures maintain the state in such a way as to allow RCU readers
 | |
| to execute extremely quickly, while also processing the RCU grace periods
 | |
| requested by updaters in an efficient and extremely scalable fashion.
 | |
| The efficiency and scalability of RCU updaters is provided primarily
 | |
| by a combining tree, as shown below:
 | |
| 
 | |
| </p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%">
 | |
| 
 | |
| </p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure
 | |
| containing a tree of <tt>rcu_node</tt> structures.
 | |
| Each leaf node of the <tt>rcu_node</tt> tree has up to 16
 | |
| <tt>rcu_data</tt> structures associated with it, so that there
 | |
| are <tt>NR_CPUS</tt> number of <tt>rcu_data</tt> structures,
 | |
| one for each possible CPU.
 | |
| This structure is adjusted at boot time, if needed, to handle the
 | |
| common case where <tt>nr_cpu_ids</tt> is much less than
 | |
| <tt>NR_CPUs</tt>.
 | |
| For example, a number of Linux distributions set <tt>NR_CPUs=4096</tt>,
 | |
| which results in a three-level <tt>rcu_node</tt> tree.
 | |
| If the actual hardware has only 16 CPUs, RCU will adjust itself
 | |
| at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node.
 | |
| 
 | |
| </p><p>The purpose of this combining tree is to allow per-CPU events
 | |
| such as quiescent states, dyntick-idle transitions,
 | |
| and CPU hotplug operations to be processed efficiently
 | |
| and scalably.
 | |
| Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures,
 | |
| and other events are recorded by the leaf-level <tt>rcu_node</tt>
 | |
| structures.
 | |
| All of these events are combined at each level of the tree until finally
 | |
| grace periods are completed at the tree's root <tt>rcu_node</tt>
 | |
| structure.
 | |
| A grace period can be completed at the root once every CPU
 | |
| (or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task)
 | |
| has passed through a quiescent state.
 | |
| Once a grace period has completed, record of that fact is propagated
 | |
| back down the tree.
 | |
| 
 | |
| </p><p>As can be seen from the diagram, on a 64-bit system
 | |
| a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
 | |
| of 64 at the root and a fanout of 16 at the leaves.
 | |
| 
 | |
| <table>
 | |
| <tr><th> </th></tr>
 | |
| <tr><th align="left">Quick Quiz:</th></tr>
 | |
| <tr><td>
 | |
| 	Why isn't the fanout at the leaves also 64?
 | |
| </td></tr>
 | |
| <tr><th align="left">Answer:</th></tr>
 | |
| <tr><td bgcolor="#ffffff"><font color="ffffff">
 | |
| 	Because there are more types of events that affect the leaf-level
 | |
| 	<tt>rcu_node</tt> structures than further up the tree.
 | |
| 	Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of
 | |
| 	64, the contention on these structures' <tt>->structures</tt>
 | |
| 	becomes excessive.
 | |
| 	Experimentation on a wide variety of systems has shown that a fanout
 | |
| 	of 16 works well for the leaves of the <tt>rcu_node</tt> tree.
 | |
| 	</font>
 | |
| 
 | |
| 	<p><font color="ffffff">Of course, further experience with
 | |
| 	systems having hundreds or thousands of CPUs may demonstrate
 | |
| 	that the fanout for the non-leaf <tt>rcu_node</tt> structures
 | |
| 	must also be reduced.
 | |
| 	Such reduction can be easily carried out when and if it proves
 | |
| 	necessary.
 | |
| 	In the meantime, if you are using such a system and running into
 | |
| 	contention problems on the non-leaf <tt>rcu_node</tt> structures,
 | |
| 	you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration
 | |
| 	parameter to reduce the non-leaf fanout as needed.
 | |
| 	</font>
 | |
| 
 | |
| 	<p><font color="ffffff">Kernels built for systems with
 | |
| 	strong NUMA characteristics might also need to adjust
 | |
| 	<tt>CONFIG_RCU_FANOUT</tt> so that the domains of the
 | |
| 	<tt>rcu_node</tt> structures align with hardware boundaries.
 | |
| 	However, there has thus far been no need for this.
 | |
| </font></td></tr>
 | |
| <tr><td> </td></tr>
 | |
| </table>
 | |
| 
 | |
| <p>If your system has more than 1,024 CPUs (or more than 512 CPUs on
 | |
| a 32-bit system), then RCU will automatically add more levels to the
 | |
| tree.
 | |
| For example, if you are crazy enough to build a 64-bit system with 65,536
 | |
| CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows:
 | |
| 
 | |
| </p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%">
 | |
| 
 | |
| </p><p>RCU currently permits up to a four-level tree, which on a 64-bit system
 | |
| accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
 | |
| 32-bit systems.
 | |
| On the other hand, you can set <tt>CONFIG_RCU_FANOUT</tt> to be
 | |
| as small as 2 if you wish, which would permit only 16 CPUs, which
 | |
| is useful for testing.
 | |
| 
 | |
| </p><p>This multi-level combining tree allows us to get most of the
 | |
| performance and scalability
 | |
| benefits of partitioning, even though RCU grace-period detection is
 | |
| inherently a global operation.
 | |
| The trick here is that only the last CPU to report a quiescent state
 | |
| into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt>
 | |
| structure at the next level up the tree.
 | |
| This means that at the leaf-level <tt>rcu_node</tt> structure, only
 | |
| one access out of sixteen will progress up the tree.
 | |
| For the internal <tt>rcu_node</tt> structures, the situation is even
 | |
| more extreme:  Only one access out of sixty-four will progress up
 | |
| the tree.
 | |
| Because the vast majority of the CPUs do not progress up the tree,
 | |
| the lock contention remains roughly constant up the tree.
 | |
| No matter how many CPUs there are in the system, at most 64 quiescent-state
 | |
| reports per grace period will progress all the way to the root
 | |
| <tt>rcu_node</tt> structure, thus ensuring that the lock contention
 | |
| on that root <tt>rcu_node</tt> structure remains acceptably low.
 | |
| 
 | |
| </p><p>In effect, the combining tree acts like a big shock absorber,
 | |
| keeping lock contention under control at all tree levels regardless
 | |
| of the level of loading on the system.
 | |
| 
 | |
| </p><p>The Linux kernel actually supports multiple flavors of RCU
 | |
| running concurrently, so RCU builds separate data structures for each
 | |
| flavor.
 | |
| For example, for <tt>CONFIG_TREE_RCU=y</tt> kernels, RCU provides
 | |
| rcu_sched and rcu_bh, as shown below:
 | |
| 
 | |
| </p><p><img src="BigTreeClassicRCUBH.svg" alt="BigTreeClassicRCUBH.svg" width="33%">
 | |
| 
 | |
| </p><p>Energy efficiency is increasingly important, and for that
 | |
| reason the Linux kernel provides <tt>CONFIG_NO_HZ_IDLE</tt>, which
 | |
| turns off the scheduling-clock interrupts on idle CPUs, which in
 | |
| turn allows those CPUs to attain deeper sleep states and to consume
 | |
| less energy.
 | |
| CPUs whose scheduling-clock interrupts have been turned off are
 | |
| said to be in <i>dyntick-idle mode</i>.
 | |
| RCU must handle dyntick-idle CPUs specially
 | |
| because RCU would otherwise wake up each CPU on every grace period,
 | |
| which would defeat the whole purpose of <tt>CONFIG_NO_HZ_IDLE</tt>.
 | |
| RCU uses the <tt>rcu_dynticks</tt> structure to track
 | |
| which CPUs are in dyntick idle mode, as shown below:
 | |
| 
 | |
| </p><p><img src="BigTreeClassicRCUBHdyntick.svg" alt="BigTreeClassicRCUBHdyntick.svg" width="33%">
 | |
| 
 | |
| </p><p>However, if a CPU is in dyntick-idle mode, it is in that mode
 | |
| for all flavors of RCU.
 | |
| Therefore, a single <tt>rcu_dynticks</tt> structure is allocated per
 | |
| CPU, and all of a given CPU's <tt>rcu_data</tt> structures share
 | |
| that <tt>rcu_dynticks</tt>, as shown in the figure.
 | |
| 
 | |
| </p><p>Kernels built with <tt>CONFIG_PREEMPT_RCU</tt> support
 | |
| rcu_preempt in addition to rcu_sched and rcu_bh, as shown below:
 | |
| 
 | |
| </p><p><img src="BigTreePreemptRCUBHdyntick.svg" alt="BigTreePreemptRCUBHdyntick.svg" width="35%">
 | |
| 
 | |
| </p><p>RCU updaters wait for normal grace periods by registering
 | |
| RCU callbacks, either directly via <tt>call_rcu()</tt> and
 | |
| friends (namely <tt>call_rcu_bh()</tt> and <tt>call_rcu_sched()</tt>),
 | |
| there being a separate interface per flavor of RCU)
 | |
| or indirectly via <tt>synchronize_rcu()</tt> and friends.
 | |
| RCU callbacks are represented by <tt>rcu_head</tt> structures,
 | |
| which are queued on <tt>rcu_data</tt> structures while they are
 | |
| waiting for a grace period to elapse, as shown in the following figure:
 | |
| 
 | |
| </p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%">
 | |
| 
 | |
| </p><p>This figure shows how <tt>TREE_RCU</tt>'s and
 | |
| <tt>PREEMPT_RCU</tt>'s major data structures are related.
 | |
| Lesser data structures will be introduced with the algorithms that
 | |
| make use of them.
 | |
| 
 | |
| </p><p>Note that each of the data structures in the above figure has
 | |
| its own synchronization:
 | |
| 
 | |
| <p><ol>
 | |
| <li>	Each <tt>rcu_state</tt> structures has a lock and a mutex,
 | |
| 	and some fields are protected by the corresponding root
 | |
| 	<tt>rcu_node</tt> structure's lock.
 | |
| <li>	Each <tt>rcu_node</tt> structure has a spinlock.
 | |
| <li>	The fields in <tt>rcu_data</tt> are private to the corresponding
 | |
| 	CPU, although a few can be read and written by other CPUs.
 | |
| <li>	Similarly, the fields in <tt>rcu_dynticks</tt> are private
 | |
| 	to the corresponding CPU, although a few can be read by
 | |
| 	other CPUs.
 | |
| </ol>
 | |
| 
 | |
| <p>It is important to note that different data structures can have
 | |
| very different ideas about the state of RCU at any given time.
 | |
| For but one example, awareness of the start or end of a given RCU
 | |
| grace period propagates slowly through the data structures.
 | |
| This slow propagation is absolutely necessary for RCU to have good
 | |
| read-side performance.
 | |
| If this balkanized implementation seems foreign to you, one useful
 | |
| trick is to consider each instance of these data structures to be
 | |
| a different person, each having the usual slightly different
 | |
| view of reality.
 | |
| 
 | |
| </p><p>The general role of each of these data structures is as
 | |
| follows:
 | |
| 
 | |
| </p><ol>
 | |
| <li>	<tt>rcu_state</tt>:
 | |
| 	This structure forms the interconnection between the
 | |
| 	<tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
 | |
| 	tracks grace periods, serves as short-term repository
 | |
| 	for callbacks orphaned by CPU-hotplug events,
 | |
| 	maintains <tt>rcu_barrier()</tt> state,
 | |
| 	tracks expedited grace-period state,
 | |
| 	and maintains state used to force quiescent states when
 | |
| 	grace periods extend too long,
 | |
| <li>	<tt>rcu_node</tt>: This structure forms the combining
 | |
| 	tree that propagates quiescent-state
 | |
| 	information from the leaves to the root, and also propagates
 | |
| 	grace-period information from the root to the leaves.
 | |
| 	It provides local copies of the grace-period state in order
 | |
| 	to allow this information to be accessed in a synchronized
 | |
| 	manner without suffering the scalability limitations that
 | |
| 	would otherwise be imposed by global locking.
 | |
| 	In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists
 | |
| 	of tasks that have blocked while in their current
 | |
| 	RCU read-side critical section.
 | |
| 	In <tt>CONFIG_PREEMPT_RCU</tt> with
 | |
| 	<tt>CONFIG_RCU_BOOST</tt>, it manages the
 | |
| 	per-<tt>rcu_node</tt> priority-boosting
 | |
| 	kernel threads (kthreads) and state.
 | |
| 	Finally, it records CPU-hotplug state in order to determine
 | |
| 	which CPUs should be ignored during a given grace period.
 | |
| <li>	<tt>rcu_data</tt>: This per-CPU structure is the
 | |
| 	focus of quiescent-state detection and RCU callback queuing.
 | |
| 	It also tracks its relationship to the corresponding leaf
 | |
| 	<tt>rcu_node</tt> structure to allow more-efficient
 | |
| 	propagation of quiescent states up the <tt>rcu_node</tt>
 | |
| 	combining tree.
 | |
| 	Like the <tt>rcu_node</tt> structure, it provides a local
 | |
| 	copy of the grace-period information to allow for-free
 | |
| 	synchronized
 | |
| 	access to this information from the corresponding CPU.
 | |
| 	Finally, this structure records past dyntick-idle state
 | |
| 	for the corresponding CPU and also tracks statistics.
 | |
| <li>	<tt>rcu_dynticks</tt>:
 | |
| 	This per-CPU structure tracks the current dyntick-idle
 | |
| 	state for the corresponding CPU.
 | |
| 	Unlike the other three structures, the <tt>rcu_dynticks</tt>
 | |
| 	structure is not replicated per RCU flavor.
 | |
| <li>	<tt>rcu_head</tt>:
 | |
| 	This structure represents RCU callbacks, and is the
 | |
| 	only structure allocated and managed by RCU users.
 | |
| 	The <tt>rcu_head</tt> structure is normally embedded
 | |
| 	within the RCU-protected data structure.
 | |
| </ol>
 | |
| 
 | |
| <p>If all you wanted from this article was a general notion of how
 | |
| RCU's data structures are related, you are done.
 | |
| Otherwise, each of the following sections give more details on
 | |
| the <tt>rcu_state</tt>, <tt>rcu_node</tt>, <tt>rcu_data</tt>,
 | |
| and <tt>rcu_dynticks</tt> data structures.
 | |
| 
 | |
| <h3><a name="The rcu_state Structure">
 | |
| The <tt>rcu_state</tt> Structure</a></h3>
 | |
| 
 | |
| <p>The <tt>rcu_state</tt> structure is the base structure that
 | |
| represents a flavor of RCU.
 | |
| This structure forms the interconnection between the
 | |
| <tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
 | |
| tracks grace periods, contains the lock used to
 | |
| synchronize with CPU-hotplug events,
 | |
| and maintains state used to force quiescent states when
 | |
| grace periods extend too long,
 | |
| 
 | |
| </p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed,
 | |
| singly and in groups, in the following sections.
 | |
| The more specialized fields are covered in the discussion of their
 | |
| use.
 | |
| 
 | |
| <h5>Relationship to rcu_node and rcu_data Structures</h5>
 | |
| 
 | |
| This portion of the <tt>rcu_state</tt> structure is declared
 | |
| as follows:
 | |
| 
 | |
| <pre>
 | |
|   1   struct rcu_node node[NUM_RCU_NODES];
 | |
|   2   struct rcu_node *level[NUM_RCU_LVLS + 1];
 | |
|   3   struct rcu_data __percpu *rda;
 | |
| </pre>
 | |
| 
 | |
| <table>
 | |
| <tr><th> </th></tr>
 | |
| <tr><th align="left">Quick Quiz:</th></tr>
 | |
| <tr><td>
 | |
| 	Wait a minute!
 | |
| 	You said that the <tt>rcu_node</tt> structures formed a tree,
 | |
| 	but they are declared as a flat array!
 | |
| 	What gives?
 | |
| </td></tr>
 | |
| <tr><th align="left">Answer:</th></tr>
 | |
| <tr><td bgcolor="#ffffff"><font color="ffffff">
 | |
| 	The tree is laid out in the array.
 | |
| 	The first node In the array is the head, the next set of nodes in the
 | |
| 	array are children of the head node, and so on until the last set of
 | |
| 	nodes in the array are the leaves.
 | |
| 	</font>
 | |
| 
 | |
| 	<p><font color="ffffff">See the following diagrams to see how
 | |
| 	this works.
 | |
| </font></td></tr>
 | |
| <tr><td> </td></tr>
 | |
| </table>
 | |
| 
 | |
| <p>The <tt>rcu_node</tt> tree is embedded into the
 | |
| <tt>->node[]</tt> array as shown in the following figure:
 | |
| 
 | |
| </p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%">
 | |
| 
 | |
| </p><p>One interesting consequence of this mapping is that a
 | |
| breadth-first traversal of the tree is implemented as a simple
 | |
| linear scan of the array, which is in fact what the
 | |
| <tt>rcu_for_each_node_breadth_first()</tt> macro does.
 | |
| This macro is used at the beginning and ends of grace periods.
 | |
| 
 | |
| </p><p>Each entry of the <tt>->level</tt> array references
 | |
| the first <tt>rcu_node</tt> structure on the corresponding level
 | |
| of the tree, for example, as shown below:
 | |
| 
 | |
| </p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%">
 | |
| 
 | |
| </p><p>The zero<sup>th</sup> element of the array references the root
 | |
| <tt>rcu_node</tt> structure, the first element references the
 | |
| first child of the root <tt>rcu_node</tt>, and finally the second
 | |
| element references the first leaf <tt>rcu_node</tt> structure.
 | |
| 
 | |
| </p><p>For whatever it is worth, if you draw the tree to be tree-shaped
 | |
| rather than array-shaped, it is easy to draw a planar representation:
 | |
| 
 | |
| </p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%">
 | |
| 
 | |
| </p><p>Finally, the <tt>->rda</tt> field references a per-CPU
 | |
| pointer to the corresponding CPU's <tt>rcu_data</tt> structure.
 | |
| 
 | |
| </p><p>All of these fields are constant once initialization is complete,
 | |
| and therefore need no protection.
 | |
| 
 | |
| <h5>Grace-Period Tracking</h5>
 | |
| 
 | |
| <p>This portion of the <tt>rcu_state</tt> structure is declared
 | |
| as follows:
 | |
| 
 | |
| <pre>
 | |
|   1   unsigned long gpnum;
 | |
|   2   unsigned long completed;
 | |
| </pre>
 | |
| 
 | |
| <p>RCU grace periods are numbered, and
 | |
| the <tt>->gpnum</tt> field contains the number of the grace
 | |
| period that started most recently.
 | |
| The <tt>->completed</tt> field contains the number of the
 | |
| grace period that completed most recently.
 | |
| If the two fields are equal, the RCU grace period that most recently
 | |
| started has already completed, and therefore the corresponding
 | |
| flavor of RCU is idle.
 | |
| If <tt>->gpnum</tt> is one greater than <tt>->completed</tt>,
 | |
| then <tt>->gpnum</tt> gives the number of the current RCU
 | |
| grace period, which has not yet completed.
 | |
| Any other combination of values indicates that something is broken.
 | |
| These two fields are protected by the root <tt>rcu_node</tt>'s
 | |
| <tt>->lock</tt> field.
 | |
| 
 | |
| </p><p>There are <tt>->gpnum</tt> and <tt>->completed</tt> fields
 | |
| in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures
 | |
| as well.
 | |
| The fields in the <tt>rcu_state</tt> structure represent the
 | |
| most current values, and those of the other structures are compared
 | |
| in order to detect the start of a new grace period in a distributed
 | |
| fashion.
 | |
| The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt>
 | |
| (down the tree from the root to the leaves) to <tt>rcu_data</tt>.
 | |
| 
 | |
| <h5>Miscellaneous</h5>
 | |
| 
 | |
| <p>This portion of the <tt>rcu_state</tt> structure is declared
 | |
| as follows:
 | |
| 
 | |
| <pre>
 | |
|   1   unsigned long gp_max;
 | |
|   2   char abbr;
 | |
|   3   char *name;
 | |
| </pre>
 | |
| 
 | |
| <p>The <tt>->gp_max</tt> field tracks the duration of the longest
 | |
| grace period in jiffies.
 | |
| It is protected by the root <tt>rcu_node</tt>'s <tt>->lock</tt>.
 | |
| 
 | |
| <p>The <tt>->name</tt> field points to the name of the RCU flavor
 | |
| (for example, “rcu_sched”), and is constant.
 | |
| The <tt>->abbr</tt> field contains a one-character abbreviation,
 | |
| for example, “s” for RCU-sched.
 | |
| 
 | |
| <h3><a name="The rcu_node Structure">
 | |
| The <tt>rcu_node</tt> Structure</a></h3>
 | |
| 
 | |
| <p>The <tt>rcu_node</tt> structures form the combining
 | |
| tree that propagates quiescent-state
 | |
| information from the leaves to the root and also that propagates
 | |
| grace-period information from the root down to the leaves.
 | |
| They provides local copies of the grace-period state in order
 | |
| to allow this information to be accessed in a synchronized
 | |
| manner without suffering the scalability limitations that
 | |
| would otherwise be imposed by global locking.
 | |
| In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists
 | |
| of tasks that have blocked while in their current
 | |
| RCU read-side critical section.
 | |
| In <tt>CONFIG_PREEMPT_RCU</tt> with
 | |
| <tt>CONFIG_RCU_BOOST</tt>, they manage the
 | |
| per-<tt>rcu_node</tt> priority-boosting
 | |
| kernel threads (kthreads) and state.
 | |
| Finally, they record CPU-hotplug state in order to determine
 | |
| which CPUs should be ignored during a given grace period.
 | |
| 
 | |
| </p><p>The <tt>rcu_node</tt> structure's fields are discussed,
 | |
| singly and in groups, in the following sections.
 | |
| 
 | |
| <h5>Connection to Combining Tree</h5>
 | |
| 
 | |
| <p>This portion of the <tt>rcu_node</tt> structure is declared
 | |
| as follows:
 | |
| 
 | |
| <pre>
 | |
|   1   struct rcu_node *parent;
 | |
|   2   u8 level;
 | |
|   3   u8 grpnum;
 | |
|   4   unsigned long grpmask;
 | |
|   5   int grplo;
 | |
|   6   int grphi;
 | |
| </pre>
 | |
| 
 | |
| <p>The <tt>->parent</tt> pointer references the <tt>rcu_node</tt>
 | |
| one level up in the tree, and is <tt>NULL</tt> for the root
 | |
| <tt>rcu_node</tt>.
 | |
| The RCU implementation makes heavy use of this field to push quiescent
 | |
| states up the tree.
 | |
| The <tt>->level</tt> field gives the level in the tree, with
 | |
| the root being at level zero, its children at level one, and so on.
 | |
| The <tt>->grpnum</tt> field gives this node's position within
 | |
| the children of its parent, so this number can range between 0 and 31
 | |
| on 32-bit systems and between 0 and 63 on 64-bit systems.
 | |
| The <tt>->level</tt> and <tt>->grpnum</tt> fields are
 | |
| used only during initialization and for tracing.
 | |
| The <tt>->grpmask</tt> field is the bitmask counterpart of
 | |
| <tt>->grpnum</tt>, and therefore always has exactly one bit set.
 | |
| This mask is used to clear the bit corresponding to this <tt>rcu_node</tt>
 | |
| structure in its parent's bitmasks, which are described later.
 | |
| Finally, the <tt>->grplo</tt> and <tt>->grphi</tt> fields
 | |
| contain the lowest and highest numbered CPU served by this
 | |
| <tt>rcu_node</tt> structure, respectively.
 | |
| 
 | |
| </p><p>All of these fields are constant, and thus do not require any
 | |
| synchronization.
 | |
| 
 | |
| <h5>Synchronization</h5>
 | |
| 
 | |
| <p>This field of the <tt>rcu_node</tt> structure is declared
 | |
| as follows:
 | |
| 
 | |
| <pre>
 | |
|   1   raw_spinlock_t lock;
 | |
| </pre>
 | |
| 
 | |
| <p>This field is used to protect the remaining fields in this structure,
 | |
| unless otherwise stated.
 | |
| That said, all of the fields in this structure can be accessed without
 | |
| locking for tracing purposes.
 | |
| Yes, this can result in confusing traces, but better some tracing confusion
 | |
| than to be heisenbugged out of existence.
 | |
| 
 | |
| <h5>Grace-Period Tracking</h5>
 | |
| 
 | |
| <p>This portion of the <tt>rcu_node</tt> structure is declared
 | |
| as follows:
 | |
| 
 | |
| <pre>
 | |
|   1   unsigned long gpnum;
 | |
|   2   unsigned long completed;
 | |
| </pre>
 | |
| 
 | |
| <p>These fields are the counterparts of the fields of the same name in
 | |
| the <tt>rcu_state</tt> structure.
 | |
| They each may lag up to one behind their <tt>rcu_state</tt>
 | |
| counterparts.
 | |
| If a given <tt>rcu_node</tt> structure's <tt>->gpnum</tt> and
 | |
| <tt>->complete</tt> fields are equal, then this <tt>rcu_node</tt>
 | |
| structure believes that RCU is idle.
 | |
| Otherwise, as with the <tt>rcu_state</tt> structure,
 | |
| the <tt>->gpnum</tt> field will be one greater than the
 | |
| <tt>->complete</tt> fields, with <tt>->gpnum</tt>
 | |
| indicating which grace period this <tt>rcu_node</tt> believes
 | |
| is still being waited for.
 | |
| 
 | |
| </p><p>The <tt>>gpnum</tt> field of each <tt>rcu_node</tt>
 | |
| structure is updated at the beginning
 | |
| of each grace period, and the <tt>->completed</tt> fields are
 | |
| updated at the end of each grace period.
 | |
| 
 | |
| <h5>Quiescent-State Tracking</h5>
 | |
| 
 | |
| <p>These fields manage the propagation of quiescent states up the
 | |
| combining tree.
 | |
| 
 | |
| </p><p>This portion of the <tt>rcu_node</tt> structure has fields
 | |
| as follows:
 | |
| 
 | |
| <pre>
 | |
|   1   unsigned long qsmask;
 | |
|   2   unsigned long expmask;
 | |
|   3   unsigned long qsmaskinit;
 | |
|   4   unsigned long expmaskinit;
 | |
| </pre>
 | |
| 
 | |
| <p>The <tt>->qsmask</tt> field tracks which of this
 | |
| <tt>rcu_node</tt> structure's children still need to report
 | |
| quiescent states for the current normal grace period.
 | |
| Such children will have a value of 1 in their corresponding bit.
 | |
| Note that the leaf <tt>rcu_node</tt> structures should be
 | |
| thought of as having <tt>rcu_data</tt> structures as their
 | |
| children.
 | |
| Similarly, the <tt>->expmask</tt> field tracks which
 | |
| of this <tt>rcu_node</tt> structure's children still need to report
 | |
| quiescent states for the current expedited grace period.
 | |
| An expedited grace period has
 | |
| the same conceptual properties as a normal grace period, but the
 | |
| expedited implementation accepts extreme CPU overhead to obtain
 | |
| much lower grace-period latency, for example, consuming a few
 | |
| tens of microseconds worth of CPU time to reduce grace-period
 | |
| duration from milliseconds to tens of microseconds.
 | |
| The <tt>->qsmaskinit</tt> field tracks which of this
 | |
| <tt>rcu_node</tt> structure's children cover for at least
 | |
| one online CPU.
 | |
| This mask is used to initialize <tt>->qsmask</tt>,
 | |
| and <tt>->expmaskinit</tt> is used to initialize
 | |
| <tt>->expmask</tt> and the beginning of the
 | |
| normal and expedited grace periods, respectively.
 | |
| 
 | |
| <table>
 | |
| <tr><th> </th></tr>
 | |
| <tr><th align="left">Quick Quiz:</th></tr>
 | |
| <tr><td>
 | |
| 	Why are these bitmasks protected by locking?
 | |
| 	Come on, haven't you heard of atomic instructions???
 | |
| </td></tr>
 | |
| <tr><th align="left">Answer:</th></tr>
 | |
| <tr><td bgcolor="#ffffff"><font color="ffffff">
 | |
| 	Lockless grace-period computation!  Such a tantalizing possibility!
 | |
| 	</font>
 | |
| 
 | |
| 	<p><font color="ffffff">But consider the following sequence of events:
 | |
| 	</font>
 | |
| 
 | |
| 	<ol>
 | |
| 	<li>	<font color="ffffff">CPU 0 has been in dyntick-idle
 | |
| 		mode for quite some time.
 | |
| 		When it wakes up, it notices that the current RCU
 | |
| 		grace period needs it to report in, so it sets a
 | |
| 		flag where the scheduling clock interrupt will find it.
 | |
| 		</font><p>
 | |
| 	<li>	<font color="ffffff">Meanwhile, CPU 1 is running
 | |
| 		<tt>force_quiescent_state()</tt>,
 | |
| 		and notices that CPU 0 has been in dyntick idle mode,
 | |
| 		which qualifies as an extended quiescent state.
 | |
| 		</font><p>
 | |
| 	<li>	<font color="ffffff">CPU 0's scheduling clock
 | |
| 		interrupt fires in the
 | |
| 		middle of an RCU read-side critical section, and notices
 | |
| 		that the RCU core needs something, so commences RCU softirq
 | |
| 		processing.
 | |
| 		</font>
 | |
| 		<p>
 | |
| 	<li>	<font color="ffffff">CPU 0's softirq handler
 | |
| 		executes and is just about ready
 | |
| 		to report its quiescent state up the <tt>rcu_node</tt>
 | |
| 		tree.
 | |
| 		</font><p>
 | |
| 	<li>	<font color="ffffff">But CPU 1 beats it to the punch,
 | |
| 		completing the current
 | |
| 		grace period and starting a new one.
 | |
| 		</font><p>
 | |
| 	<li>	<font color="ffffff">CPU 0 now reports its quiescent
 | |
| 		state for the wrong
 | |
| 		grace period.
 | |
| 		That grace period might now end before the RCU read-side
 | |
| 		critical section.
 | |
| 		If that happens, disaster will ensue.
 | |
| 		</font>
 | |
| 	</ol>
 | |
| 
 | |
| 	<p><font color="ffffff">So the locking is absolutely required in
 | |
| 	order to coordinate
 | |
| 	clearing of the bits with the grace-period numbers in
 | |
| 	<tt>->gpnum</tt> and <tt>->completed</tt>.
 | |
| </font></td></tr>
 | |
| <tr><td> </td></tr>
 | |
| </table>
 | |
| 
 | |
| <h5>Blocked-Task Management</h5>
 | |
| 
 | |
| <p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the
 | |
| midst of their RCU read-side critical sections, and these tasks
 | |
| must be tracked explicitly.
 | |
| The details of exactly why and how they are tracked will be covered
 | |
| in a separate article on RCU read-side processing.
 | |
| For now, it is enough to know that the <tt>rcu_node</tt>
 | |
| structure tracks them.
 | |
| 
 | |
| <pre>
 | |
|   1   struct list_head blkd_tasks;
 | |
|   2   struct list_head *gp_tasks;
 | |
|   3   struct list_head *exp_tasks;
 | |
|   4   bool wait_blkd_tasks;
 | |
| </pre>
 | |
| 
 | |
| <p>The <tt>->blkd_tasks</tt> field is a list header for
 | |
| the list of blocked and preempted tasks.
 | |
| As tasks undergo context switches within RCU read-side critical
 | |
| sections, their <tt>task_struct</tt> structures are enqueued
 | |
| (via the <tt>task_struct</tt>'s <tt>->rcu_node_entry</tt>
 | |
| field) onto the head of the <tt>->blkd_tasks</tt> list for the
 | |
| leaf <tt>rcu_node</tt> structure corresponding to the CPU
 | |
| on which the outgoing context switch executed.
 | |
| As these tasks later exit their RCU read-side critical sections,
 | |
| they remove themselves from the list.
 | |
| This list is therefore in reverse time order, so that if one of the tasks
 | |
| is blocking the current grace period, all subsequent tasks must
 | |
| also be blocking that same grace period.
 | |
| Therefore, a single pointer into this list suffices to track
 | |
| all tasks blocking a given grace period.
 | |
| That pointer is stored in <tt>->gp_tasks</tt> for normal
 | |
| grace periods and in <tt>->exp_tasks</tt> for expedited
 | |
| grace periods.
 | |
| These last two fields are <tt>NULL</tt> if either there is
 | |
| no grace period in flight or if there are no blocked tasks
 | |
| preventing that grace period from completing.
 | |
| If either of these two pointers is referencing a task that
 | |
| removes itself from the <tt>->blkd_tasks</tt> list,
 | |
| then that task must advance the pointer to the next task on
 | |
| the list, or set the pointer to <tt>NULL</tt> if there
 | |
| are no subsequent tasks on the list.
 | |
| 
 | |
| </p><p>For example, suppose that tasks T1, T2, and T3 are
 | |
| all hard-affinitied to the largest-numbered CPU in the system.
 | |
| Then if task T1 blocked in an RCU read-side
 | |
| critical section, then an expedited grace period started,
 | |
| then task T2 blocked in an RCU read-side critical section,
 | |
| then a normal grace period started, and finally task 3 blocked
 | |
| in an RCU read-side critical section, then the state of the
 | |
| last leaf <tt>rcu_node</tt> structure's blocked-task list
 | |
| would be as shown below:
 | |
| 
 | |
| </p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%">
 | |
| 
 | |
| </p><p>Task T1 is blocking both grace periods, task T2 is
 | |
| blocking only the normal grace period, and task T3 is blocking
 | |
| neither grace period.
 | |
| Note that these tasks will not remove themselves from this list
 | |
| immediately upon resuming execution.
 | |
| They will instead remain on the list until they execute the outermost
 | |
| <tt>rcu_read_unlock()</tt> that ends their RCU read-side critical
 | |
| section.
 | |
| 
 | |
| <p>
 | |
| The <tt>->wait_blkd_tasks</tt> field indicates whether or not
 | |
| the current grace period is waiting on a blocked task.
 | |
| 
 | |
| <h5>Sizing the <tt>rcu_node</tt> Array</h5>
 | |
| 
 | |
| <p>The <tt>rcu_node</tt> array is sized via a series of
 | |
| C-preprocessor expressions as follows:
 | |
| 
 | |
| <pre>
 | |
|  1 #ifdef CONFIG_RCU_FANOUT
 | |
|  2 #define RCU_FANOUT CONFIG_RCU_FANOUT
 | |
|  3 #else
 | |
|  4 # ifdef CONFIG_64BIT
 | |
|  5 # define RCU_FANOUT 64
 | |
|  6 # else
 | |
|  7 # define RCU_FANOUT 32
 | |
|  8 # endif
 | |
|  9 #endif
 | |
| 10
 | |
| 11 #ifdef CONFIG_RCU_FANOUT_LEAF
 | |
| 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
 | |
| 13 #else
 | |
| 14 # ifdef CONFIG_64BIT
 | |
| 15 # define RCU_FANOUT_LEAF 64
 | |
| 16 # else
 | |
| 17 # define RCU_FANOUT_LEAF 32
 | |
| 18 # endif
 | |
| 19 #endif
 | |
| 20
 | |
| 21 #define RCU_FANOUT_1        (RCU_FANOUT_LEAF)
 | |
| 22 #define RCU_FANOUT_2        (RCU_FANOUT_1 * RCU_FANOUT)
 | |
| 23 #define RCU_FANOUT_3        (RCU_FANOUT_2 * RCU_FANOUT)
 | |
| 24 #define RCU_FANOUT_4        (RCU_FANOUT_3 * RCU_FANOUT)
 | |
| 25
 | |
| 26 #if NR_CPUS <= RCU_FANOUT_1
 | |
| 27 #  define RCU_NUM_LVLS        1
 | |
| 28 #  define NUM_RCU_LVL_0        1
 | |
| 29 #  define NUM_RCU_NODES        NUM_RCU_LVL_0
 | |
| 30 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0 }
 | |
| 31 #  define RCU_NODE_NAME_INIT  { "rcu_node_0" }
 | |
| 32 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0" }
 | |
| 33 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0" }
 | |
| 34 #elif NR_CPUS <= RCU_FANOUT_2
 | |
| 35 #  define RCU_NUM_LVLS        2
 | |
| 36 #  define NUM_RCU_LVL_0        1
 | |
| 37 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
 | |
| 38 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
 | |
| 39 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
 | |
| 40 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1" }
 | |
| 41 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1" }
 | |
| 42 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1" }
 | |
| 43 #elif NR_CPUS <= RCU_FANOUT_3
 | |
| 44 #  define RCU_NUM_LVLS        3
 | |
| 45 #  define NUM_RCU_LVL_0        1
 | |
| 46 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
 | |
| 47 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
 | |
| 48 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
 | |
| 49 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
 | |
| 50 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
 | |
| 51 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
 | |
| 52 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
 | |
| 53 #elif NR_CPUS <= RCU_FANOUT_4
 | |
| 54 #  define RCU_NUM_LVLS        4
 | |
| 55 #  define NUM_RCU_LVL_0        1
 | |
| 56 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
 | |
| 57 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
 | |
| 58 #  define NUM_RCU_LVL_3        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
 | |
| 59 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
 | |
| 60 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
 | |
| 61 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
 | |
| 62 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
 | |
| 63 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
 | |
| 64 #else
 | |
| 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
 | |
| 66 #endif
 | |
| </pre>
 | |
| 
 | |
| <p>The maximum number of levels in the <tt>rcu_node</tt> structure
 | |
| is currently limited to four, as specified by lines 21-24
 | |
| and the structure of the subsequent “if” statement.
 | |
| For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which
 | |
| should be sufficient for the next few years at least.
 | |
| For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which
 | |
| should see us through the next decade or so.
 | |
| This four-level tree also allows kernels built with
 | |
| <tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs,
 | |
| which might be useful in very large systems having eight CPUs per
 | |
| socket (but please note that no one has yet shown any measurable
 | |
| performance degradation due to misaligned socket and <tt>rcu_node</tt>
 | |
| boundaries).
 | |
| In addition, building kernels with a full four levels of <tt>rcu_node</tt>
 | |
| tree permits better testing of RCU's combining-tree code.
 | |
| 
 | |
| </p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children
 | |
| are permitted at each non-leaf level of the <tt>rcu_node</tt> tree.
 | |
| If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified,
 | |
| it is set based on the word size of the system, which is also
 | |
| the Kconfig default.
 | |
| 
 | |
| </p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are
 | |
| handled by each leaf <tt>rcu_node</tt> structure.
 | |
| Experience has shown that allowing a given leaf <tt>rcu_node</tt>
 | |
| structure to handle 64 CPUs, as permitted by the number of bits in
 | |
| the <tt>->qsmask</tt> field on a 64-bit system, results in
 | |
| excessive contention for the leaf <tt>rcu_node</tt> structures'
 | |
| <tt>->lock</tt> fields.
 | |
| The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore
 | |
| limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>.
 | |
| If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value
 | |
| selected is based on the word size of the system, just as for
 | |
| <tt>CONFIG_RCU_FANOUT</tt>.
 | |
| Lines 11-19 perform this computation.
 | |
| 
 | |
| </p><p>Lines 21-24 compute the maximum number of CPUs supported by
 | |
| a single-level (which contains a single <tt>rcu_node</tt> structure),
 | |
| two-level, three-level, and four-level <tt>rcu_node</tt> tree,
 | |
| respectively, given the fanout specified by <tt>RCU_FANOUT</tt>
 | |
| and <tt>RCU_FANOUT_LEAF</tt>.
 | |
| These numbers of CPUs are retained in the
 | |
| <tt>RCU_FANOUT_1</tt>,
 | |
| <tt>RCU_FANOUT_2</tt>,
 | |
| <tt>RCU_FANOUT_3</tt>, and
 | |
| <tt>RCU_FANOUT_4</tt>
 | |
| C-preprocessor variables, respectively.
 | |
| 
 | |
| </p><p>These variables are used to control the C-preprocessor <tt>#if</tt>
 | |
| statement spanning lines 26-66 that computes the number of
 | |
| <tt>rcu_node</tt> structures required for each level of the tree,
 | |
| as well as the number of levels required.
 | |
| The number of levels is placed in the <tt>NUM_RCU_LVLS</tt>
 | |
| C-preprocessor variable by lines 27, 35, 44, and 54.
 | |
| The number of <tt>rcu_node</tt> structures for the topmost level
 | |
| of the tree is always exactly one, and this value is unconditionally
 | |
| placed into <tt>NUM_RCU_LVL_0</tt> by lines 28, 36, 45, and 55.
 | |
| The rest of the levels (if any) of the <tt>rcu_node</tt> tree
 | |
| are computed by dividing the maximum number of CPUs by the
 | |
| fanout supported by the number of levels from the current level down,
 | |
| rounding up.  This computation is performed by lines 37,
 | |
| 46-47, and 56-58.
 | |
| Lines 31-33, 40-42, 50-52, and 62-63 create initializers
 | |
| for lockdep lock-class names.
 | |
| Finally, lines 64-66 produce an error if the maximum number of
 | |
| CPUs is too large for the specified fanout.
 | |
| 
 | |
| <h3><a name="The rcu_segcblist Structure">
 | |
| The <tt>rcu_segcblist</tt> Structure</a></h3>
 | |
| 
 | |
| The <tt>rcu_segcblist</tt> structure maintains a segmented list of
 | |
| callbacks as follows:
 | |
| 
 | |
| <pre>
 | |
|  1 #define RCU_DONE_TAIL        0
 | |
|  2 #define RCU_WAIT_TAIL        1
 | |
|  3 #define RCU_NEXT_READY_TAIL  2
 | |
|  4 #define RCU_NEXT_TAIL        3
 | |
|  5 #define RCU_CBLIST_NSEGS     4
 | |
|  6
 | |
|  7 struct rcu_segcblist {
 | |
|  8   struct rcu_head *head;
 | |
|  9   struct rcu_head **tails[RCU_CBLIST_NSEGS];
 | |
| 10   unsigned long gp_seq[RCU_CBLIST_NSEGS];
 | |
| 11   long len;
 | |
| 12   long len_lazy;
 | |
| 13 };
 | |
| </pre>
 | |
| 
 | |
| <p>
 | |
| The segments are as follows:
 | |
| 
 | |
| <ol>
 | |
| <li>	<tt>RCU_DONE_TAIL</tt>: Callbacks whose grace periods have elapsed.
 | |
| 	These callbacks are ready to be invoked.
 | |
| <li>	<tt>RCU_WAIT_TAIL</tt>: Callbacks that are waiting for the
 | |
| 	current grace period.
 | |
| 	Note that different CPUs can have different ideas about which
 | |
| 	grace period is current, hence the <tt>->gp_seq</tt> field.
 | |
| <li>	<tt>RCU_NEXT_READY_TAIL</tt>: Callbacks waiting for the next
 | |
| 	grace period to start.
 | |
| <li>	<tt>RCU_NEXT_TAIL</tt>: Callbacks that have not yet been
 | |
| 	associated with a grace period.
 | |
| </ol>
 | |
| 
 | |
| <p>
 | |
| The <tt>->head</tt> pointer references the first callback or
 | |
| is <tt>NULL</tt> if the list contains no callbacks (which is
 | |
| <i>not</i> the same as being empty).
 | |
| Each element of the <tt>->tails[]</tt> array references the
 | |
| <tt>->next</tt> pointer of the last callback in the corresponding
 | |
| segment of the list, or the list's <tt>->head</tt> pointer if
 | |
| that segment and all previous segments are empty.
 | |
| If the corresponding segment is empty but some previous segment is
 | |
| not empty, then the array element is identical to its predecessor.
 | |
| Older callbacks are closer to the head of the list, and new callbacks
 | |
| are added at the tail.
 | |
| This relationship between the <tt>->head</tt> pointer, the
 | |
| <tt>->tails[]</tt> array, and the callbacks is shown in this
 | |
| diagram:
 | |
| 
 | |
| </p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%">
 | |
| 
 | |
| </p><p>In this figure, the <tt>->head</tt> pointer references the
 | |
| first
 | |
| RCU callback in the list.
 | |
| The <tt>->tails[RCU_DONE_TAIL]</tt> array element references
 | |
| the <tt>->head</tt> pointer itself, indicating that none
 | |
| of the callbacks is ready to invoke.
 | |
| The <tt>->tails[RCU_WAIT_TAIL]</tt> array element references callback
 | |
| CB 2's <tt>->next</tt> pointer, which indicates that
 | |
| CB 1 and CB 2 are both waiting on the current grace period,
 | |
| give or take possible disagreements about exactly which grace period
 | |
| is the current one.
 | |
| The <tt>->tails[RCU_NEXT_READY_TAIL]</tt> array element
 | |
| references the same RCU callback that <tt>->tails[RCU_WAIT_TAIL]</tt>
 | |
| does, which indicates that there are no callbacks waiting on the next
 | |
| RCU grace period.
 | |
| The <tt>->tails[RCU_NEXT_TAIL]</tt> array element references
 | |
| CB 4's <tt>->next</tt> pointer, indicating that all the
 | |
| remaining RCU callbacks have not yet been assigned to an RCU grace
 | |
| period.
 | |
| Note that the <tt>->tails[RCU_NEXT_TAIL]</tt> array element
 | |
| always references the last RCU callback's <tt>->next</tt> pointer
 | |
| unless the callback list is empty, in which case it references
 | |
| the <tt>->head</tt> pointer.
 | |
| 
 | |
| <p>
 | |
| There is one additional important special case for the
 | |
| <tt>->tails[RCU_NEXT_TAIL]</tt> array element: It can be <tt>NULL</tt>
 | |
| when this list is <i>disabled</i>.
 | |
| Lists are disabled when the corresponding CPU is offline or when
 | |
| the corresponding CPU's callbacks are offloaded to a kthread,
 | |
| both of which are described elsewhere.
 | |
| 
 | |
| </p><p>CPUs advance their callbacks from the
 | |
| <tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the
 | |
| <tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments
 | |
| as grace periods advance.
 | |
| 
 | |
| </p><p>The <tt>->gp_seq[]</tt> array records grace-period
 | |
| numbers corresponding to the list segments.
 | |
| This is what allows different CPUs to have different ideas as to
 | |
| which is the current grace period while still avoiding premature
 | |
| invocation of their callbacks.
 | |
| In particular, this allows CPUs that go idle for extended periods
 | |
| to determine which of their callbacks are ready to be invoked after
 | |
| reawakening.
 | |
| 
 | |
| </p><p>The <tt>->len</tt> counter contains the number of
 | |
| callbacks in <tt>->head</tt>, and the
 | |
| <tt>->len_lazy</tt> contains the number of those callbacks that
 | |
| are known to only free memory, and whose invocation can therefore
 | |
| be safely deferred.
 | |
| 
 | |
| <p><b>Important note</b>: It is the <tt>->len</tt> field that
 | |
| determines whether or not there are callbacks associated with
 | |
| this <tt>rcu_segcblist</tt> structure, <i>not</i> the <tt>->head</tt>
 | |
| pointer.
 | |
| The reason for this is that all the ready-to-invoke callbacks
 | |
| (that is, those in the <tt>RCU_DONE_TAIL</tt> segment) are extracted
 | |
| all at once at callback-invocation time.
 | |
| If callback invocation must be postponed, for example, because a
 | |
| high-priority process just woke up on this CPU, then the remaining
 | |
| callbacks are placed back on the <tt>RCU_DONE_TAIL</tt> segment.
 | |
| Either way, the <tt>->len</tt> and <tt>->len_lazy</tt> counts
 | |
| are adjusted after the corresponding callbacks have been invoked, and so
 | |
| again it is the <tt>->len</tt> count that accurately reflects whether
 | |
| or not there are callbacks associated with this <tt>rcu_segcblist</tt>
 | |
| structure.
 | |
| Of course, off-CPU sampling of the <tt>->len</tt> count requires
 | |
| the use of appropriate synchronization, for example, memory barriers.
 | |
| This synchronization can be a bit subtle, particularly in the case
 | |
| of <tt>rcu_barrier()</tt>.
 | |
| 
 | |
| <h3><a name="The rcu_data Structure">
 | |
| The <tt>rcu_data</tt> Structure</a></h3>
 | |
| 
 | |
| <p>The <tt>rcu_data</tt> maintains the per-CPU state for the
 | |
| corresponding flavor of RCU.
 | |
| The fields in this structure may be accessed only from the corresponding
 | |
| CPU (and from tracing) unless otherwise stated.
 | |
| This structure is the
 | |
| focus of quiescent-state detection and RCU callback queuing.
 | |
| It also tracks its relationship to the corresponding leaf
 | |
| <tt>rcu_node</tt> structure to allow more-efficient
 | |
| propagation of quiescent states up the <tt>rcu_node</tt>
 | |
| combining tree.
 | |
| Like the <tt>rcu_node</tt> structure, it provides a local
 | |
| copy of the grace-period information to allow for-free
 | |
| synchronized
 | |
| access to this information from the corresponding CPU.
 | |
| Finally, this structure records past dyntick-idle state
 | |
| for the corresponding CPU and also tracks statistics.
 | |
| 
 | |
| </p><p>The <tt>rcu_data</tt> structure's fields are discussed,
 | |
| singly and in groups, in the following sections.
 | |
| 
 | |
| <h5>Connection to Other Data Structures</h5>
 | |
| 
 | |
| <p>This portion of the <tt>rcu_data</tt> structure is declared
 | |
| as follows:
 | |
| 
 | |
| <pre>
 | |
|   1   int cpu;
 | |
|   2   struct rcu_state *rsp;
 | |
|   3   struct rcu_node *mynode;
 | |
|   4   struct rcu_dynticks *dynticks;
 | |
|   5   unsigned long grpmask;
 | |
|   6   bool beenonline;
 | |
| </pre>
 | |
| 
 | |
| <p>The <tt>->cpu</tt> field contains the number of the
 | |
| corresponding CPU, the <tt>->rsp</tt> pointer references
 | |
| the corresponding <tt>rcu_state</tt> structure (and is most frequently
 | |
| used to locate the name of the corresponding flavor of RCU for tracing),
 | |
| and the <tt>->mynode</tt> field references the corresponding
 | |
| <tt>rcu_node</tt> structure.
 | |
| The <tt>->mynode</tt> is used to propagate quiescent states
 | |
| up the combining tree.
 | |
| <p>The <tt>->dynticks</tt> pointer references the
 | |
| <tt>rcu_dynticks</tt> structure corresponding to this
 | |
| CPU.
 | |
| Recall that a single per-CPU instance of the <tt>rcu_dynticks</tt>
 | |
| structure is shared among all flavors of RCU.
 | |
| These first four fields are constant and therefore require not
 | |
| synchronization.
 | |
| 
 | |
| </p><p>The <tt>->grpmask</tt> field indicates the bit in
 | |
| the <tt>->mynode->qsmask</tt> corresponding to this
 | |
| <tt>rcu_data</tt> structure, and is also used when propagating
 | |
| quiescent states.
 | |
| The <tt>->beenonline</tt> flag is set whenever the corresponding
 | |
| CPU comes online, which means that the debugfs tracing need not dump
 | |
| out any <tt>rcu_data</tt> structure for which this flag is not set.
 | |
| 
 | |
| <h5>Quiescent-State and Grace-Period Tracking</h5>
 | |
| 
 | |
| <p>This portion of the <tt>rcu_data</tt> structure is declared
 | |
| as follows:
 | |
| 
 | |
| <pre>
 | |
|   1   unsigned long completed;
 | |
|   2   unsigned long gpnum;
 | |
|   3   bool cpu_no_qs;
 | |
|   4   bool core_needs_qs;
 | |
|   5   bool gpwrap;
 | |
|   6   unsigned long rcu_qs_ctr_snap;
 | |
| </pre>
 | |
| 
 | |
| <p>The <tt>completed</tt> and <tt>gpnum</tt>
 | |
| fields are the counterparts of the fields of the same name
 | |
| in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures.
 | |
| They may each lag up to one behind their <tt>rcu_node</tt>
 | |
| counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and
 | |
| <tt>CONFIG_NO_HZ_FULL</tt> kernels can lag
 | |
| arbitrarily far behind for CPUs in dyntick-idle mode (but these counters
 | |
| will catch up upon exit from dyntick-idle mode).
 | |
| If a given <tt>rcu_data</tt> structure's <tt>->gpnum</tt> and
 | |
| <tt>->complete</tt> fields are equal, then this <tt>rcu_data</tt>
 | |
| structure believes that RCU is idle.
 | |
| Otherwise, as with the <tt>rcu_state</tt> and <tt>rcu_node</tt>
 | |
| structure,
 | |
| the <tt>->gpnum</tt> field will be one greater than the
 | |
| <tt>->complete</tt> fields, with <tt>->gpnum</tt>
 | |
| indicating which grace period this <tt>rcu_data</tt> believes
 | |
| is still being waited for.
 | |
| 
 | |
| <table>
 | |
| <tr><th> </th></tr>
 | |
| <tr><th align="left">Quick Quiz:</th></tr>
 | |
| <tr><td>
 | |
| 	All this replication of the grace period numbers can only cause
 | |
| 	massive confusion.
 | |
| 	Why not just keep a global pair of counters and be done with it???
 | |
| </td></tr>
 | |
| <tr><th align="left">Answer:</th></tr>
 | |
| <tr><td bgcolor="#ffffff"><font color="ffffff">
 | |
| 	Because if there was only a single global pair of grace-period
 | |
| 	numbers, there would need to be a single global lock to allow
 | |
| 	safely accessing and updating them.
 | |
| 	And if we are not going to have a single global lock, we need
 | |
| 	to carefully manage the numbers on a per-node basis.
 | |
| 	Recall from the answer to a previous Quick Quiz that the consequences
 | |
| 	of applying a previously sampled quiescent state to the wrong
 | |
| 	grace period are quite severe.
 | |
| </font></td></tr>
 | |
| <tr><td> </td></tr>
 | |
| </table>
 | |
| 
 | |
| <p>The <tt>->cpu_no_qs</tt> flag indicates that the
 | |
| CPU has not yet passed through a quiescent state,
 | |
| while the <tt>->core_needs_qs</tt> flag indicates that the
 | |
| RCU core needs a quiescent state from the corresponding CPU.
 | |
| The <tt>->gpwrap</tt> field indicates that the corresponding
 | |
| CPU has remained idle for so long that the <tt>completed</tt>
 | |
| and <tt>gpnum</tt> counters are in danger of overflow, which
 | |
| will cause the CPU to disregard the values of its counters on
 | |
| its next exit from idle.
 | |
| Finally, the <tt>rcu_qs_ctr_snap</tt> field is used to detect
 | |
| cases where a given operation has resulted in a quiescent state
 | |
| for all flavors of RCU, for example, <tt>cond_resched_rcu_qs()</tt>.
 | |
| 
 | |
| <h5>RCU Callback Handling</h5>
 | |
| 
 | |
| <p>In the absence of CPU-hotplug events, RCU callbacks are invoked by
 | |
| the same CPU that registered them.
 | |
| This is strictly a cache-locality optimization: callbacks can and
 | |
| do get invoked on CPUs other than the one that registered them.
 | |
| After all, if the CPU that registered a given callback has gone
 | |
| offline before the callback can be invoked, there really is no other
 | |
| choice.
 | |
| 
 | |
| </p><p>This portion of the <tt>rcu_data</tt> structure is declared
 | |
| as follows:
 | |
| 
 | |
| <pre>
 | |
|  1 struct rcu_segcblist cblist;
 | |
|  2 long qlen_last_fqs_check;
 | |
|  3 unsigned long n_cbs_invoked;
 | |
|  4 unsigned long n_nocbs_invoked;
 | |
|  5 unsigned long n_cbs_orphaned;
 | |
|  6 unsigned long n_cbs_adopted;
 | |
|  7 unsigned long n_force_qs_snap;
 | |
|  8 long blimit;
 | |
| </pre>
 | |
| 
 | |
| <p>The <tt>->cblist</tt> structure is the segmented callback list
 | |
| described earlier.
 | |
| The CPU advances the callbacks in its <tt>rcu_data</tt> structure
 | |
| whenever it notices that another RCU grace period has completed.
 | |
| The CPU detects the completion of an RCU grace period by noticing
 | |
| that the value of its <tt>rcu_data</tt> structure's
 | |
| <tt>->completed</tt> field differs from that of its leaf
 | |
| <tt>rcu_node</tt> structure.
 | |
| Recall that each <tt>rcu_node</tt> structure's
 | |
| <tt>->completed</tt> field is updated at the end of each
 | |
| grace period.
 | |
| 
 | |
| <p>
 | |
| The <tt>->qlen_last_fqs_check</tt> and
 | |
| <tt>->n_force_qs_snap</tt> coordinate the forcing of quiescent
 | |
| states from <tt>call_rcu()</tt> and friends when callback
 | |
| lists grow excessively long.
 | |
| 
 | |
| </p><p>The <tt>->n_cbs_invoked</tt>,
 | |
| <tt>->n_cbs_orphaned</tt>, and <tt>->n_cbs_adopted</tt>
 | |
| fields count the number of callbacks invoked,
 | |
| sent to other CPUs when this CPU goes offline,
 | |
| and received from other CPUs when those other CPUs go offline.
 | |
| The <tt>->n_nocbs_invoked</tt> is used when the CPU's callbacks
 | |
| are offloaded to a kthread.
 | |
| 
 | |
| <p>
 | |
| Finally, the <tt>->blimit</tt> counter is the maximum number of
 | |
| RCU callbacks that may be invoked at a given time.
 | |
| 
 | |
| <h5>Dyntick-Idle Handling</h5>
 | |
| 
 | |
| <p>This portion of the <tt>rcu_data</tt> structure is declared
 | |
| as follows:
 | |
| 
 | |
| <pre>
 | |
|   1   int dynticks_snap;
 | |
|   2   unsigned long dynticks_fqs;
 | |
| </pre>
 | |
| 
 | |
| The <tt>->dynticks_snap</tt> field is used to take a snapshot
 | |
| of the corresponding CPU's dyntick-idle state when forcing
 | |
| quiescent states, and is therefore accessed from other CPUs.
 | |
| Finally, the <tt>->dynticks_fqs</tt> field is used to
 | |
| count the number of times this CPU is determined to be in
 | |
| dyntick-idle state, and is used for tracing and debugging purposes.
 | |
| 
 | |
| <h3><a name="The rcu_dynticks Structure">
 | |
| The <tt>rcu_dynticks</tt> Structure</a></h3>
 | |
| 
 | |
| <p>The <tt>rcu_dynticks</tt> maintains the per-CPU dyntick-idle state
 | |
| for the corresponding CPU.
 | |
| Unlike the other structures, <tt>rcu_dynticks</tt> is not
 | |
| replicated over the different flavors of RCU.
 | |
| The fields in this structure may be accessed only from the corresponding
 | |
| CPU (and from tracing) unless otherwise stated.
 | |
| Its fields are as follows:
 | |
| 
 | |
| <pre>
 | |
|   1   int dynticks_nesting;
 | |
|   2   int dynticks_nmi_nesting;
 | |
|   3   atomic_t dynticks;
 | |
|   4   bool rcu_need_heavy_qs;
 | |
|   5   unsigned long rcu_qs_ctr;
 | |
|   6   bool rcu_urgent_qs;
 | |
| </pre>
 | |
| 
 | |
| <p>The <tt>->dynticks_nesting</tt> field counts the
 | |
| nesting depth of normal interrupts.
 | |
| In addition, this counter is incremented when exiting dyntick-idle
 | |
| mode and decremented when entering it.
 | |
| This counter can therefore be thought of as counting the number
 | |
| of reasons why this CPU cannot be permitted to enter dyntick-idle
 | |
| mode, aside from non-maskable interrupts (NMIs).
 | |
| NMIs are counted by the <tt>->dynticks_nmi_nesting</tt>
 | |
| field, except that NMIs that interrupt non-dyntick-idle execution
 | |
| are not counted.
 | |
| 
 | |
| </p><p>The <tt>->dynticks</tt> field counts the corresponding
 | |
| CPU's transitions to and from dyntick-idle mode, so that this counter
 | |
| has an even value when the CPU is in dyntick-idle mode and an odd
 | |
| value otherwise.
 | |
| 
 | |
| </p><p>The <tt>->rcu_need_heavy_qs</tt> field is used
 | |
| to record the fact that the RCU core code would really like to
 | |
| see a quiescent state from the corresponding CPU, so much so that
 | |
| it is willing to call for heavy-weight dyntick-counter operations.
 | |
| This flag is checked by RCU's context-switch and <tt>cond_resched()</tt>
 | |
| code, which provide a momentary idle sojourn in response.
 | |
| 
 | |
| </p><p>The <tt>->rcu_qs_ctr</tt> field is used to record
 | |
| quiescent states from <tt>cond_resched()</tt>.
 | |
| Because <tt>cond_resched()</tt> can execute quite frequently, this
 | |
| must be quite lightweight, as in a non-atomic increment of this
 | |
| per-CPU field.
 | |
| 
 | |
| </p><p>Finally, the <tt>->rcu_urgent_qs</tt> field is used to record
 | |
| the fact that the RCU core code would really like to see a quiescent
 | |
| state from the corresponding CPU, with the various other fields indicating
 | |
| just how badly RCU wants this quiescent state.
 | |
| This flag is checked by RCU's context-switch and <tt>cond_resched()</tt>
 | |
| code, which, if nothing else, non-atomically increment <tt>->rcu_qs_ctr</tt>
 | |
| in response.
 | |
| 
 | |
| <table>
 | |
| <tr><th> </th></tr>
 | |
| <tr><th align="left">Quick Quiz:</th></tr>
 | |
| <tr><td>
 | |
| 	Why not just count all NMIs?
 | |
| 	Wouldn't that be simpler and less error prone?
 | |
| </td></tr>
 | |
| <tr><th align="left">Answer:</th></tr>
 | |
| <tr><td bgcolor="#ffffff"><font color="ffffff">
 | |
| 	It seems simpler only until you think hard about how to go about
 | |
| 	updating the <tt>rcu_dynticks</tt> structure's
 | |
| 	<tt>->dynticks</tt> field.
 | |
| </font></td></tr>
 | |
| <tr><td> </td></tr>
 | |
| </table>
 | |
| 
 | |
| <p>Additional fields are present for some special-purpose
 | |
| builds, and are discussed separately.
 | |
| 
 | |
| <h3><a name="The rcu_head Structure">
 | |
| The <tt>rcu_head</tt> Structure</a></h3>
 | |
| 
 | |
| <p>Each <tt>rcu_head</tt> structure represents an RCU callback.
 | |
| These structures are normally embedded within RCU-protected data
 | |
| structures whose algorithms use asynchronous grace periods.
 | |
| In contrast, when using algorithms that block waiting for RCU grace periods,
 | |
| RCU users need not provide <tt>rcu_head</tt> structures.
 | |
| 
 | |
| </p><p>The <tt>rcu_head</tt> structure has fields as follows:
 | |
| 
 | |
| <pre>
 | |
|   1   struct rcu_head *next;
 | |
|   2   void (*func)(struct rcu_head *head);
 | |
| </pre>
 | |
| 
 | |
| <p>The <tt>->next</tt> field is used
 | |
| to link the <tt>rcu_head</tt> structures together in the
 | |
| lists within the <tt>rcu_data</tt> structures.
 | |
| The <tt>->func</tt> field is a pointer to the function
 | |
| to be called when the callback is ready to be invoked, and
 | |
| this function is passed a pointer to the <tt>rcu_head</tt>
 | |
| structure.
 | |
| However, <tt>kfree_rcu()</tt> uses the <tt>->func</tt>
 | |
| field to record the offset of the <tt>rcu_head</tt>
 | |
| structure within the enclosing RCU-protected data structure.
 | |
| 
 | |
| </p><p>Both of these fields are used internally by RCU.
 | |
| From the viewpoint of RCU users, this structure is an
 | |
| opaque “cookie”.
 | |
| 
 | |
| <table>
 | |
| <tr><th> </th></tr>
 | |
| <tr><th align="left">Quick Quiz:</th></tr>
 | |
| <tr><td>
 | |
| 	Given that the callback function <tt>->func</tt>
 | |
| 	is passed a pointer to the <tt>rcu_head</tt> structure,
 | |
| 	how is that function supposed to find the beginning of the
 | |
| 	enclosing RCU-protected data structure?
 | |
| </td></tr>
 | |
| <tr><th align="left">Answer:</th></tr>
 | |
| <tr><td bgcolor="#ffffff"><font color="ffffff">
 | |
| 	In actual practice, there is a separate callback function per
 | |
| 	type of RCU-protected data structure.
 | |
| 	The callback function can therefore use the <tt>container_of()</tt>
 | |
| 	macro in the Linux kernel (or other pointer-manipulation facilities
 | |
| 	in other software environments) to find the beginning of the
 | |
| 	enclosing structure.
 | |
| </font></td></tr>
 | |
| <tr><td> </td></tr>
 | |
| </table>
 | |
| 
 | |
| <h3><a name="RCU-Specific Fields in the task_struct Structure">
 | |
| RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3>
 | |
| 
 | |
| <p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some
 | |
| additional fields in the <tt>task_struct</tt> structure:
 | |
| 
 | |
| <pre>
 | |
|  1 #ifdef CONFIG_PREEMPT_RCU
 | |
|  2   int rcu_read_lock_nesting;
 | |
|  3   union rcu_special rcu_read_unlock_special;
 | |
|  4   struct list_head rcu_node_entry;
 | |
|  5   struct rcu_node *rcu_blocked_node;
 | |
|  6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
 | |
|  7 #ifdef CONFIG_TASKS_RCU
 | |
|  8   unsigned long rcu_tasks_nvcsw;
 | |
|  9   bool rcu_tasks_holdout;
 | |
| 10   struct list_head rcu_tasks_holdout_list;
 | |
| 11   int rcu_tasks_idle_cpu;
 | |
| 12 #endif /* #ifdef CONFIG_TASKS_RCU */
 | |
| </pre>
 | |
| 
 | |
| <p>The <tt>->rcu_read_lock_nesting</tt> field records the
 | |
| nesting level for RCU read-side critical sections, and
 | |
| the <tt>->rcu_read_unlock_special</tt> field is a bitmask
 | |
| that records special conditions that require <tt>rcu_read_unlock()</tt>
 | |
| to do additional work.
 | |
| The <tt>->rcu_node_entry</tt> field is used to form lists of
 | |
| tasks that have blocked within preemptible-RCU read-side critical
 | |
| sections and the <tt>->rcu_blocked_node</tt> field references
 | |
| the <tt>rcu_node</tt> structure whose list this task is a member of,
 | |
| or <tt>NULL</tt> if it is not blocked within a preemptible-RCU
 | |
| read-side critical section.
 | |
| 
 | |
| <p>The <tt>->rcu_tasks_nvcsw</tt> field tracks the number of
 | |
| voluntary context switches that this task had undergone at the
 | |
| beginning of the current tasks-RCU grace period,
 | |
| <tt>->rcu_tasks_holdout</tt> is set if the current tasks-RCU
 | |
| grace period is waiting on this task, <tt>->rcu_tasks_holdout_list</tt>
 | |
| is a list element enqueuing this task on the holdout list,
 | |
| and <tt>->rcu_tasks_idle_cpu</tt> tracks which CPU this
 | |
| idle task is running, but only if the task is currently running,
 | |
| that is, if the CPU is currently idle.
 | |
| 
 | |
| <h3><a name="Accessor Functions">
 | |
| Accessor Functions</a></h3>
 | |
| 
 | |
| <p>The following listing shows the
 | |
| <tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first</tt>,
 | |
| <tt>rcu_for_each_nonleaf_node_breadth_first()</tt>, and
 | |
| <tt>rcu_for_each_leaf_node()</tt> function and macros:
 | |
| 
 | |
| <pre>
 | |
|   1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
 | |
|   2 {
 | |
|   3   return &rsp->node[0];
 | |
|   4 }
 | |
|   5
 | |
|   6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
 | |
|   7   for ((rnp) = &(rsp)->node[0]; \
 | |
|   8        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
 | |
|   9
 | |
|  10 #define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \
 | |
|  11   for ((rnp) = &(rsp)->node[0]; \
 | |
|  12        (rnp) < (rsp)->level[NUM_RCU_LVLS - 1]; (rnp)++)
 | |
|  13
 | |
|  14 #define rcu_for_each_leaf_node(rsp, rnp) \
 | |
|  15   for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
 | |
|  16        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
 | |
| </pre>
 | |
| 
 | |
| <p>The <tt>rcu_get_root()</tt> simply returns a pointer to the
 | |
| first element of the specified <tt>rcu_state</tt> structure's
 | |
| <tt>->node[]</tt> array, which is the root <tt>rcu_node</tt>
 | |
| structure.
 | |
| 
 | |
| </p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt>
 | |
| macro takes advantage of the layout of the <tt>rcu_node</tt>
 | |
| structures in the <tt>rcu_state</tt> structure's
 | |
| <tt>->node[]</tt> array, performing a breadth-first traversal by
 | |
| simply traversing the array in order.
 | |
| The <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> macro operates
 | |
| similarly, but traverses only the first part of the array, thus excluding
 | |
| the leaf <tt>rcu_node</tt> structures.
 | |
| Finally, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only
 | |
| the last part of the array, thus traversing only the leaf
 | |
| <tt>rcu_node</tt> structures.
 | |
| 
 | |
| <table>
 | |
| <tr><th> </th></tr>
 | |
| <tr><th align="left">Quick Quiz:</th></tr>
 | |
| <tr><td>
 | |
| 	What do <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> and
 | |
| 	<tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree
 | |
| 	contains only a single node?
 | |
| </td></tr>
 | |
| <tr><th align="left">Answer:</th></tr>
 | |
| <tr><td bgcolor="#ffffff"><font color="ffffff">
 | |
| 	In the single-node case,
 | |
| 	<tt>rcu_for_each_nonleaf_node_breadth_first()</tt> is a no-op
 | |
| 	and <tt>rcu_for_each_leaf_node()</tt> traverses the single node.
 | |
| </font></td></tr>
 | |
| <tr><td> </td></tr>
 | |
| </table>
 | |
| 
 | |
| <h3><a name="Summary">
 | |
| Summary</a></h3>
 | |
| 
 | |
| So each flavor of RCU is represented by an <tt>rcu_state</tt> structure,
 | |
| which contains a combining tree of <tt>rcu_node</tt> and
 | |
| <tt>rcu_data</tt> structures.
 | |
| Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle
 | |
| state is tracked by an <tt>rcu_dynticks</tt> structure.
 | |
| 
 | |
| If you made it this far, you are well prepared to read the code
 | |
| walkthroughs in the other articles in this series.
 | |
| 
 | |
| <h3><a name="Acknowledgments">
 | |
| Acknowledgments</a></h3>
 | |
| 
 | |
| I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul
 | |
| Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn
 | |
| for helping me get this document into a more human-readable state.
 | |
| 
 | |
| <h3><a name="Legal Statement">
 | |
| Legal Statement</a></h3>
 | |
| 
 | |
| <p>This work represents the view of the author and does not necessarily
 | |
| represent the view of IBM.
 | |
| 
 | |
| </p><p>Linux is a registered trademark of Linus Torvalds.
 | |
| 
 | |
| </p><p>Other company, product, and service names may be trademarks or
 | |
| service marks of others.
 | |
| 
 | |
| </body></html>
 |