If a CPU is running either a userspace application or a guest OS in
nohz_full mode, it is possible for a system call to occur just as an
RCU grace period is starting. If that CPU also has the scheduling-clock
tick enabled for any reason (such as a second runnable task), and if the
system was booted with rcutree.use_softirq=0, then RCU can add insult to
injury by awakening that CPU's rcuc kthread, resulting in yet another
task and yet more OS jitter due to switching to that task, running it,
and switching back.
In addition, in the common case where that system call is not of
excessively long duration, awakening the rcuc task is pointless.
This pointlessness is due to the fact that the CPU will enter an extended
quiescent state upon returning to the userspace application or guest OS.
In this case, the rcuc kthread cannot do anything that the main RCU
grace-period kthread cannot do on its behalf, at least if it is given
a few additional milliseconds (for example, given the time duration
specified by rcutree.jiffies_till_first_fqs, give or take scheduling
delays).
This commit therefore adds a rcutree.nohz_full_patience_delay kernel
boot parameter that specifies the grace period age (in milliseconds,
rounded to jiffies) before which RCU will refrain from awakening the
rcuc kthread. Preliminary experimentation suggests a value of 1000,
that is, one second. Increasing rcutree.nohz_full_patience_delay will
increase grace-period latency and in turn increase memory footprint,
so systems with constrained memory might choose a smaller value.
Systems with less-aggressive OS-jitter requirements might choose the
default value of zero, which keeps the traditional immediate-wakeup
behavior, thus avoiding increases in grace-period latency.
[ paulmck: Apply Leonardo Bras feedback. ]
Link: https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
Reported-by: Leonardo Bras <leobras@redhat.com>
Suggested-by: Leonardo Bras <leobras@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Leonardo Bras <leobras@redhat.com>
RCU stall printout fetches the EQS state of a CPU with a preceding full
memory barrier. However there is nothing to order this read against at
this debugging stage. It is inherently racy when performed remotely.
Do a plain read instead.
This was the last user of rcu_dynticks_snap().
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
When the boot CPU initializes the per-CPU data on behalf of all possible
CPUs, a sanity check is performed on each of them to make sure none is
initialized in an extended quiescent state.
This check involves a full memory barrier which is useless at this early
boot stage.
Do a plain access instead.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
When the grace period kthread checks the extended quiescent state
counter of a CPU, full ordering is necessary to ensure that either:
* If the GP kthread observes the remote target in an extended quiescent
state, then that target must observe all accesses prior to the current
grace period, including the current grace period sequence number, once
it exits that extended quiescent state.
or:
* If the GP kthread observes the remote target NOT in an extended
quiescent state, then the target further entering in an extended
quiescent state must observe all accesses prior to the current
grace period, including the current grace period sequence number, once
it enters that extended quiescent state.
This ordering is enforced through a full memory barrier placed right
before taking the first EQS snapshot. However this is superfluous
because the snapshot is taken while holding the target's rnp lock which
provides the necessary ordering through its chain of
smp_mb__after_unlock_lock().
Remove the needless explicit barrier before the snapshot and put a
comment about the implicit barrier newly relied upon here.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When the grace period kthread checks the extended quiescent state
counter of a CPU, full ordering is necessary to ensure that either:
* If the GP kthread observes the remote target in an extended quiescent
state, then that target must observe all accesses prior to the current
grace period, including the current grace period sequence number, once
it exits that extended quiescent state. Also the GP kthread must
observe all accesses performed by the target prior it entering in
EQS.
or:
* If the GP kthread observes the remote target NOT in an extended
quiescent state, then the target further entering in an extended
quiescent state must observe all accesses prior to the current
grace period, including the current grace period sequence number, once
it enters that extended quiescent state. Also the GP kthread later
observing that EQS must also observe all accesses performed by the
target prior it entering in EQS.
This ordering is explicitly performed both on the first EQS snapshot
and on the second one as well through the combination of a preceding
full barrier followed by an acquire read. However the second snapshot's
full memory barrier is redundant and not needed to enforce the above
guarantees:
GP kthread Remote target
---- -----
// Access prior GP
WRITE_ONCE(A, 1)
// first snapshot
smp_mb()
x = smp_load_acquire(EQS)
// Access prior GP
WRITE_ONCE(B, 1)
// EQS enter
// implied full barrier by atomic_add_return()
atomic_add_return(RCU_DYNTICKS_IDX, EQS)
// implied full barrier by atomic_add_return()
READ_ONCE(A)
// second snapshot
y = smp_load_acquire(EQS)
z = READ_ONCE(B)
If the GP kthread above fails to observe the remote target in EQS
(x not in EQS), the remote target will observe A == 1 after further
entering in EQS. Then the second snapshot taken by the GP kthread only
need to be an acquire read in order to observe z == 1.
Therefore remove the needless full memory barrier on second snapshot.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Interrupts are enabled in rcu_gp_init(), so this commit switches from
local_irq_save() and local_irq_restore() to local_irq_disable() and
local_irq_enable().
Link: https://lore.kernel.org/all/febb13ab-a4bb-48b4-8e97-7e9f7749e6da@moroto.mountain/
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In the synchronize_rcu() common case, we will have less than
SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
is pointless just to free the last injected wait head since at that point,
all the users have already been awakened.
Introduce a new counter to track this and prevent the wakeup in the
common case.
[ paulmck: Remove atomic_dec_return_release in cannot-happen state. ]
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit make rcu-tasks related rcutorture test support rcu-tasks
gp state printing when the writer stall occurs or the at the end of
rcutorture test, and generate rcu_ops->get_gp_data() operation to
simplify the acquisition of gp state for different types of rcutorture
tests.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
synchronize_rcu() users have to be processed regardless
of memory pressure so our private WQ needs to have at least
one execution context what WQ_MEM_RECLAIM flag guarantees.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
This patch introduces a small enhancement which allows to do a
direct wake-up of synchronize_rcu() callers. It occurs after a
completion of grace period, thus by the gp-kthread.
Number of clients is limited by the hard-coded maximum allowed
threshold. The remaining part, if still exists is deferred to
a main worker.
Link: https://lore.kernel.org/lkml/Zd0ZtNu+Rt0qXkfS@lothringen/
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Add an rcu_sr_normal() trace event. It takes three arguments
first one is the name of RCU flavour, second one is a user id
which triggeres synchronize_rcu_normal() and last one is an
event.
There are two traces in the synchronize_rcu_normal(). On entry,
when a new request is registered and on exit point when request
is completed.
Please note, CONFIG_RCU_TRACE=y is required to activate traces.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
A call to a synchronize_rcu() can be optimized from a latency
point of view. Workloads which depend on this can benefit of it.
The delay of wakeme_after_rcu() callback, which unblocks a waiter,
depends on several factors:
- how fast a process of offloading is started. Combination of:
- !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
- !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
- other.
- when started, invoking path is interrupted due to:
- time limit;
- need_resched();
- if limit is reached.
- where in a nocb list it is located;
- how fast previous callbacks completed;
Example:
1. On our embedded devices i can easily trigger the scenario when
it is a last in the list out of ~3600 callbacks:
<snip>
<...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
...
<...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
<...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
<snip>
2. We use cpuset/cgroup to classify tasks and assign them into
different cgroups. For example "backgrond" group which binds tasks
only to little CPUs or "foreground" which makes use of all CPUs.
Tasks can be migrated between groups by a request if an acceleration
is needed.
See below an example how "surfaceflinger" task gets migrated.
Initially it is located in the "system-background" cgroup which
allows to run only on little cores. In order to speed it up it
can be temporary moved into "foreground" cgroup which allows
to use big/all CPUs:
cgroup_attach_task():
-> cgroup_migrate_execute()
-> cpuset_can_attach()
-> percpu_down_write()
-> rcu_sync_enter()
-> synchronize_rcu()
-> now move tasks to the new cgroup.
-> cgroup_migrate_finish()
<snip>
rcuop/1-29 [000] ..... 7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt
PERFD-SERVER-1855 [000] d..1. 7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
TimerDispatch-2768 [002] d..5. 7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4
<snip>
"Boosting a task" depends on synchronize_rcu() latency:
- first trace shows a completion of synchronize_rcu();
- second shows attaching a task to a new group;
- last shows a final step when migration occurs.
3. To address this drawback, maintain a separate track that consists
of synchronize_rcu() callers only. After completion of a grace period
users are deferred to a dedicated worker to process requests.
4. This patch reduces the latency of synchronize_rcu() approximately
by ~30-40% on synthetic tests. The real test case, camera launch time,
shows(time is in milliseconds):
1-run 542 vs 489 improvement 9%
2-run 540 vs 466 improvement 13%
3-run 518 vs 468 improvement 9%
4-run 531 vs 457 improvement 13%
5-run 548 vs 475 improvement 13%
6-run 509 vs 484 improvement 4%
Synthetic test(no "noise" from other callbacks):
Hardware: x86_64 64 CPUs, 64GB of memory
Linux-6.6
- 10K tasks(simultaneous);
- each task does(1000 loops)
synchronize_rcu();
kfree(p);
default: CONFIG_RCU_NOCB_CPU: takes 54 seconds to complete all users;
patch: CONFIG_RCU_NOCB_CPU: takes 35 seconds to complete all users.
Running 60K gives approximately same results on my setup. Please note
it is without any interaction with another type of callbacks, otherwise
it will impact a lot a default case.
5. By default it is disabled. To enable this perform one of the
below sequence:
echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Currently, there are rcu_data structure fields named ->rcu_onl_gp_seq
and ->rcu_ofl_gp_seq that track the rcu_state.gp_flags field at the
time of the corresponding CPU's last online or offline operation,
respectively. However, this information is not particularly useful.
It would be better to instead track the grace period state kept
in rcu_state.gp_state. This would also be consistent with the
initialization in rcu_boot_init_percpu_data(), which is to RCU_GP_CLEANED
(an rcu_state.gp_state value), and also with the diagnostics in
rcu_implicit_dynticks_qs(), whose format is consistent with an integer,
not a bitmask.
This commit therefore makes this change and changes the names to
->rcu_onl_gp_flags and ->rcu_ofl_gp_flags, respectively.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
The rcu_state.n_online_cpus value is only ever updated by CPU-hotplug
operations, which are serialized. However, this value is read locklessly.
This commit therefore marks those reads. While in the area, it also
adds ASSERT_EXCLUSIVE_WRITER() calls just in case parallel CPU hotplug
becomes a thing.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
There is some indications that rcu_softirq_qs() might be more generally
used than anticipated. This commit therefore adds some lockdep assertions
and some cautionary tales in a new kernel-doc header.
Link: https://lore.kernel.org/all/Zd4DXTyCf17lcTfq@debian.debian/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Yan Zhai <yan@cloudflare.com>
Cc: <netdev@vger.kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
In order for RCU Tasks to reliably maintain per-CPU lists of exiting
tasks, those lists must be initialized before it is possible for tasks
to exit, especially given that the boot CPU is not necessarily CPU 0
(an example being, powerpc kexec() kernels). And at the time that
rcu_init_tasks_generic() is called, a task could potentially exit,
unconventional though that sort of thing might be.
This commit therefore moves the calls to cblist_init_generic() from
functions called from rcu_init_tasks_generic() to a new function named
tasks_cblist_init_generic() that is invoked from rcu_init().
This constituted a bug in a commit that never went to mainline, so
there is no need for any backporting to -stable.
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
To allow more flexible arrangements while still provide a single kernel
for distros, provide a boot time parameter to enable/disable lazy RCU.
Specify:
rcutree.enable_rcu_lazy=[y|1|n|0]
Which also requires
rcu_nocbs=all
at boot time to enable/disable lazy RCU.
To disable it by default at build time when CONFIG_RCU_LAZY=y, the new
CONFIG_RCU_LAZY_DEFAULT_OFF can be used.
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
TREE04 running on short iterations can produce writer stalls of the
following kind:
??? Writer stall state RTWS_EXP_SYNC(4) g3968 f0x0 ->state 0x2 cpu 0
task:rcu_torture_wri state:D stack:14568 pid:83 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2de/0x850
? trace_event_raw_event_rcu_exp_funnel_lock+0x6d/0xb0
schedule+0x4f/0x90
synchronize_rcu_expedited+0x430/0x670
? __pfx_autoremove_wake_function+0x10/0x10
? __pfx_synchronize_rcu_expedited+0x10/0x10
do_rtws_sync.constprop.0+0xde/0x230
rcu_torture_writer+0x4b4/0xcd0
? __pfx_rcu_torture_writer+0x10/0x10
kthread+0xc7/0xf0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2f/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
Waiting for an expedited grace period and polling for an expedited
grace period both are operations that internally rely on the same
workqueue performing necessary asynchronous work.
However, a dependency chain is involved between those two operations,
as depicted below:
====== CPU 0 ======= ====== CPU 1 =======
synchronize_rcu_expedited()
exp_funnel_lock()
mutex_lock(&rcu_state.exp_mutex);
start_poll_synchronize_rcu_expedited
queue_work(rcu_gp_wq, &rnp->exp_poll_wq);
synchronize_rcu_expedited_queue_work()
queue_work(rcu_gp_wq, &rew->rew_work);
wait_event() // A, wait for &rew->rew_work completion
mutex_unlock() // B
//======> switch to kworker
sync_rcu_do_polled_gp() {
synchronize_rcu_expedited()
exp_funnel_lock()
mutex_lock(&rcu_state.exp_mutex); // C, wait B
....
} // D
Since workqueues are usually implemented on top of several kworkers
handling the queue concurrently, the above situation wouldn't deadlock
most of the time because A then doesn't depend on D. But in case of
memory stress, a single kworker may end up handling alone all the works
in a serialized way. In that case the above layout becomes a problem
because A then waits for D, closing a circular dependency:
A -> D -> C -> B -> A
This however only happens when CONFIG_RCU_EXP_KTHREAD=n. Indeed
synchronize_rcu_expedited() is otherwise implemented on top of a kthread
worker while polling still relies on rcu_gp_wq workqueue, breaking the
above circular dependency chain.
Fix this with making expedited grace period to always rely on kthread
worker. The workqueue based implementation is essentially a duplicate
anyway now that the per-node initialization is performed by per-node
kthread workers.
Meanwhile the CONFIG_RCU_EXP_KTHREAD switch is still kept around to
manage the scheduler policy of these kthread workers.
Reported-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Suggested-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Affine the parallel expedited gp kworkers to their respective RCU node
in order to make them close to the cache their are playing with.
This reuses the boost kthreads machinery that probe into CPU hotplug
operations such that the kthreads become/stay affine to their respective
node as soon/long as they contain online CPUs. Otherwise and if the
current CPU going down was the last online on the leaf node, the related
kthread is affine to the housekeeping CPUs.
In the long run, this affinity VS CPU hotplug operation game should
probably be implemented at the generic kthread level.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
[boqun: s/* rcu_boost_task/*rcu_boost_task as reported by checkpatch]
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
When CONFIG_RCU_EXP_KTHREAD=n, the expedited grace period per node
initialization is performed in parallel via workqueues (one work per
node).
However in CONFIG_RCU_EXP_KTHREAD=y, this per node initialization is
performed by a single kworker serializing each node initialization (one
work for all nodes).
The second part is certainly less scalable and efficient beyond a single
leaf node.
To improve this, expand this single kworker into per-node kworkers. This
new layout is eventually intended to remove the workqueues based
implementation since it will essentially now become duplicate code.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
The expedited kthread worker performing the per node initialization is
going to be split into per node kthreads. As such, the future per node
kthread creation will need to be called from CPU hotplug callbacks
instead of an initcall, right beside the per node boost kthread
creation.
To prepare for that, move the kthread worker creation above
rcutree_prepare_cpu() as a first step to make the review smoother for
the upcoming modifications.
No intended functional change.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
This mutex is currently protecting per node boost kthreads creation and
affinity setting across CPU hotplug operations.
Since the expedited kworkers will soon be split per node as well, they
will be subject to the same concurrency constraints against hotplug.
Therefore their creation and affinity tuning operations will be grouped
with those of boost kthreads and then rely on the same mutex.
To prepare for that, generalize its name.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Just like is done for the kworker performing nodes initialization,
gracefully handle the possible allocation failure of the RCU expedited
grace period main kworker.
While at it perform a rename of the related checking functions to better
reflect the expedited specifics.
Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9621fbee44 ("rcu: Move expedited grace period (GP) work to RT kthread_worker")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Under CONFIG_RCU_EXP_KTHREAD=y, the nodes initialization for expedited
grace periods is queued to a kworker. However if the allocation of that
kworker failed, the nodes initialization is performed synchronously by
the caller instead.
Now the check for kworker initialization failure relies on the kworker
pointer to be NULL while its value might actually encapsulate an
allocation failure error.
Make sure to handle this case.
Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9621fbee44 ("rcu: Move expedited grace period (GP) work to RT kthread_worker")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Currently the call_rcu() function interleaves NOCB and !NOCB enqueue
code in a complicated way such that:
* The bypass enqueue code may or may not have enqueued and may or may
not have locked the ->nocb_lock. Everything that follows is in a
Schrödinger locking state for the unwary reviewer's eyes.
* The was_alldone is always set but only used in NOCB related code.
* The NOCB wake up is distantly related to the locking hopefully
performed by the bypass enqueue code that did not enqueue on the
bypass list.
Unconfuse the whole and gather NOCB and !NOCB specific enqueue code to
their own functions.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Currently IRQs are disabled on call_rcu() and then depending on the
context:
* If the CPU is in nocb mode:
- If the callback is enqueued in the bypass list, IRQs are re-enabled
implictly by rcu_nocb_try_bypass()
- If the callback is enqueued in the normal list, IRQs are re-enabled
implicitly by __call_rcu_nocb_wake()
* If the CPU is NOT in nocb mode, IRQs are reenabled explicitly from call_rcu()
This makes the code a bit hard to follow, especially as it interleaves
with nocb locking.
To make the IRQ flags coverage clearer and also in order to prepare for
moving all the nocb enqueue code to its own function, always re-enable
the IRQ flags explicitly from call_rcu().
Reviewed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
A full barrier is issued from nocb_gp_wait() upon callbacks advancing
to order grace period completion with callbacks execution.
However these two events are already ordered by the
smp_mb__after_unlock_lock() barrier within the call to
raw_spin_lock_rcu_node() that is necessary for callbacks advancing to
happen.
The following litmus test shows the kind of guarantee that this barrier
provides:
C smp_mb__after_unlock_lock
{}
// rcu_gp_cleanup()
P0(spinlock_t *rnp_lock, int *gpnum)
{
// Grace period cleanup increase gp sequence number
spin_lock(rnp_lock);
WRITE_ONCE(*gpnum, 1);
spin_unlock(rnp_lock);
}
// nocb_gp_wait()
P1(spinlock_t *rnp_lock, spinlock_t *nocb_lock, int *gpnum, int *cb_ready)
{
int r1;
// Call rcu_advance_cbs() from nocb_gp_wait()
spin_lock(nocb_lock);
spin_lock(rnp_lock);
smp_mb__after_unlock_lock();
r1 = READ_ONCE(*gpnum);
WRITE_ONCE(*cb_ready, 1);
spin_unlock(rnp_lock);
spin_unlock(nocb_lock);
}
// nocb_cb_wait()
P2(spinlock_t *nocb_lock, int *cb_ready, int *cb_executed)
{
int r2;
// rcu_do_batch() -> rcu_segcblist_extract_done_cbs()
spin_lock(nocb_lock);
r2 = READ_ONCE(*cb_ready);
spin_unlock(nocb_lock);
// Actual callback execution
WRITE_ONCE(*cb_executed, 1);
}
P3(int *cb_executed, int *gpnum)
{
int r3;
WRITE_ONCE(*cb_executed, 2);
smp_mb();
r3 = READ_ONCE(*gpnum);
}
exists (1:r1=1 /\ 2:r2=1 /\ cb_executed=2 /\ 3:r3=0) (* Bad outcome. *)
Here the bad outcome only occurs if the smp_mb__after_unlock_lock() is
removed. This barrier orders the grace period completion against
callbacks advancing and even later callbacks invocation, thanks to the
opportunistic propagation via the ->nocb_lock to nocb_cb_wait().
Therefore the smp_mb() placed after callbacks advancing can be safely
removed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
When the CPU goes idle for the last time during the CPU down hotplug
process, RCU reports a final quiescent state for the current CPU. If
this quiescent state propagates up to the top, some tasks may then be
woken up to complete the grace period: the main grace period kthread
and/or the expedited main workqueue (or kworker).
If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwith timer to the local offline CPU. Since this happens
after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
timer gets ignored. Therefore if the RCU kthreads are waiting for RT
bandwidth to be available, they may never be actually scheduled.
This triggers TREE03 rcutorture hangs:
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6)
rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2eb/0xa80
schedule+0x1f/0x90
schedule_timeout+0x163/0x270
? __pfx_process_timeout+0x10/0x10
rcu_gp_fqs_loop+0x37c/0x5b0
? __pfx_rcu_gp_kthread+0x10/0x10
rcu_gp_kthread+0x17c/0x200
kthread+0xde/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2b/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
The situation can't be solved with just unpinning the timer. The hrtimer
infrastructure and the nohz heuristics involved in finding the best
remote target for an unpinned timer would then also need to handle
enqueues from an offline CPU in the most horrendous way.
So fix this on the RCU side instead and defer the wake up to an online
CPU if it's too late for the local one.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
If an rcutorture test scenario creates an fqs_task kthread, it will
periodically invoke rcu_force_quiescent_state() in order to start
force-quiescent-state (FQS) operations. However, an FQS operation
will be started even if there is no RCU grace period in progress.
Although testing FQS operations startup when there is no grace period in
progress is necessary, it need not happen all that often. This commit
therefore causes rcu_force_quiescent_state() to take an early exit
if there is no grace period in progress.
Note that there will still be attempts to start an FQS scan in the
absence of a grace period because the grace period might end right
after the rcu_force_quiescent_state() function's check. In actual
testing, this happens about once every ten minutes, which should
provide adequate testing.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
* Fix a lock inversion between scheduler and RCU introduced
in v6.2-rc4. The scenario could trigger on any user of RCU_NOCB
(mostly Android and also nohz_full).
* Fix PF_IDLE semantic changes introduced in v6.6-rc3 breaking some
RCU-Tasks and RCU-Tasks-Trace expectations as to what exactly is
an idle task. This resulted in potential spurious stalls and
warnings.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmVFB/QACgkQhSRUR1CO
jHctUg/+LjAHNvT8YR99qxP3v8Ylqf191ADhDla1Zx1LYkhcrApdRKQU5G1dzfiF
arJ/KSSgCLCYMecwdh6iOL5s3S3DtQGsh7Y7f/dJ6vwJbNe2NbPaMYK2WQ171zT+
ht7oAugFrOpWn5H6i1eC170N+rG7nHYuCXrzsta0fx1fOp9G0OeemvweD7HjpaB7
66tp0KnuJWLfPLoZucscxKRtC7ZG0zDztFUXcfcv+TQE6w5fXuhtLX7E7fE1jPMr
Pa3x9Su5fjFCRJCCCFNHzKqbf8VKG/teGOONFBwt8AwwU1/7TuIuwSOjrISOaG91
eEeIwWyiAw+aP+7CMMGca0oYPOR+b3Xq7vmN85KIEFUHpNls6MvR5/itR1d3/FsA
9EkiNLIwbRV9ICi+JeGeJsSmm6TRJL4XToiw9fJKfC05IHjotMSAtUOqHS/VmdjL
o1kDdiWBSHUy7VVelu5wRHb0WINc9Zrq/tc3MT2W0EpTDRSL2r04MsP8UDYre1oY
KDdTLSPsy2D+5YWY484Z+bCOHl0bXSgZe4PycGVrP9HEus6MGHyxU6uDCXwWbqQz
R0QdZ60AdquwWTdK1K0PsU5VKQw3oZKKwj8hyKOr31NpSZ1FvLJa8Ct8D2UZPbMS
WNscgGE4ZNjCYFq8wtnL5LkUym8is1Y6/nLifBV1scx5GtyHYiU=
=a5L7
-----END PGP SIGNATURE-----
Merge tag 'rcu-fixes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull RCU fixes from Frederic Weisbecker:
- Fix a lock inversion between scheduler and RCU introduced in
v6.2-rc4. The scenario could trigger on any user of RCU_NOCB
(mostly Android but also nohz_full)
- Fix PF_IDLE semantic changes introduced in v6.6-rc3 breaking
some RCU-Tasks and RCU-Tasks-Trace expectations as to what
exactly is an idle task. This resulted in potential spurious
stalls and warnings.
* tag 'rcu-fixes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
rcu/tasks-trace: Handle new PF_IDLE semantics
rcu/tasks: Handle new PF_IDLE semantics
rcu: Introduce rcu_cpu_online()
rcu: Break rcu_node_0 --> &rq->__lock order
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series "Fixes and cleanups to compaction".
- Joel Fernandes has a patchset ("Optimize mremap during mutual
alignment within PMD") which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested.
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series "Do not try to access unaccepted memory" Adrian Hunter
provides some fixups for the recently-added "unaccepted memory' feature.
To increase the feature's checking coverage. "Plug a few gaps where
RAM is exposed without checking if it is unaccepted memory".
- In the series "cleanups for lockless slab shrink" Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code.
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series "use refcount+RCU method to implement
lockless slab shrink".
- David Hildenbrand contributes some maintenance work for the rmap code
in the series "Anon rmap cleanups".
- Kefeng Wang does more folio conversions and some maintenance work in
the migration code. Series "mm: migrate: more folio conversion and
unification".
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series "Add and use bdev_getblk()".
- In the series "Use nth_page() in place of direct struct page
manipulation" Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames.
- In the series "mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO" has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of gigantic
pages are in use.
- Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
rationalization and folio conversions in the hugetlb code.
- Yin Fengwei has improved mlock()'s handling of large folios in the
series "support large folio for mlock"
- In the series "Expose swapcache stat for memcg v1" Liu Shixin has
added statistics for memcg v1 users which are available (and useful)
under memcg v2.
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named "MDWE
without inheritance".
- Kefeng Wang has provided the series "mm: convert numa balancing
functions to use a folio" which does what it says.
- In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
makes is possible for a process to propagate KSM treatment across
exec().
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use "high
bandwidth memory" in addition to Optane Data Center Persistent Memory
Modules (DCPMM). The series is named "memory tiering: calculate
abstract distance based on ACPI HMAT"
- In the series "Smart scanning mode for KSM" Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans.
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
series "mm: memcg: fix tracking of pending stats updates values".
- In the series "Implement IOCTL to get and optionally clear info about
PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
us to atomically read-then-clear page softdirty state. This is mainly
used by CRIU.
- Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
- a bunch of relatively minor maintenance tweaks to this code.
- Matthew Wilcox has increased the use of the VMA lock over file-backed
page faults in the series "Handle more faults under the VMA lock". Some
rationalizations of the fault path became possible as a result.
- In the series "mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
and folio conversions.
- In the series "various improvements to the GUP interface" Lorenzo
Stoakes has simplified and improved the GUP interface with an eye to
providing groundwork for future improvements.
- Andrey Konovalov has sent along the series "kasan: assorted fixes and
improvements" which does those things.
- Some page allocator maintenance work from Kemeng Shi in the series
"Two minor cleanups to break_down_buddy_pages".
- In thes series "New selftest for mm" Breno Leitao has developed
another MM self test which tickles a race we had between madvise() and
page faults.
- In the series "Add folio_end_read" Matthew Wilcox provides cleanups
and an optimization to the core pagecache code.
- Nhat Pham has added memcg accounting for hugetlb memory in the series
"hugetlb memcg accounting".
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series "Abstract vma_merge() and split_vma()".
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series "Fix page_owner's use of free timestamps".
- Lorenzo Stoakes has fixed the handling of new mappings of sealed files
in the series "permit write-sealed memfd read-only shared mappings".
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series "Batch hugetlb vmemmap modification operations".
- Some buffer_head folio conversions and cleanups from Matthew Wilcox in
the series "Finish the create_empty_buffers() transition".
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the series
"mm: PCP high auto-tuning".
- Roman Gushchin has contributed the patchset "mm: improve performance
of accounted kernel memory allocations" which improves their performance
by ~30% as measured by a micro-benchmark.
- folio conversions from Kefeng Wang in the series "mm: convert page
cpupid functions to folios".
- Some kmemleak fixups in Liu Shixin's series "Some bugfix about
kmemleak".
- Qi Zheng has improved our handling of memoryless nodes by keeping them
off the allocation fallback list. This is done in the series "handle
memoryless nodes more appropriately".
- khugepaged conversions from Vishal Moola in the series "Some
khugepaged folio conversions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
FgeUPAD1oasg6CP+INZvCj34waNxwAc=
=E+Y4
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series 'Fixes and cleanups to compaction'
- Joel Fernandes has a patchset ('Optimize mremap during mutual
alignment within PMD') which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i
the following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series 'Do not try to access unaccepted memory' Adrian
Hunter provides some fixups for the recently-added 'unaccepted
memory' feature. To increase the feature's checking coverage. 'Plug
a few gaps where RAM is exposed without checking if it is
unaccepted memory'
- In the series 'cleanups for lockless slab shrink' Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series 'use refcount+RCU method to
implement lockless slab shrink'
- David Hildenbrand contributes some maintenance work for the rmap
code in the series 'Anon rmap cleanups'
- Kefeng Wang does more folio conversions and some maintenance work
in the migration code. Series 'mm: migrate: more folio conversion
and unification'
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series 'Add and use bdev_getblk()'
- In the series 'Use nth_page() in place of direct struct page
manipulation' Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames
- In the series 'mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO' has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of
gigantic pages are in use
- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
rationalization and folio conversions in the hugetlb code
- Yin Fengwei has improved mlock()'s handling of large folios in the
series 'support large folio for mlock'
- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
added statistics for memcg v1 users which are available (and
useful) under memcg v2
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named 'MDWE
without inheritance'
- Kefeng Wang has provided the series 'mm: convert numa balancing
functions to use a folio' which does what it says
- In the series 'mm/ksm: add fork-exec support for prctl' Stefan
Roesch makes is possible for a process to propagate KSM treatment
across exec()
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use 'high
bandwidth memory' in addition to Optane Data Center Persistent
Memory Modules (DCPMM). The series is named 'memory tiering:
calculate abstract distance based on ACPI HMAT'
- In the series 'Smart scanning mode for KSM' Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in
the series 'mm: memcg: fix tracking of pending stats updates
values'
- In the series 'Implement IOCTL to get and optionally clear info
about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
which permits us to atomically read-then-clear page softdirty
state. This is mainly used by CRIU
- Hugh Dickins contributed the series 'shmem,tmpfs: general
maintenance', a bunch of relatively minor maintenance tweaks to
this code
- Matthew Wilcox has increased the use of the VMA lock over
file-backed page faults in the series 'Handle more faults under the
VMA lock'. Some rationalizations of the fault path became possible
as a result
- In the series 'mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()' David Hildenbrand has implemented some
cleanups and folio conversions
- In the series 'various improvements to the GUP interface' Lorenzo
Stoakes has simplified and improved the GUP interface with an eye
to providing groundwork for future improvements
- Andrey Konovalov has sent along the series 'kasan: assorted fixes
and improvements' which does those things
- Some page allocator maintenance work from Kemeng Shi in the series
'Two minor cleanups to break_down_buddy_pages'
- In thes series 'New selftest for mm' Breno Leitao has developed
another MM self test which tickles a race we had between madvise()
and page faults
- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
and an optimization to the core pagecache code
- Nhat Pham has added memcg accounting for hugetlb memory in the
series 'hugetlb memcg accounting'
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series 'Abstract vma_merge() and split_vma()'
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series 'Fix page_owner's use of free timestamps'
- Lorenzo Stoakes has fixed the handling of new mappings of sealed
files in the series 'permit write-sealed memfd read-only shared
mappings'
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series 'Batch hugetlb vmemmap modification operations'
- Some buffer_head folio conversions and cleanups from Matthew Wilcox
in the series 'Finish the create_empty_buffers() transition'
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the
series 'mm: PCP high auto-tuning'
- Roman Gushchin has contributed the patchset 'mm: improve
performance of accounted kernel memory allocations' which improves
their performance by ~30% as measured by a micro-benchmark
- folio conversions from Kefeng Wang in the series 'mm: convert page
cpupid functions to folios'
- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
kmemleak'
- Qi Zheng has improved our handling of memoryless nodes by keeping
them off the allocation fallback list. This is done in the series
'handle memoryless nodes more appropriately'
- khugepaged conversions from Vishal Moola in the series 'Some
khugepaged folio conversions'"
[ bcachefs conflicts with the dynamically allocated shrinkers have been
resolved as per Stephen Rothwell in
https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/
with help from Qi Zheng.
The clone3 test filtering conflict was half-arsed by yours truly ]
* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
mm/damon/sysfs: update monitoring target regions for online input commit
mm/damon/sysfs: remove requested targets when online-commit inputs
selftests: add a sanity check for zswap
Documentation: maple_tree: fix word spelling error
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
zswap: export compression failure stats
Documentation: ubsan: drop "the" from article title
mempolicy: migration attempt to match interleave nodes
mempolicy: mmap_lock is not needed while migrating folios
mempolicy: alloc_pages_mpol() for NUMA policy without vma
mm: add page_rmappable_folio() wrapper
mempolicy: remove confusing MPOL_MF_LAZY dead code
mempolicy: mpol_shared_policy_init() without pseudo-vma
mempolicy trivia: use pgoff_t in shared mempolicy tree
mempolicy trivia: slightly more consistent naming
mempolicy trivia: delete those ancient pr_debug()s
mempolicy: fix migrate_pages(2) syscall return nr_failed
kernfs: drop shared NUMA mempolicy hooks
hugetlbfs: drop shared NUMA mempolicy pretence
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
...
Export the RCU point of view as to when a CPU is considered offline
(ie: when does RCU consider that a CPU is sufficiently down in the
hotplug process to not feature any possible read side).
This will be used by RCU-tasks whose vision of an offline CPU should
reasonably match the one of RCU core.
Fixes: cff9b2332a ("kernel/sched: Modify initial boot task idle setup")
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Commit 851a723e45 ("sched: Always clear user_cpus_ptr in
do_set_cpus_allowed()") added a kfree() call to free any user
provided affinity mask, if present. It was changed later to use
kfree_rcu() in commit 9a5418bc48 ("sched/core: Use kfree_rcu()
in do_set_cpus_allowed()") to avoid a circular locking dependency
problem.
It turns out that even kfree_rcu() isn't safe for avoiding
circular locking problem. As reported by kernel test robot,
the following circular locking dependency now exists:
&rdp->nocb_lock --> rcu_node_0 --> &rq->__lock
Solve this by breaking the rcu_node_0 --> &rq->__lock chain by moving
the resched_cpu() out from under rcu_node lock.
[peterz: heavily borrowed from Waiman's Changelog]
[paulmck: applied Z qiang feedback]
Fixes: 851a723e45 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/oe-lkp/202310302207.a25f1a30-oliver.sang@intel.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
rcu_report_dead() and rcutree_migrate_callbacks() have their headers in
rcupdate.h while those are pure rcutree calls, like the other CPU-hotplug
functions.
Also rcu_cpu_starting() and rcu_report_dead() have different naming
conventions while they mirror each other's effects.
Fix the headers and propose a naming that relates both functions and
aligns with the prefix of other rcutree CPU-hotplug functions.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Among the three CPU-hotplug teardown RCU callbacks, two of them early
exit if CONFIG_HOTPLUG_CPU=n, and one is left unchanged. In any case
all of them have an implementation when CONFIG_HOTPLUG_CPU=n.
Align instead with the common way to deal with CPU-hotplug teardown
callbacks and provide a proper stub when they are not supported.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
rcu_report_dead() has to be called locally by the CPU that is going to
exit the RCU state machine. Passing a cpu argument here is error-prone
and leaves the possibility for a racy remote call.
Use local access instead.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
rcu_report_dead() is the last RCU word from the CPU down through the
hotplug path. It is called in the idle loop right before the CPU shuts
down for good. Because it removes the CPU from the grace period state
machine and reports an ultimate quiescent state if necessary, no further
use of RCU is allowed. Therefore it is expected that IRQs are disabled
upon calling this function and are not to be re-enabled again until the
CPU shuts down.
Remove the IRQs disablement from that function and verify instead that
it is actually called with IRQs disabled as it is expected at that
special point in the idle path.
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Since the actual slab freeing is deferred when calling kvfree_rcu(), so
is the kmemleak_free() callback informing kmemleak of the object
deletion. From the perspective of the kvfree_rcu() caller, the object is
freed and it may remove any references to it. Since kmemleak does not
scan RCU internal data storing the pointer, it will report such objects
as leaks during the grace period.
Tell kmemleak to ignore such objects on the kvfree_call_rcu() path. Note
that the tiny RCU implementation does not have such issue since the
objects can be tracked from the rcu_ctrlblk structure.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://lore.kernel.org/all/F903A825-F05F-4B77-A2B5-7356282FBA2C@apple.com/
Cc: <stable@vger.kernel.org>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
When using rcutorture as a module, there are a number of conditions that
can abort the modprobe operation, for example, when attempting to run
both RCU CPU stall warning tests and forward-progress tests. This can
cause rcu_torture_cleanup() to be invoked on the unwind path out of
rcu_rcu_torture_init(), which will mean that rcu_gp_slow_unregister()
is invoked without a matching rcu_gp_slow_register(). This will cause
a splat because rcu_gp_slow_unregister() is passed rcu_fwd_cb_nodelay,
which does not match a NULL pointer.
This commit therefore forgives a mismatch involving a NULL pointer, thus
avoiding this false-positive splat.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
When a structure containing an RCU callback rhp is (incorrectly) freed
and reallocated after rhp is passed to call_rcu(), it is not unusual for
rhp->func to be set to NULL. This defeats the debugging prints used by
__call_rcu_common() in kernels built with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y,
which expect to identify the offending code using the identity of this
function.
And in kernels build without CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, things
are even worse, as can be seen from this splat:
Unable to handle kernel NULL pointer dereference at virtual address 0
... ...
PC is at 0x0
LR is at rcu_do_batch+0x1c0/0x3b8
... ...
(rcu_do_batch) from (rcu_core+0x1d4/0x284)
(rcu_core) from (__do_softirq+0x24c/0x344)
(__do_softirq) from (__irq_exit_rcu+0x64/0x108)
(__irq_exit_rcu) from (irq_exit+0x8/0x10)
(irq_exit) from (__handle_domain_irq+0x74/0x9c)
(__handle_domain_irq) from (gic_handle_irq+0x8c/0x98)
(gic_handle_irq) from (__irq_svc+0x5c/0x94)
(__irq_svc) from (arch_cpu_idle+0x20/0x3c)
(arch_cpu_idle) from (default_idle_call+0x4c/0x78)
(default_idle_call) from (do_idle+0xf8/0x150)
(do_idle) from (cpu_startup_entry+0x18/0x20)
(cpu_startup_entry) from (0xc01530)
This commit therefore adds calls to mem_dump_obj(rhp) to output some
information, for example:
slab kmalloc-256 start ffff410c45019900 pointer offset 0 size 256
This provides the rough size of the memory block and the offset of the
rcu_head structure, which as least provides at least a few clues to help
locate the problem. If the problem is reproducible, additional slab
debugging can be enabled, for example, CONFIG_DEBUG_SLAB=y, which can
provide significantly more information.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
When running a series of stress tests all making heavy use of RCU,
it is all too possible to OOM the system when the prior test's RCU
callbacks don't get invoked until after the subsequent test starts.
One way of handling this is just a timed wait, but this fails when a
given CPU has so many callbacks queued that they take longer to invoke
than allowed for by that timed wait.
This commit therefore adds an rcutree.do_rcu_barrier module parameter that
is accessible from sysfs. Writing one of the many synonyms for boolean
"true" will cause an rcu_barrier() to be invoked, but will guarantee that
no more than one rcu_barrier() will be invoked per sixteenth of a second
via this mechanism. The flip side is that a given request might wait a
second or three longer than absolutely necessary, but only when there are
multiple uses of rcutree.do_rcu_barrier within a one-second time interval.
This commit unnecessarily serializes the rcu_barrier() machinery, given
that serialization is already provided by procfs. This has the advantage
of allowing throttled rcu_barrier() from other sources within the kernel.
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
The return keyword is not needed here.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
There are instances where rcu_cpu_stall_reset() is called when jiffies
did not get a chance to update for a long time. Before jiffies is
updated, the CPU stall detector can go off triggering false-positives
where a just-started grace period appears to be ages old. In the past,
we disabled stall detection in rcu_cpu_stall_reset() however this got
changed [1]. This is resulting in false-positives in KGDB usecase [2].
Fix this by deferring the update of jiffies to the third run of the FQS
loop. This is more robust, as, even if rcu_cpu_stall_reset() is called
just before jiffies is read, we would end up pushing out the jiffies
read by 3 more FQS loops. Meanwhile the CPU stall detection will be
delayed and we will not get any false positives.
[1] https://lore.kernel.org/all/20210521155624.174524-2-senozhatsky@chromium.org/
[2] https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/
Tested with rcutorture.cpu_stall option as well to verify stall behavior
with/without patch.
Tested-by: Huacai Chen <chenhuacai@loongson.cn>
Reported-by: Binbin Zhou <zhoubinbin@loongson.cn>
Closes: https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/
Suggested-by: Paul McKenney <paulmck@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: a80be428fb ("rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
The rcu_request_urgent_qs_task() function does a cross-CPU store
to ->rcu_urgent_qs, so this commit therefore marks the load in
__rcu_irq_enter_check_tick() with READ_ONCE().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Make it clear that this function always returns either true or false
without other planned failure modes.
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
The rcu_tasks_invoke_cbs() function relies on queue_work_on() to silently
fall back to WORK_CPU_UNBOUND when the specified CPU is offline. However,
the queue_work_on() function's silent fallback mechanism relies on that
CPU having been online at some time in the past. When queue_work_on()
is passed a CPU that has never been online, workqueue lockups ensue,
which can be bad for your kernel's general health and well-being.
This commit therefore checks whether a given CPU has ever been online,
and, if not substitutes WORK_CPU_UNBOUND in the subsequent call to
queue_work_on(). Why not simply omit the queue_work_on() call entirely?
Because this function is flooding callback-invocation notifications
to all CPUs, and must deal with possibilities that include a sparse
cpu_possible_mask.
This commit also moves the setting of the rcu_data structure's
->beenonline field to rcu_cpu_starting(), which executes on the
incoming CPU before that CPU has ever enabled interrupts. This ensures
that the required workqueues are present. In addition, because the
incoming CPU has not yet enabled its interrupts, there cannot yet have
been any softirq handlers running on this CPU, which means that the
WARN_ON_ONCE(!rdp->beenonline) within the RCU_SOFTIRQ handler cannot
have triggered yet.
Fixes: d363f833c6 ("rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, rcu_cpu_starting() is written so that it might be invoked
with interrupts enabled. However, it is always called when interrupts
are disabled, either by rcu_init(), notify_cpu_starting(), or from a
call point prior to the call to notify_cpu_starting().
But why bother requiring that interrupts be disabled? The purpose is
to allow the rcu_data structure's ->beenonline flag to be set after all
early processing has completed for the incoming CPU, thus allowing this
flag to be used to determine when workqueues have been set up for the
incoming CPU, while still allowing this flag to be used as a diagnostic
within rcu_core().
This commit therefore makes rcu_cpu_starting() rely on interrupts being
disabled.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_data structure's ->rcu_cpu_has_work field can be modified by
any CPU attempting to wake up the rcuc kthread. Therefore, this commit
marks accesses to this field from the rcu_cpu_kthread() function.
This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, if there are more than 100 ready-to-invoke RCU callbacks queued
on a given CPU, the rcu_do_batch() function sets a timeout for invocation
of the series. This timeout defaulting to three milliseconds, and may
be adjusted using the rcutree.rcu_resched_ns kernel boot parameter.
This timeout is checked using local_clock(), but the overhead of this
function combined with the common-case very small callback-invocation
overhead means that local_clock() is checked every 32nd invocation.
This works well except for longer-than average callbacks. For example,
a series of 500-microsecond-duration callbacks means that local_clock()
is checked only once every 16 milliseconds, which makes it difficult to
enforce a three-millisecond timeout.
This commit therefore adds a Kconfig option RCU_DOUBLE_CHECK_CB_TIME
that enables backup timeout checking using the coarser grained but
lighter weight jiffies. If the jiffies counter detects a timeout,
then local_clock() is consulted even if this is not the 32nd callback.
This prevents the aforementioned 16-millisecond latency blow.
Reported-by: Domas Mituzas <dmituzas@meta.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, a callback-invocation time limit is enforced only for
callbacks invoked from the softirq environment, the rationale being
that when callbacks are instead invoked from rcuc and rcuoc kthreads,
these callbacks cannot be holding up other softirq vectors.
Which is in fact true. However, if an rcuc kthread spends too much time
invoking callbacks, it can delay quiescent-state reports from its CPU,
which can also be a problem.
This commit therefore applies the callback-invocation time limit to
callback invocation from the rcuc kthreads as well as from softirq.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If the rcutree.rcu_min_cached_objs kernel boot parameter is set to zero,
then krcp->page_cache_work will never be triggered to fill page cache.
In addition, the put_cached_bnode() will not fill page cache. As a
result krcp->bkvcache will always be empty, so there is no need to acquire
krcp->lock to get page from krcp->bkvcache. This commit therefore makes
drain_page_cache() return immediately if the rcu_min_cached_objs is zero.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When the fill_page_cache_func() function is invoked, it assumes that
the cache of pages is completely empty. However, there can be some time
between triggering execution of this function and its actual invocation.
During this time, kfree_rcu_work() might run, and might fill in part or
all of this cache of pages, thus invalidating the fill_page_cache_func()
function's assumption.
This will not overfill the cache because put_cached_bnode() will reject
the extra page. However, it will result in a needless allocation and
freeing of one extra page, which might not be helpful under lowish-memory
conditions.
This commit therefore causes the fill_page_cache_func() to explicitly
account for pages that have been placed into the cache shortly before
it starts running.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
By default the cache size is 5 pages per CPU, but it can be disabled at
boot time by setting the rcu_min_cached_objs to zero. When that happens,
the current code will uselessly set an hrtimer to schedule refilling this
cache with zero pages. This commit therefore streamlines this process
by simply refusing the set the hrtimer when rcu_min_cached_objs is zero.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The add_ptr_to_bulk_krc_lock() function is invoked to allocate a new
kfree_rcu() page, also known as a kvfree_rcu_bulk_data structure.
The kfree_rcu_cpu structure's lock is used to protect this operation,
except that this lock must be momentarily dropped when allocating memory.
It is clearly important that the lock that is reacquired be the same
lock that was acquired initially via krc_this_cpu_lock().
Unfortunately, this same krc_this_cpu_lock() function is used to
re-acquire this lock, and if the task migrated to some other CPU during
the memory allocation, this will result in the kvfree_rcu_bulk_data
structure being added to the wrong CPU's kfree_rcu_cpu structure.
This commit therefore replaces that second call to krc_this_cpu_lock()
with raw_spin_lock_irqsave() in order to explicitly acquire the lock on
the correct kfree_rcu_cpu structure, thus keeping things straight even
when the task migrates.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If kvfree_rcu_bulk() sees that the required grace period has failed to
elapse, it leaks the memory because readers might still be using it.
But in that case, the debug-objects subsystem still marks the relevant
structures as having been freed, even though they are instead being
leaked.
This commit fixes this mismatch by invoking debug_rcu_bhead_unqueue()
only when we are actually going to free the objects.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Under low-memory conditions, kvfree_rcu() will use each object's
rcu_head structure to queue objects in a singly linked list headed by
the kfree_rcu_cpu structure's ->head field. This list is passed to
call_rcu() as a unit, but there is no indication of which grace period
this list needs to wait for. This in turn prevents adding debug checks
in the kfree_rcu_work() as was done for the two page-of-pointers channels
in the kfree_rcu_cpu structure.
This commit therefore adds a ->head_free_gp_snap field to the
kfree_rcu_cpu_work structure to record this grace-period number. It also
adds a WARN_ON_ONCE() to kfree_rcu_monitor() that checks to make sure
that the required grace period has in fact elapsed.
[ paulmck: Fix kerneldoc issue raised by Stephen Rothwell. ]
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds debugging checks to verify that the required RCU
grace period has elapsed for each kvfree_rcu_bulk_data structure that
arrives at the kvfree_rcu_bulk() function. These checks make use
of that structure's ->gp_snap field, which has been upgraded from an
unsigned long to an rcu_gp_oldstate structure. This upgrade reduces
the chances of false positives to nearly zero, even on 32-bit systems,
for which this structure carries 64 bits of state.
Cc: Ziwei Dai <ziwei.dai@unisoc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
o MAINTAINERS files additions and changes.
o Fix hotplug warning in nohz code.
o Tick dependency changes by Zqiang.
o Lazy-RCU shrinker fixes by Zqiang.
o rcu-tasks stall reporting improvements by Neeraj.
o Initial changes for renaming of k[v]free_rcu() to its new k[v]free_rcu_mightsleep()
name for robustness.
o Documentation Updates:
o Significant changes to srcu_struct size.
o Deadlock detection for srcu_read_lock() vs synchronize_srcu() from Boqun.
o rcutorture and rcu-related tool, which are targeted for v6.4 from Boqun's tree.
o Other misc changes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmQuBnIACgkQqA4nf2o4
5hACVRAAoXu7/gfh5Pjw9O4E4pCdPJKsZZVYrcrVGrq6NAxRn6M1SgurAdC5grj2
96x0waoGaiO82V0H5iJMcKdAVu67x9R8WaQ1JoxN75Efn8h9W4TguB87TV1gk0xS
eZ18b/CyEaM5mNb80DFFF4FLohy5737p/kNTMqXQdUyR1BsDl16iRMgjiBiFhNUx
yPo8Y2kC2U2OTbldZgaE7s9bQO3xxEcifx93sGWsAex/gx54FYNisiwSlCOSgOE+
XkYo/OKk8Xvr82tLVX8XQVEPCMJ+rxea8T5zSs8/alvsPq7gA8wW3y6fsoa3vUU/
+Gd+W+Q/OsONIDtp8rQAY1qsD0ScDpaR8052RSH0zTa7pj8HsQgE5PjZ+cJW0SEi
cKN+Oe8+ETqKald+xZ6PDf58O212VLrru3RpQWrOQcJ7fmKmfT4REK0RcbLgg4qT
CBgOo6eg+ub4pxq2y11LZJBNTv1/S7xAEzFE0kArew64KB2gyVud0VJRZVAJnEfe
93QQVDFrwK2bhgWQZ6J6IbTvGeQW0L93IibuaU6jhZPR283VtUIIvM7vrOylN7Fq
4jsae0T7YGYfKUhgTpm7rCnm8A/D3Ni8MY0sKYYgDSyKmZUsnpI5wpx1xke4lwwV
ErrY46RCFa+k8wscc6iWfB4cGXyyFHyu+wtyg0KpFn5JAzcfz4A=
=Rgbj
-----END PGP SIGNATURE-----
Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux
Pull RCU updates from Joel Fernandes:
- Updates and additions to MAINTAINERS files, with Boqun being added to
the RCU entry and Zqiang being added as an RCU reviewer.
I have also transitioned from reviewer to maintainer; however, Paul
will be taking over sending RCU pull-requests for the next merge
window.
- Resolution of hotplug warning in nohz code, achieved by fixing
cpu_is_hotpluggable() through interaction with the nohz subsystem.
Tick dependency modifications by Zqiang, focusing on fixing usage of
the TICK_DEP_BIT_RCU_EXP bitmask.
- Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
kernels, fixed by Zqiang.
- Improvements to rcu-tasks stall reporting by Neeraj.
- Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
increased robustness, affecting several components like mac802154,
drbd, vmw_vmci, tracing, and more.
A report by Eric Dumazet showed that the API could be unknowingly
used in an atomic context, so we'd rather make sure they know what
they're asking for by being explicit:
https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/
- Documentation updates, including corrections to spelling,
clarifications in comments, and improvements to the srcu_size_state
comments.
- Better srcu_struct cache locality for readers, by adjusting the size
of srcu_struct in support of SRCU usage by Christoph Hellwig.
- Teach lockdep to detect deadlocks between srcu_read_lock() vs
synchronize_srcu() contributed by Boqun.
Previously lockdep could not detect such deadlocks, now it can.
- Integration of rcutorture and rcu-related tools, targeted for v6.4
from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
module parameter, and more
- Miscellaneous changes, various code cleanups and comment improvements
* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
checkpatch: Error out if deprecated RCU API used
mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
rcu/trace: use strscpy() to instead of strncpy()
tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
...
Memory passed to kvfree_rcu() that is to be freed is tracked by a
per-CPU kfree_rcu_cpu structure, which in turn contains pointers
to kvfree_rcu_bulk_data structures that contain pointers to memory
that has not yet been handed to RCU, along with an kfree_rcu_cpu_work
structure that tracks the memory that has already been handed to RCU.
These structures track three categories of memory: (1) Memory for
kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived
during an OOM episode. The first two categories are tracked in a
cache-friendly manner involving a dynamically allocated page of pointers
(the aforementioned kvfree_rcu_bulk_data structures), while the third
uses a simple (but decidedly cache-unfriendly) linked list through the
rcu_head structures in each block of memory.
On a given CPU, these three categories are handled as a unit, with that
CPU's kfree_rcu_cpu_work structure having one pointer for each of the
three categories. Clearly, new memory for a given category cannot be
placed in the corresponding kfree_rcu_cpu_work structure until any old
memory has had its grace period elapse and thus has been removed. And
the kfree_rcu_monitor() function does in fact check for this.
Except that the kfree_rcu_monitor() function checks these pointers one
at a time. This means that if the previous kfree_rcu() memory passed
to RCU had only category 1 and the current one has only category 2, the
kfree_rcu_monitor() function will send that current category-2 memory
along immediately. This can result in memory being freed too soon,
that is, out from under unsuspecting RCU readers.
To see this, consider the following sequence of events, in which:
o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset",
then is preempted.
o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset"
after a later grace period. Except that "from_cset" is freed
right after the previous grace period ended, so that "from_cset"
is immediately freed. Task A resumes and references "from_cset"'s
member, after which nothing good happens.
In full detail:
CPU 0 CPU 1
---------------------- ----------------------
count_memcg_event_mm()
|rcu_read_lock() <---
|mem_cgroup_from_task()
|// css_set_ptr is the "from_cset" mentioned on CPU 1
|css_set_ptr = rcu_dereference((task)->cgroups)
|// Hard irq comes, current task is scheduled out.
cgroup_attach_task()
|cgroup_migrate()
|cgroup_migrate_execute()
|css_set_move_task(task, from_cset, to_cset, true)
|cgroup_move_task(task, to_cset)
|rcu_assign_pointer(.., to_cset)
|...
|cgroup_migrate_finish()
|put_css_set_locked(from_cset)
|from_cset->refcount return 0
|kfree_rcu(cset, rcu_head) // free from_cset after new gp
|add_ptr_to_bulk_krc_lock()
|schedule_delayed_work(&krcp->monitor_work, ..)
kfree_rcu_monitor()
|krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
|queue_rcu_work(system_wq, &krwp->rcu_work)
|if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
|call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp
// There is a perious call_rcu(.., rcu_work_rcufn)
// gp end, rcu_work_rcufn() is called.
rcu_work_rcufn()
|__queue_work(.., rwork->wq, &rwork->work);
|kfree_rcu_work()
|krwp->bulk_head_free[0] bulk is freed before new gp end!!!
|The "from_cset" is freed before new gp end.
// the task resumes some time later.
|css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed.
This commit therefore causes kfree_rcu_monitor() to refrain from moving
kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU
grace period has completed for all three categories.
v2: Use helper function instead of inserted code block at kfree_rcu_monitor().
Fixes: 34c8817455 ("rcu: Support kfree_bulk() interface in kfree_rcu()")
Fixes: 5f3c8d6204 ("rcu/tree: Maintain separate array for vmalloc ptrs")
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
According to the commit log of the patch that added it to the kernel,
start_poll_synchronize_rcu_expedited() can be invoked very early, as
in long before rcu_init() has been invoked. But before rcu_init(),
the rcu_data structure's ->mynode field has not yet been initialized.
This means that the start_poll_synchronize_rcu_expedited() function's
attempt to set the CPU's leaf rcu_node structure's ->exp_seq_poll_rq
field will result in a segmentation fault.
This commit therefore causes start_poll_synchronize_rcu_expedited() to
set ->exp_seq_poll_rq only after rcu_init() has initialized all CPUs'
rcu_data structures' ->mynode fields. It also removes the check from
the rcu_init() function so that start_poll_synchronize_rcu_expedited(
is unconditionally invoked. Yes, this might result in an unnecessary
boot-time grace period, but this is down in the noise.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
The rcu_accelerate_cbs() function is invoked by rcu_report_qs_rdp()
only if there is a grace period in progress that is still blocked
by at least one CPU on this rcu_node structure. This means that
rcu_accelerate_cbs() should never return the value true, and thus that
this function should never set the needwake variable and in turn never
invoke rcu_gp_kthread_wake().
This commit therefore removes the needwake variable and the invocation
of rcu_gp_kthread_wake() in favor of a WARN_ON_ONCE() on the call to
rcu_accelerate_cbs(). The purpose of this new WARN_ON_ONCE() is to
detect situations where the system's opinion differs from ours.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
This commit adds a comment to help explain why the "else" clause of the
in_serving_softirq() "if" statement does not need to enforce a time limit.
The reason is that this "else" clause handles rcuoc kthreads that do not
block handlers for other softirq vectors.
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
During suspend, we see failures to suspend 1 in 300-500 suspends.
Looking closer, it appears that asynchronous RCU callbacks are being
queued as lazy even though synchronous callbacks are expedited. These
delays appear to not be very welcome by the suspend/resume code as
evidenced by these occasional suspend failures.
This commit modifies call_rcu() to check if rcu_async_should_hurry(),
which will return true if we are in suspend or in-kernel boot.
[ paulmck: Alphabetize local variables. ]
Ignoring the lazy hint makes the 3000 suspend/resume cycles pass
reliably on a 12th gen 12-core Intel CPU, and there is some evidence
that it also slightly speeds up boot performance.
Fixes: 3cb278e73b ("rcu: Make call_rcu() lazy to save power")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Boot and suspend/resume should not be slowed down in kernels built with
CONFIG_RCU_LAZY=y. In particular, suspend can sometimes fail in such
kernels.
This commit therefore adds rcu_async_hurry(), rcu_async_relax(), and
rcu_async_should_hurry() functions that track whether or not either
a boot or a suspend/resume operation is in progress. This will
enable a later commit to refrain from laziness during those times.
Export rcu_async_should_hurry(), rcu_async_hurry(), and rcu_async_relax()
for later use by rcutorture.
[ paulmck: Apply feedback from Steve Rostedt. ]
Fixes: 3cb278e73b ("rcu: Make call_rcu() lazy to save power")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_boost_kthread_setaffinity() function is invoked at
rcutree_online_cpu() and rcutree_offline_cpu() time, early in the online
timeline and late in the offline timeline, respectively. It is also
invoked from rcutree_dead_cpu(), however, in the absence of userspace
manipulations (for which userspace must take responsibility), this call
is redundant with that from rcutree_offline_cpu(). This redundancy can
be demonstrated by printing out the relevant cpumasks
This commit therefore removes the call to rcu_boost_kthread_setaffinity()
from rcutree_dead_cpu().
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Because RCU CPU stall warnings are driven from the scheduling-clock
interrupt handler, a workload consisting of a very large number of
short-duration hardware interrupts can result in misleading stall-warning
messages. On systems supporting only a single level of interrupts,
that is, where interrupts handlers cannot be interrupted, this can
produce misleading diagnostics. The stack traces will show the
innocent-bystander interrupted task, not the interrupts that are
at the very least exacerbating the stall.
This situation can be improved by displaying the number of interrupts
and the CPU time that they have consumed. Diagnosing other types
of stalls can be eased by also providing the count of softirqs and
the CPU time that they consumed as well as the number of context
switches and the task-level CPU time consumed.
Consider the following output given this change:
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 0-....: (1250 ticks this GP) <omitted>
rcu: hardirqs softirqs csw/system
rcu: number: 624 45 0
rcu: cputime: 69 1 2425 ==> 2500(ms)
This output shows that the number of hard and soft interrupts is small,
there are no context switches, and the system takes up a lot of time. This
indicates that the current task is looping with preemption disabled.
The impact on system performance is negligible because snapshot is
recorded only once for all continuous RCU stalls.
This added debugging information is suppressed by default and can be
enabled by building the kernel with CONFIG_RCU_CPU_STALL_CPUTIME=y or
by booting with rcupdate.rcu_cpu_stall_cputime=1.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This patch splits the lists of objects so as to avoid sending any
through RCU that have already been queued for more than one grace
period. These long-term-resident objects are immediately freed.
The remaining short-term-resident objects are queued for later freeing
using queue_rcu_work().
This change avoids delaying workqueue handlers with synchronize_rcu()
invocations. Yes, workqueue handlers are designed to handle blocking,
but avoiding blocking when unnecessary improves performance during
low-memory situations.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The schedule_delayed_monitor_work() function relies on the count of
objects queued into any given kfree_rcu_cpu structure. This count is
used to determine how quickly to schedule passing these objects to RCU.
There are three pipes where pointers can be placed. When any pipe is
offloaded, the kfree_rcu_cpu structure's ->count counter is set to zero,
which is wrong because the other pipes might still be non-empty.
This commit therefore maintains per-pipe counters, and introduces a
krc_count() helper to access the aggregate value of those counters.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The need_offload_krc() function is now lock-free, which gives the
compiler freedom to load old values from plain C-language loads from
the kfree_rcu_cpu struture's ->head pointer. This commit therefore
applied READ_ONCE() to these loads.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently all objects placed into a batch wait for a full grace period
to elapse after that batch is ready to send to RCU. However, this
can unnecessarily delay freeing of the first objects that were added
to the batch. After all, several RCU grace periods might have elapsed
since those objects were added, and if so, there is no point in further
deferring their freeing.
This commit therefore adds per-page grace-period snapshots which are
obtained from get_state_synchronize_rcu(). When the batch is ready
to be passed to call_rcu(), each page's snapshot is checked by passing
it to poll_state_synchronize_rcu(). If a given page's RCU grace period
has already elapsed, its objects are freed immediately by kvfree_rcu_bulk().
Otherwise, these objects are freed after a call to synchronize_rcu().
This approach requires that the pages be traversed in reverse order,
that is, the oldest ones first.
Test example:
kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \
--kconfig CONFIG_NR_CPUS=64 \
--kconfig CONFIG_RCU_NOCB_CPU=y \
--kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
--kconfig CONFIG_RCU_LAZY=n \
--bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 \
rcuscale.holdoff=20 rcuscale.kfree_loops=10000 \
torture.disable_onoff_at_boot" --trust-make
Before this commit:
Total time taken by all kfree'ers: 8535693700 ns, loops: 10000, batches: 1188, memory footprint: 2248MB
Total time taken by all kfree'ers: 8466933582 ns, loops: 10000, batches: 1157, memory footprint: 2820MB
Total time taken by all kfree'ers: 5375602446 ns, loops: 10000, batches: 1130, memory footprint: 6502MB
Total time taken by all kfree'ers: 7523283832 ns, loops: 10000, batches: 1006, memory footprint: 3343MB
Total time taken by all kfree'ers: 6459171956 ns, loops: 10000, batches: 1150, memory footprint: 6549MB
After this commit:
Total time taken by all kfree'ers: 8560060176 ns, loops: 10000, batches: 1787, memory footprint: 61MB
Total time taken by all kfree'ers: 8573885501 ns, loops: 10000, batches: 1777, memory footprint: 93MB
Total time taken by all kfree'ers: 8320000202 ns, loops: 10000, batches: 1727, memory footprint: 66MB
Total time taken by all kfree'ers: 8552718794 ns, loops: 10000, batches: 1790, memory footprint: 75MB
Total time taken by all kfree'ers: 8601368792 ns, loops: 10000, batches: 1724, memory footprint: 62MB
The reduction in memory footprint is well in excess of an order of
magnitude.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The need_offload_krc() function currently holds the krcp->lock in order
to safely check krcp->head. This commit removes the need for this lock
in that function by updating the krcp->head pointer using WRITE_ONCE()
macro so that readers can carry out lockless loads of that pointer.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The kvfree_rcu() code maintains lists of pages of pointers, but also a
singly linked list, with the latter being used when memory allocation
fails. Traversal of these two types of lists is currently open coded.
This commit simplifies the code by providing kvfree_rcu_bulk() and
kvfree_rcu_list() functions, respectively, to traverse these two types
of lists. This patch does not introduce any functional change.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit improves the readability and maintainability of the
kvfree_rcu() code by switching from an open-coded linked list to
the standard Linux-kernel circular doubly linked list. This patch
does not introduce any functional change.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently a kvfree_call_rcu() takes an offset within a structure as
a second parameter, so a helper such as a kvfree_rcu_arg_2() has to
convert rcu_head and a freed ptr to an offset in order to pass it. That
leads to an extra conversion on macro entry.
Instead of converting, refactor the code in way that a pointer that has
to be freed is passed directly to the kvfree_call_rcu().
This patch does not make any functional change and is transparent to
all kvfree_rcu() users.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit tests synchronize_rcu() and synchronize_rcu_expedited()
at the end of rcu_init(), in addition to the test already at the
beginning of that function. These tests are run only in kernels built
with CONFIG_PROVE_RCU=y.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, rcu_blocking_is_gp() invokes might_sleep() even during early
boot when interrupts are disabled and before the scheduler is scheduling.
This is at best an accident waiting to happen. Therefore, this commit
moves that might_sleep() under an rcu_scheduler_active check in order
to ensure that might_sleep() is not invoked unless sleeping might actually
happen.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit emphasizes the possibility of concurrent calls to
synchronize_rcu() and synchronize_rcu_expedited() causing one or
the other of the two grace periods being lost from the viewpoint of
poll_state_synchronize_rcu().
If you cannot afford to lose grace periods this way, you should
instead use the _full() variants of the polled RCU API, for
example, poll_state_synchronize_rcu_full().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, rcu_do_batch() sizes its batches based on the total number
of callbacks in the callback list. This can result in some strange
choices, for example, if there was 12,800 callbacks in the list, but
only 200 were ready to invoke, RCU would invoke 100 at a time (12,800
shifted down by seven bits).
A more measured approach would use the number that were actually ready
to invoke, an approach that has become feasible only recently given the
per-segment ->seglen counts in ->cblist.
This commit therefore bases the batch limit on the number of callbacks
ready to invoke instead of on the total number of callbacks.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit consolidates the initialization and CPU-hotplug code at
the end of kernel/rcu/tree.c. This is strictly a code-motion commit.
No functionality has changed.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit fixes a lockdep false positive in synchronize_rcu() that
can otherwise occur during early boot. Theis fix simply avoids invoking
lockdep if the scheduler has not yet been initialized, that is, during
that portion of boot when interrupts are disabled.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmOeXj8THHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jPmZEACaI5JqO6Dr2U4HojJJBYEfLVaSYxDp
JrUi5D5WzzZidyjM2fyyZZkdRVQ24i1aV2H/fbLoIIH/smYjE/KLEFHQmclpphw5
BSOyapotjdt5YhIavvAeOjdUd7jPyMqhbDVnwzjnblhUD1ObLVlhIs8Pjn7/03sF
gzlIhYgp3EL7GenT9j9kud2FwWP+wrVQ7SdJ+Ni/WAHYO8860xQAmFXH/07bYzx7
fbp5iPkCOSSUoRMw/qQ8s7CE3XhBNKufv1BtcvV/uxEtutfV1qvEQBv/l2RBd0Vg
wOVBZnWXze+7IUx13M90R/d04Nn7RaGwon6xBMlvIwL3qzEj8x/r1FYz7zZhQPkv
wwChAxFHQACnLCZSu48WBtVrawNdZHM57KHUK4rloAbrK92FpVznhQU+5pBDy4c6
rfY2my+SNO4kWvePEg/2fd8aQycrZr99fK/ojCIerEn8MNboxuVOYTjzy0qtUcVT
yJ/80O8ADI3QL/NRhjMFWgEnBDbHN1PcGhiRoutApdLQkg/UPTJjCRZ7ibmIFYY2
ViW3cSndr/f0I7sOex2EILHwiZ2bUKiwyeTW6vWuFl/7MEWsvpJaWoUxXgQj99Bt
ncAOaxtmmuhbwrOCt2kab90A0c/thNx9kNYYIkG3vUNcSRzyHQtg3ydEljBpaTFR
OzhrqdUA7W9Sfg==
=UKUo
-----END PGP SIGNATURE-----
Merge tag 'rcu-urgent.2022.12.17a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fix from Paul McKenney:
"This fixes a lockdep false positive in synchronize_rcu() that can
otherwise occur during early boot.
The fix simply avoids invoking lockdep if the scheduler has not yet
been initialized, that is, during that portion of boot when interrupts
are disabled"
* tag 'rcu-urgent.2022.12.17a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu: Don't assert interrupts enabled too early in boot
The rcu_poll_gp_seq_end() and rcu_poll_gp_seq_end_unlocked() both check
that interrupts are enabled, as they normally should be when waiting for
an RCU grace period. Except that it is legal to wait for grace periods
during early boot, before interrupts have been enabled for the first time,
and polling for grace periods is required to work during this time.
This can result in false-positive lockdep splats in the presence of
boot-time-initiated tracing.
This commit therefore conditions those interrupts-enabled checks on
rcu_scheduler_active having advanced past RCU_SCHEDULER_INACTIVE, by
which time interrupts have been enabled.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This pull request contains the following branches:
doc.2022.10.20a: Documentation updates. This is the second
in a series from an ongoing review of the RCU documentation.
fixes.2022.10.21a: Miscellaneous fixes.
lazy.2022.11.30a: Introduces a default-off Kconfig option that depends
on RCU_NOCB_CPU that, on CPUs mentioned in the nohz_full or
rcu_nocbs boot-argument CPU lists, causes call_rcu() to introduce
delays. These delays result in significant power savings on
nearly idle Android and ChromeOS systems. These savings range
from a few percent to more than ten percent.
This series also includes several commits that change call_rcu()
to a new call_rcu_hurry() function that avoids these delays in
a few cases, for example, where timely wakeups are required.
Several of these are outside of RCU and thus have acks and
reviews from the relevant maintainers.
srcunmisafe.2022.11.09a: Creates an srcu_read_lock_nmisafe() and an
srcu_read_unlock_nmisafe() for architectures that support NMIs,
but which do not provide NMI-safe this_cpu_inc(). These NMI-safe
SRCU functions are required by the upcoming lockless printk()
work by John Ogness et al.
That printk() series depends on these commits, so if you pull
the printk() series before this one, you will have already
pulled in this branch, plus two more SRCU commits:
0cd7e350ab ("rcu: Make SRCU mandatory")
51f5f78a4f ("srcu: Make Tiny synchronize_srcu() check for readers")
These two commits appear to work well, but do not have
sufficient testing exposure over a long enough time for me to
feel comfortable pushing them unless something in mainline is
definitely going to use them immediately, and currently only
the new printk() work uses them.
torture.2022.10.18c: Changes providing minor but important increases
in test coverage for the new RCU polled-grace-period APIs.
torturescript.2022.10.20a: Changes that avoid redundant kernel builds,
thus providing about a 30% speedup for the torture.sh acceptance
test.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmOKnS8THHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jCMiD/4weraRjmcLhZ3tz2vgTI8ZsXdIiCfU
vCln0AOKroVo37S4BhViVfryV2D4VFfEb1UY6EgxNFu7Jd3z0seQShZh/5r8bFMU
p0E6TC8PwyKUpQstTOwOynkw6BWGW1qeL620PpBNRAy4MkxL8AGv40tHRIHEeAzc
cCTax2+xW9ae0ZtAZHDDCUAzpYpcjScIf4OZ3tkSaFCcpWZijg+dN60dnsZ9l7h9
DtqKH61rszXAtxkmN9Fs9OY5MPCXi9Es6LVYq6KN06jqxwJRqmYf+pai3apmNIOf
P8isXOQG58tbhBLpNCG58UBSkjI2GG8Lcq6hYr6d/7Ukm7RF49q8eL7OQlVrJMuQ
Zi2DVTEAu2U3pzdTC14gi3RvqP7dO+psBs+LpGXtj4RxYvAP99e9KSRcG14j/Wwa
L52AetBzBXTCS5nhPOG8RP22d8HRZLxMe9x7T8iVCDuwH4M1zTF5cVzLeEdgPAD7
tdX4eV16PLt1AvhCEuHU/2v520gc2K9oGXLI1A6kzquXh7FflcPWl5WS+sYUbB/p
gBsblz7C3I5GgSoW4aAMnkukZiYgSvVql8ZyRwQuRzvLpYcofMpoanZbcufDjuw9
N5QzAaMmzHnBu3hOJS2WaSZRZ73fed3NO8jo8q8EMfYeWK3NAHybBdaQqSTgsO8i
s+aN+LZ4s5MnRw==
=eMOr
-----END PGP SIGNATURE-----
Merge tag 'rcu.2022.12.02a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Documentation updates. This is the second in a series from an ongoing
review of the RCU documentation.
- Miscellaneous fixes.
- Introduce a default-off Kconfig option that depends on RCU_NOCB_CPU
that, on CPUs mentioned in the nohz_full or rcu_nocbs boot-argument
CPU lists, causes call_rcu() to introduce delays.
These delays result in significant power savings on nearly idle
Android and ChromeOS systems. These savings range from a few percent
to more than ten percent.
This series also includes several commits that change call_rcu() to a
new call_rcu_hurry() function that avoids these delays in a few
cases, for example, where timely wakeups are required. Several of
these are outside of RCU and thus have acks and reviews from the
relevant maintainers.
- Create an srcu_read_lock_nmisafe() and an srcu_read_unlock_nmisafe()
for architectures that support NMIs, but which do not provide
NMI-safe this_cpu_inc(). These NMI-safe SRCU functions are required
by the upcoming lockless printk() work by John Ogness et al.
- Changes providing minor but important increases in torture test
coverage for the new RCU polled-grace-period APIs.
- Changes to torturescript that avoid redundant kernel builds, thus
providing about a 30% speedup for the torture.sh acceptance test.
* tag 'rcu.2022.12.02a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (49 commits)
net: devinet: Reduce refcount before grace period
net: Use call_rcu_hurry() for dst_release()
workqueue: Make queue_rcu_work() use call_rcu_hurry()
percpu-refcount: Use call_rcu_hurry() for atomic switch
scsi/scsi_error: Use call_rcu_hurry() instead of call_rcu()
rcu/rcutorture: Use call_rcu_hurry() where needed
rcu/rcuscale: Use call_rcu_hurry() for async reader test
rcu/sync: Use call_rcu_hurry() instead of call_rcu
rcuscale: Add laziness and kfree tests
rcu: Shrinker for lazy rcu
rcu: Refactor code a bit in rcu_nocb_do_flush_bypass()
rcu: Make call_rcu() lazy to save power
rcu: Implement lockdep_rcu_enabled for !CONFIG_DEBUG_LOCK_ALLOC
srcu: Debug NMI safety even on archs that don't require it
srcu: Explain the reason behind the read side critical section on GP start
srcu: Warn when NMI-unsafe API is used in NMI
arch/s390: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option
arch/loongarch: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option
rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()
rcu-tasks: Make grace-period-age message human-readable
...
Implement timer-based RCU callback batching (also known as lazy
callbacks). With this we save about 5-10% of power consumed due
to RCU requests that happen when system is lightly loaded or idle.
By default, all async callbacks (queued via call_rcu) are marked
lazy. An alternate API call_rcu_hurry() is provided for the few users,
for example synchronize_rcu(), that need the old behavior.
The batch is flushed whenever a certain amount of time has passed, or
the batch on a particular CPU grows too big. Also memory pressure will
flush it in a future patch.
To handle several corner cases automagically (such as rcu_barrier() and
hotplug), we re-use bypass lists which were originally introduced to
address lock contention, to handle lazy CBs as well. The bypass list
length has the lazy CB length included in it. A separate lazy CB length
counter is also introduced to keep track of the number of lazy CBs.
[ paulmck: Fix formatting of inline call_rcu_lazy() definition. ]
[ paulmck: Apply Zqiang feedback. ]
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]
Suggested-by: Paul McKenney <paulmck@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Running rcutorture with non-zero fqs_duration module parameter in a
kernel built with CONFIG_PREEMPTION=y results in the following splat:
BUG: using __this_cpu_read() in preemptible [00000000]
code: rcu_torture_fqs/398
caller is __this_cpu_preempt_check+0x13/0x20
CPU: 3 PID: 398 Comm: rcu_torture_fqs Not tainted 6.0.0-rc1-yoctodev-standard+
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x86
dump_stack+0x10/0x16
check_preemption_disabled+0xe5/0xf0
__this_cpu_preempt_check+0x13/0x20
rcu_force_quiescent_state.part.0+0x1c/0x170
rcu_force_quiescent_state+0x1e/0x30
rcu_torture_fqs+0xca/0x160
? rcu_torture_boost+0x430/0x430
kthread+0x192/0x1d0
? kthread_complete_and_exit+0x30/0x30
ret_from_fork+0x22/0x30
</TASK>
The problem is that rcu_force_quiescent_state() uses __this_cpu_read()
in preemptible code instead of the proper raw_cpu_read(). This commit
therefore changes __this_cpu_read() to raw_cpu_read().
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The commit 3fcd6a230f ("x86/cpu: Avoid cpuinfo-induced IPIing of
idle CPUs") introduced rcu_is_idle_cpu() in order to identify the
current CPU idle state. But commit f3eca381bd ("x86/aperfmperf:
Replace arch_freq_get_on_cpu()") switched to using MAX_SAMPLE_AGE,
so rcu_is_idle_cpu() is no longer used. This commit therefore removes it.
Fixes: f3eca381bd ("x86/aperfmperf: Replace arch_freq_get_on_cpu()")
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Making polled RCU grace periods account for expedited grace periods
required acquiring the leaf rcu_node structure's lock during early boot,
but after rcu_init() was called. This lock is irq-disabled, but the
code incorrectly assumes that irqs are always disabled when invoking
synchronize_rcu(). The exception is early boot before the scheduler has
started, which means that upon return from synchronize_rcu(), irqs will
be incorrectly enabled.
This commit fixes this bug by using irqsave/irqrestore locking primitives.
Fixes: bf95b2bc3e ("rcu: Switch polled grace-period APIs to ->gp_seq_polled")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In preparation for RCU lazy changes, wake up the RCU nocb gp thread if
needed after an entrain. This change prevents the RCU barrier callback
from waiting in the queue for several seconds before the lazy callbacks
in front of it are serviced.
Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rnp->qsmask is locklessly accessed from rcutree_dying_cpu(). This
may help avoid load tearing due to concurrent access, KCSAN
issues, and preserve sanity of people reading the mask in tracing.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_report_dead() function invokes rcu_report_exp_rdp() in order
to force an immediate expedited quiescent state on the outgoing
CPU, and then it invokes rcu_preempt_deferred_qs() to provide any
required deferred quiescent state of either sort. Because the call to
rcu_preempt_deferred_qs() provides the expedited RCU quiescent state if
requested, the call to rcu_report_exp_rdp() is potentially redundant.
One possible issue is a concurrent start of a new expedited RCU
grace period, but this situation is already handled correctly
by __sync_rcu_exp_select_node_cpus(). This function will detect
that the CPU is going offline via the error return from its call
to smp_call_function_single(). In that case, it will retry, and
eventually stop retrying due to rcu_report_exp_rdp() clearing the
->qsmaskinitnext bit corresponding to the target CPU. As a result,
__sync_rcu_exp_select_node_cpus() will report the necessary quiescent
state after dealing with any remaining CPU.
This change assumes that control does not enter rcu_report_dead() within
an RCU read-side critical section, but then again, the surviving call
to rcu_preempt_deferred_qs() has always made this assumption.
This commit therefore removes the call to rcu_report_exp_rdp(), thus
relying on rcu_preempt_deferred_qs() to handle both normal and expedited
quiescent states.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>