mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-09-04 20:19:47 +08:00
ef1b51f773
1152 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
77a05940ee |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar: "The biggest changes in this cycle were: - Make kcpustat vtime aware (Frederic Weisbecker) - Rework the CFS load_balance() logic (Vincent Guittot) - Misc cleanups, smaller enhancements, fixes. The load-balancing rework is the most intrusive change: it replaces the old heuristics that have become less meaningful after the introduction of the PELT metrics, with a grounds-up load-balancing algorithm. As such it's not really an iterative series, but replaces the old load-balancing logic with the new one. We hope there are no performance regressions left - but statistically it's highly probable that there *is* going to be some workload that is hurting from these chnages. If so then we'd prefer to have a look at that workload and fix its scheduling, instead of reverting the changes" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) rackmeter: Use vtime aware kcpustat accessor leds: Use all-in-one vtime aware kcpustat accessor cpufreq: Use vtime aware kcpustat accessors for user time procfs: Use all-in-one vtime aware kcpustat accessor sched/vtime: Bring up complete kcpustat accessor sched/cputime: Support other fields on kcpustat_field() sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util() sched/fair: Add comments for group_type and balancing at SD_NUMA level sched/fair: Fix rework of find_idlest_group() sched/uclamp: Fix overzealous type replacement sched/Kconfig: Fix spelling mistake in user-visible help text sched/core: Further clarify sched_class::set_next_task() sched/fair: Use mul_u32_u32() sched/core: Simplify sched_class::pick_next_task() sched/core: Optimize pick_next_task() sched/core: Make pick_next_task_idle() more consistent sched/fair: Better document newidle_balance() leds: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM cpufreq: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM procfs: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM ... |
||
|
|
fb4b3d3fd0 |
for-5.5/io_uring-20191121
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WxNwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgps4kD/9SIDXhYhhE8fNqeAF7Uouu8fxgwnkY3hSI
43vJwCziiDxWWJH5mYW7/83VNOMZKHIbiYMnU6iEUsRQ/sG/wI0wEfAQZDHLzCKt
cko2q7zAC1/4rtoslwJ3q04hE2Ap/nb93ELZBVr7fOAuODBNFUp/vifAojvsMPKz
hNMNPq/vYg7c/iYMZKSBdtjE3tqceFNBjAVNMB9dHKQLeexEy4ve7AjBeawWsSi7
GesnQ5w5u5LqkMYwLslpv/oVjHiiFWgGnDAvBNvykQvVy+DfB54KSqMV11W1aqdU
l6L+ENfZasEvlk1yMAth2Foq4vlscm5MKEb6VdJhXWHHXtXkcBmz7RBqPmjSvXCY
wS5GZRw8oYtTcid0aQf+t/wgRNTDJsGsnsT32qto41No3Z7vlIDHUDxHZGTA+gEL
E8j9rDx6EXMTo3EFbC8XZcfsorhPJ1HKAyw1YFczHtYzJEQUR9jJe3f/Q9u6K2Vy
s/EhkVeHa/lEd7kb6mI+6lQjGe1FXl7AHauDuaaEfIOZA/xJB3Bad5Wjq1va1cUO
TX+37zjzFzJghhSIBGYq7G7iT4AMecPQgxHzCdCyYfW5S4Uur9tMmIElwVPI/Pjl
kDZ9gdg9lm6JifZ9Ab8QcGhuQQTF3frwX9VfgrVgcqyvm38AiYzVgL9ZJnxRS/Cy
ZfLNkACXqQ==
=YZ9s
-----END PGP SIGNATURE-----
Merge tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"A lot of stuff has been going on this cycle, with improving the
support for networked IO (and hence unbounded request completion
times) being one of the major themes. There's been a set of fixes done
this week, I'll send those out as well once we're certain we're fully
happy with them.
This contains:
- Unification of the "normal" submit path and the SQPOLL path (Pavel)
- Support for sparse (and bigger) file sets, and updating of those
file sets without needing to unregister/register again.
- Independently sized CQ ring, instead of just making it always 2x
the SQ ring size. This makes it more flexible for networked
applications.
- Support for overflowed CQ ring, never dropping events but providing
backpressure on submits.
- Add support for absolute timeouts, not just relative ones.
- Support for generic cancellations. This divorces io_uring from
workqueues as well, which additionally gets us one step closer to
generic async system call support.
- With cancellations, we can support grabbing the process file table
as well, just like we do mm context. This allows support for system
calls that create file descriptors, like accept4() support that's
built on top of that.
- Support for io_uring tracing (Dmitrii)
- Support for linked timeouts. These abort an operation if it isn't
completed by the time noted in the linke timeout.
- Speedup tracking of poll requests
- Various cleanups making the coder easier to follow (Jackie, Pavel,
Bob, YueHaibing, me)
- Update MAINTAINERS with new io_uring list"
* tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block: (64 commits)
io_uring: make POLL_ADD/POLL_REMOVE scale better
io-wq: remove now redundant struct io_wq_nulls_list
io_uring: Fix getting file for non-fd opcodes
io_uring: introduce req_need_defer()
io_uring: clean up io_uring_cancel_files()
io-wq: ensure free/busy list browsing see all items
io-wq: ensure we have a stable view of ->cur_work for cancellations
io_wq: add get/put_work handlers to io_wq_create()
io_uring: check for validity of ->rings in teardown
io_uring: fix potential deadlock in io_poll_wake()
io_uring: use correct "is IO worker" helper
io_uring: fix -ENOENT issue with linked timer with short timeout
io_uring: don't do flush cancel under inflight_lock
io_uring: flag SQPOLL busy condition to userspace
io_uring: make ASYNC_CANCEL work with poll and timeout
io_uring: provide fallback request for OOM situations
io_uring: convert accept4() -ERESTARTSYS into -EINTR
io_uring: fix error clear of ->file_table in io_sqe_files_register()
io_uring: separate the io_free_req and io_free_req_find_next interface
io_uring: keep io_put_req only responsible for release and put req
...
|
||
|
|
b21feab0b8 |
Linux 5.4-rc8
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl3RzgkeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGN18H/0JZbfIpy8/4Irol 0va7Aj2fBi1a5oxfqYsMKN0u3GKbN3OV9tQ+7w1eBNGvL72TGadgVTzTY+Im7A9U UjboAc7jDPCG+YhIwXFufMiIAq5jDIj6h0LDas7ALsMfsnI/RhTwgNtLTAkyI3dH YV/6ljFULwueJHCxzmrYbd1x39PScj3kCNL2pOe6On7rXMKOemY/nbbYYISxY30E GMgKApSS+li7VuSqgrKoq5Qaox26LyR2wrXB1ij4pqEJ9xgbnKRLdHuvXZnE+/5p 46EMirt+yeSkltW3d2/9MoCHaA76ESzWMMDijLx7tPgoTc3RB3/3ZLsm3rYVH+cR cRlNNSk= =0+Cg -----END PGP SIGNATURE----- Merge tag 'v5.4-rc8' into sched/core, to pick up fixes and dependencies Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
7763baace1 |
sched/uclamp: Fix overzealous type replacement
Some uclamp helpers had their return type changed from 'unsigned int' to 'enum uclamp_id' by commit |
||
|
|
6e1ff0773f |
sched/uclamp: Fix incorrect condition
uclamp_update_active() should perform the update when
p->uclamp[clamp_id].active is true. But when the logic was inverted in
[1], the if condition wasn't inverted correctly too.
[1] https://lore.kernel.org/lkml/20190902073836.GO2369@hirez.programming.kicks-ass.net/
Reported-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Patrick Bellasi <patrick.bellasi@matbug.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes:
|
||
|
|
ff51ff84d8 |
sched/core: Avoid spurious lock dependencies
While seemingly harmless, __sched_fork() does hrtimer_init(), which,
when DEBUG_OBJETS, can end up doing allocations.
This then results in the following lock order:
rq->lock
zone->lock.rlock
batched_entropy_u64.lock
Which in turn causes deadlocks when we do wakeups while holding that
batched_entropy lock -- as the random code does.
Solve this by moving __sched_fork() out from under rq->lock. This is
safe because nothing there relies on rq->lock, as also evident from the
other __sched_fork() callsite.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: bigeasy@linutronix.de
Cc: cl@linux.com
Cc: keescook@chromium.org
Cc: penberg@kernel.org
Cc: rientjes@google.com
Cc: thgarnie@google.com
Cc: tytso@mit.edu
Cc: will@kernel.org
Fixes:
|
||
|
|
98c2f700ed |
sched/core: Simplify sched_class::pick_next_task()
Now that the indirect class call never uses the last two arguments of pick_next_task(), remove them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: juri.lelli@redhat.com Cc: ktkhai@virtuozzo.com Cc: mgorman@suse.de Cc: qais.yousef@arm.com Cc: qperret@google.com Cc: rostedt@goodmis.org Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20191108131909.660595546@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
5d7d605642 |
sched/core: Optimize pick_next_task()
Ever since we moved the sched_class definitions into their own files,
the constant expression {fair,idle}_sched_class.pick_next_task() is
not in fact a compile time constant anymore and results in an indirect
call (barring LTO).
Fix that by exposing pick_next_task_{fair,idle}() directly, this gets
rid of the indirect call (and RETPOLINE) on the fast path.
Also remove the unlikely() from the idle case, it is in fact /the/ way
we select idle -- and that is a very common thing to do.
Performance for will-it-scale/sched_yield improves by 2% (as reported
by 0-day).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.603037345@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
f488e1057b |
sched/core: Make pick_next_task_idle() more consistent
Only pick_next_task_fair() needs the @prev and @rf argument; these are required to implement the cpu-cgroup optimization. None of the other pick_next_task() methods need this. Make pick_next_task_idle() more consistent. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: juri.lelli@redhat.com Cc: ktkhai@virtuozzo.com Cc: mgorman@suse.de Cc: qais.yousef@arm.com Cc: qperret@google.com Cc: rostedt@goodmis.org Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20191108131909.545730862@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
6e2df0581f |
sched: Fix pick_next_task() vs 'change' pattern race
Commit |
||
|
|
e3b8b6a0d1 |
sched/core: Fix compilation error when cgroup not selected
When cgroup is disabled the following compilation error was hit
kernel/sched/core.c: In function ‘uclamp_update_active_tasks’:
kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known
struct css_task_iter it;
^~
kernel/sched/core.c:1084:2: error: implicit declaration of function ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? [-Werror=implicit-function-declaration]
css_task_iter_start(css, 0, &it);
^~~~~~~~~~~~~~~~~~~
__sg_page_iter_start
kernel/sched/core.c:1085:14: error: implicit declaration of function ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? [-Werror=implicit-function-declaration]
while ((p = css_task_iter_next(&it))) {
^~~~~~~~~~~~~~~~~~
__sg_page_iter_next
kernel/sched/core.c:1091:2: error: implicit declaration of function ‘css_task_iter_end’; did you mean ‘get_task_cred’? [-Werror=implicit-function-declaration]
css_task_iter_end(&it);
^~~~~~~~~~~~~~~~~
get_task_cred
kernel/sched/core.c:1081:23: warning: unused variable ‘it’ [-Wunused-variable]
struct css_task_iter it;
^~
cc1: some warnings being treated as errors
make[2]: *** [kernel/sched/core.o] Error 1
Fix by protetion uclamp_update_active_tasks() with
CONFIG_UCLAMP_TASK_GROUP
Fixes:
|
||
|
|
771b53d033 |
io-wq: small threadpool implementation for io_uring
This adds support for io-wq, a smaller and specialized thread pool implementation. This is meant to replace workqueues for io_uring. Among the reasons for this addition are: - We can assign memory context smarter and more persistently if we manage the life time of threads. - We can drop various work-arounds we have in io_uring, like the async_list. - We can implement hashed work insertion, to manage concurrency of buffered writes without needing a) an extra workqueue, or b) needlessly making the concurrency of said workqueue very low which hurts performance of multiple buffered file writers. - We can implement cancel through signals, for cancelling interruptible work like read/write (or send/recv) to/from sockets. - We need the above cancel for being able to assign and use file tables from a process. - We can implement a more thorough cancel operation in general. - We need it to move towards a syslet/threadlet model for even faster async execution. For that we need to take ownership of the used threads. This list is just off the top of my head. Performance should be the same, or better, at least that's what I've seen in my testing. io-wq supports basic NUMA functionality, setting up a pool per node. io-wq hooks up to the scheduler schedule in/out just like workqueue and uses that to drive the need for more/less workers. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
5facae4f35 |
locking/lockdep: Remove unused @nested argument from lock_release()
Since the following commit:
|
||
|
|
dff3a85fec |
sched_setattr: switch to copy_struct_from_user()
Switch sched_setattr() syscall from it's own copying struct sched_attr
from userspace to the new dedicated copy_struct_from_user() helper.
The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Ideally we could also
unify sched_getattr(2)-style syscalls as well, but unfortunately the
correct semantics for such syscalls are much less clear (see [1] for
more detail). In future we could come up with a more sane idea for how
the syscall interface should look.
[1]: commit
|
||
|
|
9c5efe9ae7 |
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar: - Apply a number of membarrier related fixes and cleanups, which fixes a use-after-free race in the membarrier code - Introduce proper RCU protection for tasks on the runqueue - to get rid of the subtle task_rcu_dereference() interface that was easy to get wrong - Misc fixes, but also an EAS speedup * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Avoid redundant EAS calculation sched/core: Remove double update_max_interval() call on CPU startup sched/core: Fix preempt_schedule() interrupt return comment sched/fair: Fix -Wunused-but-set-variable warnings sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr() sched/membarrier: Return -ENOMEM to userspace on memory allocation failure sched/membarrier: Skip IPIs when mm->mm_users == 1 selftests, sched/membarrier: Add multi-threaded test sched/membarrier: Fix p->mm->membarrier_state racy load sched/membarrier: Call sync_core only before usermode for same mm sched/membarrier: Remove redundant check sched/membarrier: Fix private expedited registration check tasks, sched/core: RCUify the assignment of rq->curr tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue tasks: Add a count of task RCU users sched/core: Convert vcpu_is_preempted() from macro to an inline function sched/fair: Remove unused cfs_rq_clock_task() function |
||
|
|
9fc41acc89 |
sched/core: Remove double update_max_interval() call on CPU startup
update_max_interval() is called in both CPUHP_AP_SCHED_STARTING's startup and teardown callbacks, but it turns out it's also called at the end of the startup callback of CPUHP_AP_ACTIVE (which is further down the startup sequence). There's no point in repeating this interval update in the startup sequence since the CPU will remain online until it goes down the teardown path. Remove the redundant call in sched_cpu_activate() (CPUHP_AP_ACTIVE). Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: juri.lelli@redhat.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20190923093017.11755-1-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
a49b4f4012 |
sched/core: Fix preempt_schedule() interrupt return comment
preempt_schedule_irq() is the one that should be called on return from interrupt, clean up the comment to avoid any ambiguity. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-m68k@lists.linux-m68k.org Cc: linux-riscv@lists.infradead.org Cc: uclinux-h8-devel@lists.sourceforge.jp Link: https://lkml.kernel.org/r/20190923143620.29334-2-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
714e501e16 |
sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
An oops can be triggered in the scheduler when running qemu on arm64: Unable to handle kernel paging request at virtual address ffff000008effe40 Internal error: Oops: 96000007 [#1] SMP Process migration/0 (pid: 12, stack limit = 0x00000000084e3736) pstate: 20000085 (nzCv daIf -PAN -UAO) pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20 lr : move_queued_task.isra.21+0x124/0x298 ... Call trace: __ll_sc___cmpxchg_case_acq_4+0x4/0x20 __migrate_task+0xc8/0xe0 migration_cpu_stop+0x170/0x180 cpu_stopper_thread+0xec/0x178 smpboot_thread_fn+0x1ac/0x1e8 kthread+0x134/0x138 ret_from_fork+0x10/0x18 __set_cpus_allowed_ptr() will choose an active dest_cpu in affinity mask to migrage the process if process is not currently running on any one of the CPUs specified in affinity mask. __set_cpus_allowed_ptr() will choose an invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if CPUS in an affinity mask are deactived by cpu_down after cpumask_intersects check. cpumask_test_cpu() of dest_cpu afterwards is overflown and may pass if corresponding bit is coincidentally set. As a consequence, kernel will access an invalid rq address associate with the invalid CPU in migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs. The reproduce the crash: 1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling sched_setaffinity. 2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online" and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn. 3) Oops appears if the invalid CPU is set in memory after tested cpumask. Signed-off-by: KeMeng Shi <shikemeng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
227a4aadc7 |
sched/membarrier: Fix p->mm->membarrier_state racy load
The membarrier_state field is located within the mm_struct, which is not guaranteed to exist when used from runqueue-lock-free iteration on runqueues by the membarrier system call. Copy the membarrier_state from the mm_struct into the scheduler runqueue when the scheduler switches between mm. When registering membarrier for mm, after setting the registration bit in the mm membarrier state, issue a synchronize_rcu() to ensure the scheduler observes the change. In order to take care of the case where a runqueue keeps executing the target mm without swapping to other mm, iterate over each runqueue and issue an IPI to copy the membarrier_state from the mm_struct into each runqueue which have the same mm which state has just been modified. Move the mm membarrier_state field closer to pgd in mm_struct to use a cache line already touched by the scheduler switch_mm. The membarrier_execve() (now membarrier_exec_mmap) hook now needs to clear the runqueue's membarrier state in addition to clear the mm membarrier state, so move its implementation into the scheduler membarrier code so it can access the runqueue structure. Add memory barrier in membarrier_exec_mmap() prior to clearing the membarrier state, ensuring memory accesses executed prior to exec are not reordered with the stores clearing the membarrier state. As suggested by Linus, move all membarrier.c RCU read-side locks outside of the for each cpu loops. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
5311a98fef |
tasks, sched/core: RCUify the assignment of rq->curr
The current task on the runqueue is currently read with rcu_dereference(). To obtain ordinary RCU semantics for an rcu_dereference() of rq->curr it needs to be paired with rcu_assign_pointer() of rq->curr. Which provides the memory barrier necessary to order assignments to the task_struct and the assignment to rq->curr. Unfortunately the assignment of rq->curr in __schedule is a hot path, and it has already been show that additional barriers in that code will reduce the performance of the scheduler. So I will attempt to describe below why you can effectively have ordinary RCU semantics without any additional barriers. The assignment of rq->curr in init_idle is a slow path called once per cpu and that can use rcu_assign_pointer() without any concerns. As I write this there are effectively two users of rcu_dereference() on rq->curr. There is the membarrier code in kernel/sched/membarrier.c that only looks at "->mm" after the rcu_dereference(). Then there is task_numa_compare() in kernel/sched/fair.c. My best reading of the code shows that task_numa_compare only access: "->flags", "->cpus_ptr", "->numa_group", "->numa_faults[]", "->total_numa_faults", and "->se.cfs_rq". The code in __schedule() essentially does: rq_lock(...); smp_mb__after_spinlock(); next = pick_next_task(...); rq->curr = next; context_switch(prev, next); At the start of the function the rq_lock/smp_mb__after_spinlock pair provides a full memory barrier. Further there is a full memory barrier in context_switch(). This means that any task that has already run and modified itself (the common case) has already seen two memory barriers before __schedule() runs and begins executing. A task that modifies itself then sees a third full memory barrier pair with the rq_lock(); For a brand new task that is enqueued with wake_up_new_task() there are the memory barriers present from the taking and release the pi_lock and the rq_lock as the processes is enqueued as well as the full memory barrier at the start of __schedule() assuming __schedule() happens on the same cpu. This means that by the time we reach the assignment of rq->curr except for values on the task struct modified in pick_next_task the code has the same guarantees as if it used rcu_assign_pointer(). Reading through all of the implementations of pick_next_task it appears pick_next_task is limited to modifying the task_struct fields "->se", "->rt", "->dl". These fields are the sched_entity structures of the varies schedulers. Further "->se.cfs_rq" is only changed in cgroup attach/move operations initialized by userspace. Unless I have missed something this means that in practice that the users of "rcu_dereference(rq->curr)" get normal RCU semantics of rcu_dereference() for the fields the care about, despite the assignment of rq->curr in __schedule() ot using rcu_assign_pointer. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
0ff7b2cfba |
tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
In the ordinary case today the RCU grace period for a task_struct is triggered when another process wait's for it's zombine and causes the kernel to call release_task(). As the waiting task has to receive a signal and then act upon it before this happens, typically this will occur after the original task as been removed from the runqueue. Unfortunaty in some cases such as self reaping tasks it can be shown that release_task() will be called starting the grace period for task_struct long before the task leaves the runqueue. Therefore use put_task_struct_rcu_user() in finish_task_switch() to guarantee that the there is a RCU lifetime after the task leaves the runqueue. Besides the change in the start of the RCU grace period for the task_struct this change may cause perf_event_delayed_put and trace_sched_process_free. The function perf_event_delayed_put boils down to just a WARN_ON for cases that I assume never show happen. So I don't see any problem with delaying it. The function trace_sched_process_free is a trace point and thus visible to user space. Occassionally userspace has the strangest dependencies so this has a miniscule chance of causing a regression. This change only changes the timing of when the tracepoint is called. The change in timing arguably gives userspace a more accurate picture of what is going on. So I don't expect there to be a regression. In the case where a task self reaps we are pretty much guaranteed that the RCU grace period is delayed. So we should get quite a bit of coverage in of this worst case for the change in a normal threaded workload. So I expect any issues to turn up quickly or not at all. I have lightly tested this change and everything appears to work fine. Inspired-by: Linus Torvalds <torvalds@linux-foundation.org> Inspired-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/87r24jdpl5.fsf_-_@x220.int.ebiederm.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
84da111de0 |
hmm related patches for 5.4
This is more cleanup and consolidation of the hmm APIs and the very
strongly related mmu_notifier interfaces. Many places across the tree
using these interfaces are touched in the process. Beyond that a cleanup
to the page walker API and a few memremap related changes round out the
series:
- General improvement of hmm_range_fault() and related APIs, more
documentation, bug fixes from testing, API simplification &
consolidation, and unused API removal
- Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE, and
make them internal kconfig selects
- Hoist a lot of code related to mmu notifier attachment out of drivers by
using a refcount get/put attachment idiom and remove the convoluted
mmu_notifier_unregister_no_release() and related APIs.
- General API improvement for the migrate_vma API and revision of its only
user in nouveau
- Annotate mmu_notifiers with lockdep and sleeping region debugging
Two series unrelated to HMM or mmu_notifiers came along due to
dependencies:
- Allow pagemap's memremap_pages family of APIs to work without providing
a struct device
- Make walk_page_range() and related use a constant structure for function
pointers
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl1/nnkACgkQOG33FX4g
mxqaRg//c6FqowV1pQlLutvAOAgMdpzfZ9eaaDKngy9RVQxz+k/MmJrdRH/p/mMA
Pq93A1XfwtraGKErHegFXGEDk4XhOustVAVFwvjyXO41dTUdoFVUkti6ftbrl/rS
6CT+X90jlvrwdRY7QBeuo7lxx7z8Qkqbk1O1kc1IOracjKfNJS+y6LTamy6weM3g
tIMHI65PkxpRzN36DV9uCN5dMwFzJ73DWHp1b0acnDIigkl6u5zp6orAJVWRjyQX
nmEd3/IOvdxaubAoAvboNS5CyVb4yS9xshWWMbH6AulKJv3Glca1Aa7QuSpBoN8v
wy4c9+umzqRgzgUJUe1xwN9P49oBNhJpgBSu8MUlgBA4IOc3rDl/Tw0b5KCFVfkH
yHkp8n6MP8VsRrzXTC6Kx0vdjIkAO8SUeylVJczAcVSyHIo6/JUJCVDeFLSTVymh
EGWJ7zX2iRhUbssJ6/izQTTQyCH3YIyZ5QtqByWuX2U7ZrfkqS3/EnBW1Q+j+gPF
Z2yW8iT6k0iENw6s8psE9czexuywa/Lttz94IyNlOQ8rJTiQqB9wLaAvg9hvUk7a
kuspL+JGIZkrL3ouCeO/VA6xnaP+Q7nR8geWBRb8zKGHmtWrb5Gwmt6t+vTnCC2l
olIDebrnnxwfBQhEJ5219W+M1pBpjiTpqK/UdBd92A4+sOOhOD0=
=FRGg
-----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull hmm updates from Jason Gunthorpe:
"This is more cleanup and consolidation of the hmm APIs and the very
strongly related mmu_notifier interfaces. Many places across the tree
using these interfaces are touched in the process. Beyond that a
cleanup to the page walker API and a few memremap related changes
round out the series:
- General improvement of hmm_range_fault() and related APIs, more
documentation, bug fixes from testing, API simplification &
consolidation, and unused API removal
- Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
and make them internal kconfig selects
- Hoist a lot of code related to mmu notifier attachment out of
drivers by using a refcount get/put attachment idiom and remove the
convoluted mmu_notifier_unregister_no_release() and related APIs.
- General API improvement for the migrate_vma API and revision of its
only user in nouveau
- Annotate mmu_notifiers with lockdep and sleeping region debugging
Two series unrelated to HMM or mmu_notifiers came along due to
dependencies:
- Allow pagemap's memremap_pages family of APIs to work without
providing a struct device
- Make walk_page_range() and related use a constant structure for
function pointers"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
libnvdimm: Enable unit test infrastructure compile checks
mm, notifier: Catch sleeping/blocking for !blockable
kernel.h: Add non_block_start/end()
drm/radeon: guard against calling an unpaired radeon_mn_unregister()
csky: add missing brackets in a macro for tlb.h
pagewalk: use lockdep_assert_held for locking validation
pagewalk: separate function pointers from iterator data
mm: split out a new pagewalk.h header from mm.h
mm/mmu_notifiers: annotate with might_sleep()
mm/mmu_notifiers: prime lockdep
mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
mm/hmm: hmm_range_fault() infinite loop
mm/hmm: hmm_range_fault() NULL pointer bug
mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
mm/mmu_notifiers: remove unregister_no_release
RDMA/odp: remove ib_ucontext from ib_umem
RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
RDMA/mlx5: Use ib_umem_start instead of umem.address
...
|
||
|
|
7f2444d38f |
Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core timer updates from Thomas Gleixner:
"Timers and timekeeping updates:
- A large overhaul of the posix CPU timer code which is a preparation
for moving the CPU timer expiry out into task work so it can be
properly accounted on the task/process.
An update to the bogus permission checks will come later during the
merge window as feedback was not complete before heading of for
travel.
- Switch the timerqueue code to use cached rbtrees and get rid of the
homebrewn caching of the leftmost node.
- Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
single function
- Implement the separation of hrtimers to be forced to expire in hard
interrupt context even when PREEMPT_RT is enabled and mark the
affected timers accordingly.
- Implement a mechanism for hrtimers and the timer wheel to protect
RT against priority inversion and live lock issues when a (hr)timer
which should be canceled is currently executing the callback.
Instead of infinitely spinning, the task which tries to cancel the
timer blocks on a per cpu base expiry lock which is held and
released by the (hr)timer expiry code.
- Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
resulting in faster access to timekeeping functions.
- Updates to various clocksource/clockevent drivers and their device
tree bindings.
- The usual small improvements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
posix-cpu-timers: Fix permission check regression
posix-cpu-timers: Always clear head pointer on dequeue
hrtimer: Add a missing bracket and hide `migration_base' on !SMP
posix-cpu-timers: Make expiry_active check actually work correctly
posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
tick: Mark sched_timer to expire in hard interrupt context
hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
posix-cpu-timers: Utilize timerqueue for storage
posix-cpu-timers: Move state tracking to struct posix_cputimers
posix-cpu-timers: Deduplicate rlimit handling
posix-cpu-timers: Remove pointless comparisons
posix-cpu-timers: Get rid of 64bit divisions
posix-cpu-timers: Consolidate timer expiry further
posix-cpu-timers: Get rid of zero checks
rlimit: Rewrite non-sensical RLIMIT_CPU comment
posix-cpu-timers: Respect INFINITY for hard RTTIME limit
posix-cpu-timers: Switch thread group sampling to array
posix-cpu-timers: Restructure expiry array
posix-cpu-timers: Remove cputime_expires
...
|
||
|
|
7e67a85999 |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.
As perf and the scheduler is getting bigger and more complex,
document the status quo of current responsibilities and interests,
and spread the review pain^H^H^H^H fun via an increase in the Cc:
linecount generated by scripts/get_maintainer.pl. :-)
- Add another series of patches that brings the -rt (PREEMPT_RT) tree
closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
into a new CONFIG_PREEMPTION category that will allow the eventual
introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
to go though.
- Extend the CPU cgroup controller with uclamp.min and uclamp.max to
allow the finer shaping of CPU bandwidth usage.
- Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).
- Improve the behavior of high CPU count, high thread count
applications running under cpu.cfs_quota_us constraints.
- Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.
- Improve CPU isolation housekeeping CPU allocation NUMA locality.
- Fix deadline scheduler bandwidth calculations and logic when cpusets
rebuilds the topology, or when it gets deadline-throttled while it's
being offlined.
- Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
setscheduler() system calls without creating global serialization.
Add new synchronization between cpuset topology-changing events and
the deadline acceptance tests in setscheduler(), which were broken
before.
- Rework the active_mm state machine to be less confusing and more
optimal.
- Rework (simplify) the pick_next_task() slowpath.
- Improve load-balancing on AMD EPYC systems.
- ... and misc cleanups, smaller fixes and improvements - please see
the Git log for more details.
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
sched/psi: Correct overly pessimistic size calculation
sched/fair: Speed-up energy-aware wake-ups
sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
sched/uclamp: Update CPU's refcount on TG's clamp changes
sched/uclamp: Use TG's clamps to restrict TASK's clamps
sched/uclamp: Propagate system defaults to the root group
sched/uclamp: Propagate parent clamps
sched/uclamp: Extend CPU's cgroup controller
sched/topology: Improve load balancing on AMD EPYC systems
arch, ia64: Make NUMA select SMP
sched, perf: MAINTAINERS update, add submaintainers and reviewers
sched/fair: Use rq_lock/unlock in online_fair_sched_group
cpufreq: schedutil: fix equation in comment
sched: Rework pick_next_task() slow-path
sched: Allow put_prev_task() to drop rq->lock
sched/fair: Expose newidle_balance()
sched: Add task_struct pointer to sched_class::set_curr_task
sched: Rework CPU hotplug task selection
sched/{rt,deadline}: Fix set_next_task vs pick_next_task
sched: Fix kerneldoc comment for ia64_set_curr_task
...
|
||
|
|
94d18ee934 |
Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
"This cycle's RCU changes were:
- A few more RCU flavor consolidation cleanups.
- Updates to RCU's list-traversal macros improving lockdep usability.
- Forward-progress improvements for no-CBs CPUs: Avoid ignoring
incoming callbacks during grace-period waits.
- Forward-progress improvements for no-CBs CPUs: Use ->cblist
structure to take advantage of others' grace periods.
- Also added a small commit that avoids needlessly inflicting
scheduler-clock ticks on callback-offloaded CPUs.
- Forward-progress improvements for no-CBs CPUs: Reduce contention on
->nocb_lock guarding ->cblist.
- Forward-progress improvements for no-CBs CPUs: Add ->nocb_bypass
list to further reduce contention on ->nocb_lock guarding ->cblist.
- Miscellaneous fixes.
- Torture-test updates.
- minor LKMM updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (86 commits)
MAINTAINERS: Update from paulmck@linux.ibm.com to paulmck@kernel.org
rcu: Don't include <linux/ktime.h> in rcutiny.h
rcu: Allow rcu_do_batch() to dynamically adjust batch sizes
rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload
rcu/nocb: Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention
rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention
rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks()
rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake()
rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed
rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended
rcu/nocb: Add bypass callback queueing
rcu/nocb: Atomic ->len field in rcu_segcblist structure
rcu/nocb: Unconditionally advance and wake for excessive CBs
rcu/nocb: Reduce ->nocb_lock contention with separate ->nocb_gp_lock
rcu/nocb: Reduce contention at no-CBs invocation-done time
rcu/nocb: Reduce contention at no-CBs registry-time CB advancement
rcu/nocb: Round down for number of no-CBs grace-period kthreads
rcu/nocb: Avoid ->nocb_lock capture by corresponding CPU
rcu/nocb: Avoid needless wakeups of no-CBs grace-period kthread
rcu/nocb: Make __call_rcu_nocb_wake() safe for many callbacks
...
|
||
|
|
563c4f85f9 |
Merge branch 'sched/rt' into sched/core, to pick up -rt changes
Pick up the first couple of patches working towards PREEMPT_RT. Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
312364f353 |
kernel.h: Add non_block_start/end()
In some special cases we must not block, but there's not a spinlock, preempt-off, irqs-off or similar critical section already that arms the might_sleep() debug checks. Add a non_block_start/end() pair to annotate these. This will be used in the oom paths of mmu-notifiers, where blocking is not allowed to make sure there's forward progress. Quoting Michal: "The notifier is called from quite a restricted context - oom_reaper - which shouldn't depend on any locks or sleepable conditionals. The code should be swift as well but we mostly do care about it to make a forward progress. Checking for sleepable context is the best thing we could come up with that would describe these demands at least partially." Peter also asked whether we want to catch spinlocks on top, but Michal said those are less of a problem because spinlocks can't have an indirect dependency upon the page allocator and hence close the loop with the oom reaper. Suggested by Michal Hocko. Link: https://lore.kernel.org/r/20190826201425.17547-4-daniel.vetter@ffwll.ch Acked-by: Christian König <christian.koenig@amd.com> (v1) Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> |
||
|
|
1251201c0d |
sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code
Thadeu Lima de Souza Cascardo reported that 'chrt' broke on recent kernels: $ chrt -p $$ chrt: failed to get pid 26306's policy: Argument list too long and he has root-caused the bug to the following commit increasing sched_attr size and breaking sched_read_attr() into returning -EFBIG: |
||
|
|
0413d7f33e |
sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
The supported clamp indexes are defined in 'enum clamp_id', however, because of the code logic in some of the first utilization clamping series version, sometimes we needed to use 'unsigned int' to represent indices. This is not more required since the final version of the uclamp_* APIs can always use the proper enum uclamp_id type. Fix it with a bulk rename now that we have all the bits merged. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutny <mkoutny@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alessio Balsini <balsini@android.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: https://lkml.kernel.org/r/20190822132811.31294-7-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
babbe170e0 |
sched/uclamp: Update CPU's refcount on TG's clamp changes
On updates of task group (TG) clamp values, ensure that these new values are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE tasks are immediately boosted and/or capped as requested. Do that each time we update effective clamps from cpu_util_update_eff(). Use the *cgroup_subsys_state (css) to walk the list of tasks in each affected TG and update their RUNNABLE tasks. Update each task by using the same mechanism used for cpu affinity masks updates, i.e. by taking the rq lock. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutny <mkoutny@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alessio Balsini <balsini@android.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: https://lkml.kernel.org/r/20190822132811.31294-6-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
3eac870a32 |
sched/uclamp: Use TG's clamps to restrict TASK's clamps
When a task specific clamp value is configured via sched_setattr(2), this
value is accounted in the corresponding clamp bucket every time the task is
{en,de}qeued. However, when cgroups are also in use, the task specific
clamp values could be restricted by the task_group (TG) clamp values.
Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
task is enqueued, it's accounted in the clamp bucket tracking the smaller
clamp between the task specific value and its TG effective value. This
allows to:
1. ensure cgroup clamps are always used to restrict task specific requests,
i.e. boosted not more than its TG effective protection and capped at
least as its TG effective limit.
2. implement a "nice-like" policy, where tasks are still allowed to request
less than what enforced by their TG effective limits and protections
Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.
Apply task group clamp restrictions only to tasks belonging to a child
group. While, for tasks in the root group or in an autogroup, system
defaults are still enforced.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
7274a5c1bb |
sched/uclamp: Propagate system defaults to the root group
The clamp values are not tunable at the level of the root task group. That's for two main reasons: - the root group represents "system resources" which are always entirely available from the cgroup standpoint. - when tuning/restricting "system resources" makes sense, tuning must be done using a system wide API which should also be available when control groups are not. When a system wide restriction is available, cgroups should be aware of its value in order to know exactly how much "system resources" are available for the subgroups. Utilization clamping supports already the concepts of: - system defaults: which define the maximum possible clamp values usable by tasks. - effective clamps: which allows a parent cgroup to constraint (maybe temporarily) its descendants without losing the information related to the values "requested" from them. Exploit these two concepts and bind them together in such a way that, whenever system default are tuned, the new values are propagated to (possibly) restrict or relax the "effective" value of nested cgroups. When cgroups are in use, force an update of all the RUNNABLE tasks. Otherwise, keep things simple and do just a lazy update next time each task will be enqueued. Do that since we assume a more strict resource control is required when cgroups are in use. This allows also to keep "effective" clamp values updated in case we need to expose them to user-space. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutny <mkoutny@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alessio Balsini <balsini@android.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: https://lkml.kernel.org/r/20190822132811.31294-4-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
0b60ba2dd3 |
sched/uclamp: Propagate parent clamps
In order to properly support hierarchical resources control, the cgroup delegation model requires that attribute writes from a child group never fail but still are locally consistent and constrained based on parent's assigned resources. This requires to properly propagate and aggregate parent attributes down to its descendants. Implement this mechanism by adding a new "effective" clamp value for each task group. The effective clamp value is defined as the smaller value between the clamp value of a group and the effective clamp value of its parent. This is the actual clamp value enforced on tasks in a task group. Since it's possible for a cpu.uclamp.min value to be bigger than the cpu.uclamp.max value, ensure local consistency by restricting each "protection" (i.e. min utilization) with the corresponding "limit" (i.e. max utilization). Do that at effective clamps propagation to ensure all user-space write never fails while still always tracking the most restrictive values. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutny <mkoutny@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alessio Balsini <balsini@android.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: https://lkml.kernel.org/r/20190822132811.31294-3-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
2480c09313 |
sched/uclamp: Extend CPU's cgroup controller
The cgroup CPU bandwidth controller allows to assign a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.
With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.
Giving the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better defined the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.
Extend the CPU controller with a couple of new attributes uclamp.{min,max}
which allow to enforce utilization boosting and capping for all the
tasks in a group.
Specifically:
- uclamp.min: defines the minimum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run at least at a
minimum frequency which corresponds to the uclamp.min
utilization
- uclamp.max: defines the maximum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run up to a
maximum frequency which corresponds to the uclamp.max
utilization
These attributes:
a) are available only for non-root nodes, both on default and legacy
hierarchies, while system wide clamps are defined by a generic
interface which does not depends on cgroups. This system wide
interface enforces constraints on tasks in the root node.
b) enforce effective constraints at each level of the hierarchy which
are a restriction of the group requests considering its parent's
effective constraints. Root group effective constraints are defined
by the system wide interface.
This mechanism allows each (non-root) level of the hierarchy to:
- request whatever clamp values it would like to get
- effectively get only up to the maximum amount allowed by its parent
c) have higher priority than task-specific clamps, defined via
sched_setattr(), thus allowing to control and restrict task requests.
Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.
Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
relate updates.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
b0fdc01354 |
sched/core: Schedule new worker even if PI-blocked
If a task is PI-blocked (blocking on sleeping spinlock) then we don't want to schedule a new kworker if we schedule out due to lock contention because !RT does not do that as well. A spinning spinlock disables preemption and a worker does not schedule out on lock contention (but spin). On RT the RW-semaphore implementation uses an rtmutex so tsk_is_pi_blocked() will return true if a task blocks on it. In this case we will now start a new worker which may deadlock if one worker is waiting on progress from another worker. Since a RW-semaphore starts a new worker on !RT, we should do the same on RT. XFS is able to trigger this deadlock. Allow to schedule new worker if the current worker is PI-blocked. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20190816160626.12742-1-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
67692435c4 |
sched: Rework pick_next_task() slow-path
Avoid the RETRY_TASK case in the pick_next_task() slow path. By doing the put_prev_task() early, we get the rt/deadline pull done, and by testing rq->nr_running we know if we need newidle_balance(). This then gives a stable state to pick a task from. Since the fast-path is fair only; it means the other classes will always have pick_next_task(.prev=NULL, .rf=NULL) and we can simplify. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/aa34d24b36547139248f32a30138791ac6c02bd6.1559129225.git.vpillai@digitalocean.com |
||
|
|
5f2a45fc9e |
sched: Allow put_prev_task() to drop rq->lock
Currently the pick_next_task() loop is convoluted and ugly because of how it can drop the rq->lock and needs to restart the picking. For the RT/Deadline classes, it is put_prev_task() where we do balancing, and we could do this before the picking loop. Make this possible. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/e4519f6850477ab7f3d257062796e6425ee4ba7c.1559129225.git.vpillai@digitalocean.com |
||
|
|
03b7fad167 |
sched: Add task_struct pointer to sched_class::set_curr_task
In preparation of further separating pick_next_task() and set_curr_task() we have to pass the actual task into it, while there, rename the thing to better pair with put_prev_task(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/a96d1bcdd716db4a4c5da2fece647a1456c0ed78.1559129225.git.vpillai@digitalocean.com |
||
|
|
10e7071b2f |
sched: Rework CPU hotplug task selection
The CPU hotplug task selection is the only place where we used put_prev_task() on a task that is not current. While looking at that, it occured to me that we can simplify all that by by using a custom pick loop. Since we don't need to put current, we can do away with the fake task too. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> |
||
|
|
5feeb7837a |
sched: Fix kerneldoc comment for ia64_set_curr_task
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/fde3a65ea3091ec6b84dac3c19639f85f452c5d1.1559129225.git.vpillai@digitalocean.com |
||
|
|
139d025cda |
sched: Clean up active_mm reference counting
The current active_mm reference counting is confusing and sub-optimal.
Rewrite the code to explicitly consider the 4 separate cases:
user -> user
When switching between two user tasks, all we need to consider
is switch_mm().
user -> kernel
When switching from a user task to a kernel task (which
doesn't have an associated mm) we retain the last mm in our
active_mm. Increment a reference count on active_mm.
kernel -> kernel
When switching between kernel threads, all we need to do is
pass along the active_mm reference.
kernel -> user
When switching between a kernel and user task, we must switch
from the last active_mm to the next mm, hoping of course that
these are the same. Decrement a reference on the active_mm.
The code keeps a different order, because as you'll note, both 'to
user' cases require switch_mm().
And where the old code would increment/decrement for the 'kernel ->
kernel' case, the new code observes this is a neutral operation and
avoids touching the reference count.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: luto@kernel.org
|
||
|
|
b55bd58555 |
time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
The TASKS03 and TREE04 rcutorture scenarios produce the following lockdep complaint: ------------------------------------------------------------------------ ================================ WARNING: inconsistent lock state 5.2.0-rc1+ #513 Not tainted -------------------------------- inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes: (____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70 {IN-HARDIRQ-W} state was registered at: lock_acquire+0xb0/0x1c0 _raw_spin_lock_irqsave+0x3c/0x50 tick_broadcast_switch_to_oneshot+0xd/0x40 tick_switch_to_oneshot+0x4f/0xd0 hrtimer_run_queues+0xf3/0x130 run_local_timers+0x1c/0x50 update_process_times+0x1c/0x50 tick_periodic+0x26/0xc0 tick_handle_periodic+0x1a/0x60 smp_apic_timer_interrupt+0x80/0x2a0 apic_timer_interrupt+0xf/0x20 _raw_spin_unlock_irqrestore+0x4e/0x60 rcu_nocb_gp_kthread+0x15d/0x590 kthread+0xf3/0x130 ret_from_fork+0x3a/0x50 irq event stamp: 171 hardirqs last enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c softirqs last enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0 softirqs last disabled at (0): [<0000000000000000>] 0x0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(tick_broadcast_lock); <Interrupt> lock(tick_broadcast_lock); *** DEADLOCK *** 1 lock held by migration/1/14: #0: (____ptrval____) (clockevents_lock){+.+.}, at: tick_offline_cpu+0xf/0x30 stack backtrace: CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.2.0-rc1+ #513 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x5e/0x8b print_usage_bug+0x1fc/0x216 ? print_shortest_lock_dependencies+0x1b0/0x1b0 mark_lock+0x1f2/0x280 __lock_acquire+0x1e0/0x18f0 ? __lock_acquire+0x21b/0x18f0 ? _raw_spin_unlock_irqrestore+0x4e/0x60 lock_acquire+0xb0/0x1c0 ? tick_broadcast_offline+0xf/0x70 _raw_spin_lock+0x33/0x40 ? tick_broadcast_offline+0xf/0x70 tick_broadcast_offline+0xf/0x70 tick_offline_cpu+0x16/0x30 take_cpu_down+0x7d/0xa0 multi_cpu_stop+0xa2/0xe0 ? cpu_stop_queue_work+0xc0/0xc0 cpu_stopper_thread+0x6d/0x100 smpboot_thread_fn+0x169/0x240 kthread+0xf3/0x130 ? sort_range+0x20/0x20 ? kthread_cancel_delayed_work_sync+0x10/0x10 ret_from_fork+0x3a/0x50 ------------------------------------------------------------------------ To reproduce, run the following rcutorture test: tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04" It turns out that tick_broadcast_offline() was an innocent bystander. After all, interrupts are supposed to be disabled throughout take_cpu_down(), and therefore should have been disabled upon entry to tick_offline_cpu() and thus to tick_broadcast_offline(). This suggests that one of the CPU-hotplug notifiers was incorrectly enabling interrupts, and leaving them enabled on return. Some debugging code showed that the culprit was sched_cpu_dying(). It had irqs enabled after return from sched_tick_stop(). Which in turn had irqs enabled after return from cancel_delayed_work_sync(). Which is a wrapper around __cancel_work_timer(). Which can sleep in the case where something else is concurrently trying to cancel the same delayed work, and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad idea when you are invoked from take_cpu_down(), regardless of the state you leave interrupts in upon return. Code inspection located no reason why the delayed work absolutely needed to be canceled from sched_tick_stop(): The work is not bound to the outgoing CPU by design, given that the whole point is to collect statistics without disturbing the outgoing CPU. This commit therefore simply drops the cancel_delayed_work_sync() from sched_tick_stop(). Instead, a new ->state field is added to the tick_work structure so that the delayed-work handler function sched_tick_remote() can avoid reposting itself. A cpu_is_offline() check is also added to sched_tick_remote() to avoid mucking with the state of an offlined CPU (though it does appear safe to do so). The sched_tick_start() and sched_tick_stop() functions also update ->state, and sched_tick_start() also schedules the delayed work if ->state indicates that it is not already in flight. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> [ paulmck: Apply Peter Zijlstra and Frederic Weisbecker atomics feedback. ] Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> |
||
|
|
d5096aa65a |
sched: Mark hrtimers to expire in hard interrupt context
The scheduler related hrtimers need to expire in hard interrupt context even on PREEMPT_RT enabled kernels. Mark then as such. No functional change. [ tglx: Split out from larger combo patch. Add changelog. ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190726185753.077004842@linutronix.de |
||
|
|
c1a280b68d |
sched/preempt: Use CONFIG_PREEMPTION where appropriate
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same functionality which today depends on CONFIG_PREEMPT. Switch the preemption code, scheduler and init task over to use CONFIG_PREEMPTION. That's the first step towards RT in that area. The more complex changes are coming separately. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20190726212124.117528401@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
a1dc0446d6 |
sched/core: Silence a warning in sched_init()
Compiling a kernel with both FAIR_GROUP_SCHED=n and RT_GROUP_SCHED=n will generate a compiler warning: kernel/sched/core.c: In function 'sched_init': kernel/sched/core.c:5906:32: warning: variable 'ptr' set but not used It is unnecessary to have both "alloc_size" and "ptr", so just combine them. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: valentin.schneider@arm.com Link: https://lkml.kernel.org/r/20190720012319.884-1-cai@lca.pw Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
a07db5c086 |
sched/core: Fix CPU controller for !RT_GROUP_SCHED
On !CONFIG_RT_GROUP_SCHED configurations it is currently not possible to move RT tasks between cgroups to which CPU controller has been attached; but it is oddly possible to first move tasks around and then make them RT (setschedule to FIFO/RR). E.g.: # mkdir /sys/fs/cgroup/cpu,cpuacct/group1 # chrt -fp 10 $$ # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks bash: echo: write error: Invalid argument # chrt -op 0 $$ # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks # chrt -fp 10 $$ # cat /sys/fs/cgroup/cpu,cpuacct/group1/tasks 2345 2598 # chrt -p 2345 pid 2345's current scheduling policy: SCHED_FIFO pid 2345's current scheduling priority: 10 Also, as Michal noted, it is currently not possible to enable CPU controller on unified hierarchy with !CONFIG_RT_GROUP_SCHED (if there are any kernel RT threads in root cgroup, they can't be migrated to the newly created CPU controller's root in cgroup_update_dfl_csses()). Existing code comes with a comment saying the "we don't support RT-tasks being in separate groups". Such comment is however stale and belongs to pre-RT_GROUP_SCHED times. Also, it doesn't make much sense for !RT_GROUP_ SCHED configurations, since checks related to RT bandwidth are not performed at all in these cases. Make moving RT tasks between CPU controller groups viable by removing special case check for RT (and DEADLINE) tasks. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: lizefan@huawei.com Cc: longman@redhat.com Cc: luca.abeni@santannapisa.it Cc: rostedt@goodmis.org Link: https://lkml.kernel.org/r/20190719063455.27328-1-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
710da3c8ea |
sched/core: Prevent race condition between cpuset and __sched_setscheduler()
No synchronisation mechanism exists between the cpuset subsystem and calls to function __sched_setscheduler(). As such, it is possible that new root domains are created on the cpuset side while a deadline acceptance test is carried out in __sched_setscheduler(), leading to a potential oversell of CPU bandwidth. Grab cpuset_rwsem read lock from core scheduler, so to prevent situations such as the one described above from happening. The only exception is normalize_rt_tasks() which needs to work under tasklist_lock and can't therefore grab cpuset_rwsem. We are fine with this, as this function is only called by sysrq and, if that gets triggered, DEADLINE guarantees are already gone out of the window anyway. Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: lizefan@huawei.com Cc: longman@redhat.com Cc: luca.abeni@santannapisa.it Cc: mathieu.poirier@linaro.org Cc: rostedt@goodmis.org Cc: tj@kernel.org Cc: tommaso.cucinotta@santannapisa.it Link: https://lkml.kernel.org/r/20190719140000.31694-9-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
4b211f2b12 |
sched/core: Streamle calls to task_rq_unlock()
Calls to task_rq_unlock() are done several times in the __sched_setscheduler() function. This is fine when only the rq lock needs to be handled but not so much when other locks come into play. This patch streamlines the release of the rq lock so that only one location need to be modified when dealing with more than one lock. No change of functionality is introduced by this patch. Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: lizefan@huawei.com Cc: longman@redhat.com Cc: luca.abeni@santannapisa.it Cc: tommaso.cucinotta@santannapisa.it Link: https://lkml.kernel.org/r/20190719140000.31694-3-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
84ec3a0787 |
time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint The TASKS03 and TREE04 rcutorture scenarios produce the following lockdep complaint: WARNING: inconsistent lock state 5.2.0-rc1+ #513 Not tainted -------------------------------- inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes: (____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70 {IN-HARDIRQ-W} state was registered at: lock_acquire+0xb0/0x1c0 _raw_spin_lock_irqsave+0x3c/0x50 tick_broadcast_switch_to_oneshot+0xd/0x40 tick_switch_to_oneshot+0x4f/0xd0 hrtimer_run_queues+0xf3/0x130 run_local_timers+0x1c/0x50 update_process_times+0x1c/0x50 tick_periodic+0x26/0xc0 tick_handle_periodic+0x1a/0x60 smp_apic_timer_interrupt+0x80/0x2a0 apic_timer_interrupt+0xf/0x20 _raw_spin_unlock_irqrestore+0x4e/0x60 rcu_nocb_gp_kthread+0x15d/0x590 kthread+0xf3/0x130 ret_from_fork+0x3a/0x50 irq event stamp: 171 hardirqs last enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c softirqs last enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0 softirqs last disabled at (0): [<0000000000000000>] 0x0 [...] To reproduce, run the following rcutorture test: $ tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04" It turns out that tick_broadcast_offline() was an innocent bystander. After all, interrupts are supposed to be disabled throughout take_cpu_down(), and therefore should have been disabled upon entry to tick_offline_cpu() and thus to tick_broadcast_offline(). This suggests that one of the CPU-hotplug notifiers was incorrectly enabling interrupts, and leaving them enabled on return. Some debugging code showed that the culprit was sched_cpu_dying(). It had irqs enabled after return from sched_tick_stop(). Which in turn had irqs enabled after return from cancel_delayed_work_sync(). Which is a wrapper around __cancel_work_timer(). Which can sleep in the case where something else is concurrently trying to cancel the same delayed work, and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad idea when you are invoked from take_cpu_down(), regardless of the state you leave interrupts in upon return. Code inspection located no reason why the delayed work absolutely needed to be canceled from sched_tick_stop(): The work is not bound to the outgoing CPU by design, given that the whole point is to collect statistics without disturbing the outgoing CPU. This commit therefore simply drops the cancel_delayed_work_sync() from sched_tick_stop(). Instead, a new ->state field is added to the tick_work structure so that the delayed-work handler function sched_tick_remote() can avoid reposting itself. A cpu_is_offline() check is also added to sched_tick_remote() to avoid mucking with the state of an offlined CPU (though it does appear safe to do so). The sched_tick_start() and sched_tick_stop() functions also update ->state, and sched_tick_start() also schedules the delayed work if ->state indicates that it is not already in flight. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> [ paulmck: Apply Peter Zijlstra and Frederic Weisbecker atomics feedback. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190625165238.GJ26519@linux.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
e3d85487fb |
sched/core: Fix preempt warning in ttwu
John reported a DEBUG_PREEMPT warning caused by commit: |
||
|
|
9d20ad7dfc |
sched/uclamp: Add uclamp_util_with()
So far uclamp_util() allows to clamp a specified utilization considering the clamp values requested by RUNNABLE tasks in a CPU. For the Energy Aware Scheduler (EAS) it is interesting to test how clamp values will change when a task is becoming RUNNABLE on a given CPU. For example, EAS is interested in comparing the energy impact of different scheduling decisions and the clamp values can play a role on that. Add uclamp_util_with() which allows to clamp a given utilization by considering the possible impact on CPU clamp values of a specified task. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alessio Balsini <balsini@android.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: https://lkml.kernel.org/r/20190621084217.8167-11-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
1a00d99997 |
sched/uclamp: Set default clamps for RT tasks
By default FAIR tasks start without clamps, i.e. neither boosted nor
capped, and they run at the best frequency matching their utilization
demand. This default behavior does not fit RT tasks which instead are
expected to run at the maximum available frequency, if not otherwise
required by explicitly capping them.
Enforce the correct behavior for RT tasks by setting util_min to max
whenever:
1. the task is switched to the RT class and it does not already have a
user-defined clamp value assigned.
2. an RT task is forked from a parent with RESET_ON_FORK set.
NOTE: utilization clamp values are cross scheduling class attributes and
thus they are never changed/reset once a value has been explicitly
defined from user-space.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-9-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
a87498ace5 |
sched/uclamp: Reset uclamp values on RESET_ON_FORK
A forked tasks gets the same clamp values of its parent however, when
the RESET_ON_FORK flag is set on parent, e.g. via:
sys_sched_setattr()
sched_setattr()
__sched_setscheduler(attr::SCHED_FLAG_RESET_ON_FORK)
the new forked task is expected to start with all attributes reset to
default values.
Do that for utilization clamp values too by checking the reset request
from the existing uclamp_fork() call which already provides the required
initialization for other uclamp related bits.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-8-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
a509a7cd79 |
sched/uclamp: Extend sched_setattr() to support utilization clamping
The SCHED_DEADLINE scheduling class provides an advanced and formal model to define tasks requirements that can translate into proper decisions for both task placements and frequencies selections. Other classes have a more simplified model based on the POSIX concept of priorities. Such a simple priority based model however does not allow to exploit most advanced features of the Linux scheduler like, for example, driving frequencies selection via the schedutil cpufreq governor. However, also for non SCHED_DEADLINE tasks, it's still interesting to define tasks properties to support scheduler decisions. Utilization clamping exposes to user-space a new set of per-task attributes the scheduler can use as hints about the expected/required utilization for a task. This allows to implement a "proactive" per-task frequency control policy, a more advanced policy than the current one based just on "passive" measured task utilization. For example, it's possible to boost interactive tasks (e.g. to get better performance) or cap background tasks (e.g. to be more energy/thermal efficient). Introduce a new API to set utilization clamping values for a specified task by extending sched_setattr(), a syscall which already allows to define task specific properties for different scheduling classes. A new pair of attributes allows to specify a minimum and maximum utilization the scheduler can consider for a task. Do that by validating the required clamp values before and then applying the required changes using _the_ same pattern already in use for __setscheduler(). This ensures that the task is re-enqueued with the new clamp values. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alessio Balsini <balsini@android.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: https://lkml.kernel.org/r/20190621084217.8167-7-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
1d6362fa0c |
sched/core: Allow sched_setattr() to use the current policy
The sched_setattr() syscall mandates that a policy is always specified. This requires to always know which policy a task will have when attributes are configured and this makes it impossible to add more generic task attributes valid across different scheduling policies. Reading the policy before setting generic tasks attributes is racy since we cannot be sure it is not changed concurrently. Introduce the required support to change generic task attributes without affecting the current task policy. This is done by adding an attribute flag (SCHED_FLAG_KEEP_POLICY) to enforce the usage of the current policy. Add support for the SETPARAM_POLICY policy, which is already used by the sched_setparam() POSIX syscall, to the sched_setattr() non-POSIX syscall. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alessio Balsini <balsini@android.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: https://lkml.kernel.org/r/20190621084217.8167-6-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
e8f14172c6 |
sched/uclamp: Add system default clamps
Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.
Tasks with a user-defined clamp value are allowed to request any value
in that range, and the required clamp is unconditionally enforced.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.
Add a privileged interface to define a system default configuration via:
/proc/sys/kernel/sched_uclamp_util_{min,max}
which works as an unconditional clamp range restriction for all tasks.
With the default configuration, the full SCHED_CAPACITY_SCALE range of
values is allowed for each clamp index. Otherwise, the task-specific
clamp is capped by the corresponding system default value.
Do that by tracking, for each task, the "effective" clamp value and
bucket the task has been refcounted in at enqueue time. This
allows to lazy aggregate "requested" and "system default" values at
enqueue time and simplifies refcounting updates at dequeue time.
The cached bucket ids are used to avoid (relatively) more expensive
integer divisions every time a task is enqueued.
An active flag is used to report when the "effective" value is valid and
thus the task is actually refcounted in the corresponding rq's bucket.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
e496187da7 |
sched/uclamp: Enforce last task's UCLAMP_MAX
When a task sleeps it removes its max utilization clamp from its CPU.
However, the blocked utilization on that CPU can be higher than the max
clamp value enforced while the task was running. This allows undesired
CPU frequency increases while a CPU is idle, for example, when another
CPU on the same frequency domain triggers a frequency update, since
schedutil can now see the full not clamped blocked utilization of the
idle CPU.
Fix this by using:
uclamp_rq_dec_id(p, rq, UCLAMP_MAX)
uclamp_rq_max_value(rq, UCLAMP_MAX, clamp_value)
to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
condition.
Don't track any minimum utilization clamps since an idle CPU never
requires a minimum frequency. The decay of the blocked utilization is
good enough to reduce the CPU frequency.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-4-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
60daf9c194 |
sched/uclamp: Add bucket local max tracking
Because of bucketization, different task-specific clamp values are tracked in the same bucket. For example, with 20% bucket size and assuming to have: Task1: util_min=25% Task2: util_min=35% both tasks will be refcounted in the [20..39]% bucket and always boosted only up to 20% thus implementing a simple floor aggregation normally used in histograms. In systems with only few and well-defined clamp values, it would be useful to track the exact clamp value required by a task whenever possible. For example, if a system requires only 23% and 47% boost values then it's possible to track the exact boost required by each task using only 3 buckets of ~33% size each. Introduce a mechanism to max aggregate the requested clamp values of RUNNABLE tasks in the same bucket. Keep it simple by resetting the bucket value to its base value only when a bucket becomes inactive. Allow a limited and controlled overboosting margin for tasks recounted in the same bucket. In systems where the boost values are not known in advance, it is still possible to control the maximum acceptable overboosting margin by tuning the number of clamp groups. For example, 20 groups ensure a 5% maximum overboost. Remove the rq bucket initialization code since a correct bucket value is now computed when a task is refcounted into a CPU's rq. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alessio Balsini <balsini@android.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: https://lkml.kernel.org/r/20190621084217.8167-3-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
69842cba9a |
sched/uclamp: Add CPU's clamp buckets refcounting
Utilization clamping allows to clamp the CPU's utilization within a
[util_min, util_max] range, depending on the set of RUNNABLE tasks on
that CPU. Each task references two "clamp buckets" defining its minimum
and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
bucket is active if there is at least one RUNNABLE tasks enqueued on
that CPU and refcounting that bucket.
When a task is {en,de}queued {on,from} a rq, the set of active clamp
buckets on that CPU can change. If the set of active clamp buckets
changes for a CPU a new "aggregated" clamp value is computed for that
CPU. This is because each clamp bucket enforces a different utilization
clamp value.
Clamp values are always MAX aggregated for both util_min and util_max.
This ensures that no task can affect the performance of other
co-scheduled tasks which are more boosted (i.e. with higher util_min
clamp) or less capped (i.e. with higher util_max clamp).
A task has:
task_struct::uclamp[clamp_id]::bucket_id
to track the "bucket index" of the CPU's clamp bucket it refcounts while
enqueued, for each clamp index (clamp_id).
A runqueue has:
rq::uclamp[clamp_id]::bucket[bucket_id].tasks
to track how many RUNNABLE tasks on that CPU refcount each
clamp bucket (bucket_id) of a clamp index (clamp_id).
It also has a:
rq::uclamp[clamp_id]::bucket[bucket_id].value
to track the clamp value of each clamp bucket (bucket_id) of a clamp
index (clamp_id).
The rq::uclamp::bucket[clamp_id][] array is scanned every time it's
needed to find a new MAX aggregated clamp value for a clamp_id. This
operation is required only when it's dequeued the last task of a clamp
bucket tracking the current MAX aggregated clamp value. In this case,
the CPU is either entering IDLE or going to schedule a less boosted or
more clamped task.
The expected number of different clamp values configured at build time
is small enough to fit the full unordered array into a single cache
line, for configurations of up to 7 buckets.
Add to struct rq the basic data structures required to refcount the
number of RUNNABLE tasks for each clamp bucket. Add also the max
aggregation required to update the rq's clamp value at each
enqueue/dequeue event.
Use a simple linear mapping of clamp values into clamp buckets.
Pre-compute and cache bucket_id to avoid integer divisions at
enqueue/dequeue time.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
a056a5bed7 |
sched/debug: Export the newly added tracepoints
So that external modules can hook into them and extract the info they need. Since these new tracepoints have no events associated with them exporting these tracepoints make them useful for external modules to perform testing and debugging. There's no other way otherwise to access them. BPF doesn't have infrastructure to access these bare tracepoints either. Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavankumar Kondeti <pkondeti@codeaurora.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de> Link: https://lkml.kernel.org/r/20190604111459.2862-7-qais.yousef@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
aacedf26fb |
sched/core: Optimize try_to_wake_up() for local wakeups
Jens reported that significant performance can be had on some block
workloads by special casing local wakeups. That is, wakeups on the
current task before it schedules out.
Given something like the normal wait pattern:
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (cond)
break;
schedule();
}
__set_current_state(TASK_RUNNING);
Any wakeup (on this CPU) after set_current_state() and before
schedule() would benefit from this.
Normal wakeups take p->pi_lock, which serializes wakeups to the same
task. By eliding that we gain concurrency on:
- ttwu_stat(); we already had concurrency on rq stats, this now also
brings it to task stats. -ENOCARE
- tracepoints; it is now possible to get multiple instances of
trace_sched_waking() (and possibly trace_sched_wakeup()) for the
same task. Tracers will have to learn to cope.
Furthermore, p->pi_lock is used by set_special_state(), to order
against TASK_RUNNING stores from other CPUs. But since this is
strictly CPU local, we don't need the lock, and set_special_state()'s
disabling of IRQs is sufficient.
After the normal wakeup takes p->pi_lock it issues
smp_mb__after_spinlock(), in order to ensure the woken task must
observe prior stores before we observe the p->state. If this is CPU
local, this will be satisfied with a compiler barrier, and we rely on
try_to_wake_up() being a funcation call, which implies such.
Since, when 'p == current', 'p->on_rq' must be true, the normal wakeup
would continue into the ttwu_remote() branch, which normally is
concerned with exactly this wakeup scenario, except from a remote CPU.
IOW we're waking a task that is still running. In this case, we can
trivially avoid taking rq->lock, all that's left from this is to set
p->state.
This then yields an extremely simple and fast path for 'p == current'.
Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: gkohli@codeaurora.org
Cc: hch@lst.de
Cc: oleg@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
e3b929b0a1 |
sched/core: Add __sched tag for io_schedule()
Non-inline io_schedule() was introduced in: commit |
||
|
|
55627e3cd2 |
sched/core: Remove rq->cpu_load[]
The per rq load array values also disappear from the cpu#X sections in /proc/sched_debug. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rik van Riel <riel@surriel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20190527062116.11512-5-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
5e83eafbfd |
sched/fair: Remove the rq->cpu_load[] update code
With LB_BIAS disabled, there is no need to update the rq->cpu_load[idx] any more. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rik van Riel <riel@surriel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20190527062116.11512-2-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
3bd3706251 |
sched/core: Provide a pointer to the valid CPU mask
In commit:
|
||
|
|
457c899653 |
treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
e00d413575 |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Make nohz housekeeping processing more permissive and less
intrusive to isolated CPUs
- Decouple CPU-bound workqueue acconting from the scheduler and move
it into the workqueue code.
- Optimize topology building
- Better handle quota and period overflows
- Add more RCU annotations
- Comment updates, misc cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
nohz_full: Allow the boot CPU to be nohz_full
sched/isolation: Require a present CPU in housekeeping mask
kernel/cpu: Allow non-zero CPU to be primary for suspend / kexec freeze
power/suspend: Add function to disable secondaries for suspend
sched/core: Allow the remote scheduler tick to be started on CPU0
sched/nohz: Run NOHZ idle load balancer on HK_FLAG_MISC CPUs
sched/debug: Fix spelling mistake "logaritmic" -> "logarithmic"
sched/topology: Update init_sched_domains() comment
cgroup/cpuset: Update stale generate_sched_domains() comments
sched/core: Check quota and period overflow at usec to nsec conversion
sched/core: Handle overflow in cpu_shares_write_u64
sched/rt: Check integer overflow at usec to nsec conversion
sched/core: Fix typo in comment
sched/core: Make some functions static
sched/core: Unify p->on_rq updates
sched/core: Remove ttwu_activate()
sched/core, workqueues: Distangle worker accounting from rq lock
sched/fair: Remove unneeded prototype of capacity_of()
sched/topology: Skip duplicate group rewrites in build_sched_groups()
sched/topology: Fix build_sched_groups() comment
...
|
||
|
|
77a5352ba9 |
sched/core: Allow the remote scheduler tick to be started on CPU0
This has no effect yet because CPU0 will always be a housekeeping CPU until a later change. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linuxppc-dev@lists.ozlabs.org Link: https://lkml.kernel.org/r/20190411033448.20842-2-npiggin@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
1a8b4540db |
sched/core: Check quota and period overflow at usec to nsec conversion
Large values could overflow u64 and pass following sanity checks. # echo 18446744073750000 > cpu.cfs_period_us # cat cpu.cfs_period_us 40448 # echo 18446744073750000 > cpu.cfs_quota_us # cat cpu.cfs_quota_us 40448 After this patch they will fail with -EINVAL. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/155125502079.293431.3947497929372138600.stgit@buzz Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
5b61d50ab4 |
sched/core: Handle overflow in cpu_shares_write_u64
Bit shift in scale_load() could overflow shares. This patch saturates it to MAX_SHARES like following sched_group_set_shares(). Example: # echo 9223372036854776832 > cpu.shares # cat cpu.shares Before patch: 1024 After pattch: 262144 Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/155125501891.293431.3345233332801109696.stgit@buzz Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
bee9853932 |
sched/core: Fix typo in comment
Signed-off-by: Joel Savitz <jsavitz@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: trivial@kernel.org Link: http://lkml.kernel.org/r/1551921213-813-1-git-send-email-jsavitz@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
b1546edcf2 |
sched/core: Make some functions static
Fix these sparse warnings: kernel/sched/core.c:6577:11: warning: symbol 'min_cfs_quota_period' was not declared. Should it be static? kernel/sched/core.c:6657:5: warning: symbol 'tg_set_cfs_quota' was not declared. Should it be static? kernel/sched/core.c:6670:6: warning: symbol 'tg_get_cfs_quota' was not declared. Should it be static? kernel/sched/core.c:6683:5: warning: symbol 'tg_set_cfs_period' was not declared. Should it be static? kernel/sched/core.c:6693:6: warning: symbol 'tg_get_cfs_period' was not declared. Should it be static? kernel/sched/fair.c:2596:6: warning: symbol 'task_tick_numa' was not declared. Should it be static? Signed-off-by: YueHaibing <yuehaibing@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20190418144713.34332-1-yuehaibing@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
7dd7788411 |
sched/core: Unify p->on_rq updates
Almost all {,de}activate_task() invocations pair with p->on_rq
updates, the exception being the usage in rt/deadline which hold both
rq locks and therefore don't strictly need to set
TASK_ON_RQ_MIGRATING, but it is harmless if we do anyway.
Put the updates in {,de}activate_task() and cut down on repetition.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
1b174a2cb6 |
sched/core: Remove ttwu_activate()
After the removal of try_to_wake_up_local(), there is only one user of ttwu_activate() left, and since it is a trivial function, remove it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
6d25be5782 |
sched/core, workqueues: Distangle worker accounting from rq lock
The worker accounting for CPU bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself. Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. When nr_running is updated when the task is retunrning from schedule() then it is later compared when it is done from ttwu(). [ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bristot@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
6455959819 |
ia64/tlb: Eradicate tlb_migrate_finish() callback
Only ia64-sn2 uses this as an optimization, and there it is of questionable correctness due to the mm_users==1 test. Remove it entirely. No change in behavior intended. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
231c807a60 |
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
"Third more careful attempt for this set of fixes:
- Prevent a 32bit math overflow in the cpufreq code
- Fix a buffer overflow when scanning the cgroup2 cpu.max property
- A set of fixes for the NOHZ scheduler logic to prevent waking up
CPUs even if the capacity of the busy CPUs is sufficient along with
other tweaks optimizing the behaviour for asymmetric systems
(big/little)"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Skip LLC NOHZ logic for asymmetric systems
sched/fair: Tune down misfit NOHZ kicks
sched/fair: Comment some nohz_balancer_kick() kick conditions
sched/core: Fix buffer overflow in cgroup2 property cpu.max
sched/cpufreq: Fix 32-bit math overflow
|
||
|
|
4c47acd824 |
sched/core: Fix buffer overflow in cgroup2 property cpu.max
Add limit into sscanf format string for on-stack buffer.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes:
|
||
|
|
8dcd175bc3 |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: - a few misc things - ocfs2 updates - most of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (159 commits) tools/testing/selftests/proc/proc-self-syscall.c: remove duplicate include proc: more robust bulk read test proc: test /proc/*/maps, smaps, smaps_rollup, statm proc: use seq_puts() everywhere proc: read kernel cpu stat pointer once proc: remove unused argument in proc_pid_lookup() fs/proc/thread_self.c: code cleanup for proc_setup_thread_self() fs/proc/self.c: code cleanup for proc_setup_self() proc: return exit code 4 for skipped tests mm,mremap: bail out earlier in mremap_to under map pressure mm/sparse: fix a bad comparison mm/memory.c: do_fault: avoid usage of stale vm_area_struct writeback: fix inode cgroup switching comment mm/huge_memory.c: fix "orig_pud" set but not used mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC mm/memcontrol.c: fix bad line in comment mm/cma.c: cma_declare_contiguous: correct err handling mm/page_ext.c: fix an imbalance with kmemleak mm/compaction: pass pgdat to too_many_isolated() instead of zone mm: remove zone_lru_lock() function, access ->lru_lock directly ... |
||
|
|
45802da05e |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- refcount conversions
- Solve the rq->leaf_cfs_rq_list can of worms for real.
- improve power-aware scheduling
- add sysctl knob for Energy Aware Scheduling
- documentation updates
- misc other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
kthread: Do not use TIMER_IRQSAFE
kthread: Convert worker lock to raw spinlock
sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
sched/fair: Remove unused 'sd' parameter from select_idle_smt()
sched/wait: Use freezable_schedule() when possible
sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
sched/fair: Explain LLC nohz kick condition
sched/fair: Simplify nohz_balancer_kick()
sched/topology: Fix percpu data types in struct sd_data & struct s_data
sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
sched/fair: Fix O(nr_cgroups) in the load balancing path
sched/fair: Optimize update_blocked_averages()
sched/fair: Fix insertion in rq->leaf_cfs_rq_list
sched/fair: Add tmp_alone_branch assertion
sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
sched/fair: Update scale invariance of PELT
sched/fair: Move the rq_of() helper function
sched/core: Convert task_struct.stack_refcount to refcount_t
...
|
||
|
|
3478588b51 |
Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"The biggest part of this tree is the new auto-generated atomics API
wrappers by Mark Rutland.
The primary motivation was to allow instrumentation without uglifying
the primary source code.
The linecount increase comes from adding the auto-generated files to
the Git space as well:
include/asm-generic/atomic-instrumented.h | 1689 ++++++++++++++++--
include/asm-generic/atomic-long.h | 1174 ++++++++++---
include/linux/atomic-fallback.h | 2295 +++++++++++++++++++++++++
include/linux/atomic.h | 1241 +------------
I preferred this approach, so that the full call stack of the (already
complex) locking APIs is still fully visible in 'git grep'.
But if this is excessive we could certainly hide them.
There's a separate build-time mechanism to determine whether the
headers are out of date (they should never be stale if we do our job
right).
Anyway, nothing from this should be visible to regular kernel
developers.
Other changes:
- Add support for dynamic keys, which removes a source of false
positives in the workqueue code, among other things (Bart Van
Assche)
- Updates to tools/memory-model (Andrea Parri, Paul E. McKenney)
- qspinlock, wake_q and lockdep micro-optimizations (Waiman Long)
- misc other updates and enhancements"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
locking/lockdep: Shrink struct lock_class_key
locking/lockdep: Add module_param to enable consistency checks
lockdep/lib/tests: Test dynamic key registration
lockdep/lib/tests: Fix run_tests.sh
kernel/workqueue: Use dynamic lockdep keys for workqueues
locking/lockdep: Add support for dynamic keys
locking/lockdep: Verify whether lock objects are small enough to be used as class keys
locking/lockdep: Check data structure consistency
locking/lockdep: Reuse lock chains that have been freed
locking/lockdep: Fix a comment in add_chain_cache()
locking/lockdep: Introduce lockdep_next_lockchain() and lock_chain_count()
locking/lockdep: Reuse list entries that are no longer in use
locking/lockdep: Free lock classes that are no longer in use
locking/lockdep: Update two outdated comments
locking/lockdep: Make it easy to detect whether or not inside a selftest
locking/lockdep: Split lockdep_free_key_range() and lockdep_reset_lock()
locking/lockdep: Initialize the locks_before and locks_after lists earlier
locking/lockdep: Make zap_class() remove all matching lock order entries
locking/lockdep: Reorder struct lock_class members
locking/lockdep: Avoid that add_chain_cache() adds an invalid chain to the cache
...
|
||
|
|
5e1f0f098b |
mm, compaction: capture a page under direct compaction
Compaction is inherently race-prone as a suitable page freed during
compaction can be allocated by any parallel task. This patch uses a
capture_control structure to isolate a page immediately when it is freed
by a direct compactor in the slow path of the page allocator. The
intent is to avoid redundant scanning.
5.0.0-rc1 5.0.0-rc1
selective-v3r17 capture-v3r19
Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%*
Amean fault-both-3 2582.11 ( 0.00%) 2563.68 ( 0.71%)
Amean fault-both-5 4500.26 ( 0.00%) 4233.52 ( 5.93%)
Amean fault-both-7 5819.53 ( 0.00%) 6333.65 ( -8.83%)
Amean fault-both-12 9321.18 ( 0.00%) 9759.38 ( -4.70%)
Amean fault-both-18 9782.76 ( 0.00%) 10338.76 ( -5.68%)
Amean fault-both-24 15272.81 ( 0.00%) 13379.55 * 12.40%*
Amean fault-both-30 15121.34 ( 0.00%) 16158.25 ( -6.86%)
Amean fault-both-32 18466.67 ( 0.00%) 18971.21 ( -2.73%)
Latency is only moderately affected but the devil is in the details. A
closer examination indicates that base page fault latency is reduced but
latency of huge pages is increased as it takes creater care to succeed.
Part of the "problem" is that allocation success rates are close to 100%
even when under pressure and compaction gets harder
5.0.0-rc1 5.0.0-rc1
selective-v3r17 capture-v3r19
Percentage huge-3 96.70 ( 0.00%) 98.23 ( 1.58%)
Percentage huge-5 96.99 ( 0.00%) 95.30 ( -1.75%)
Percentage huge-7 94.19 ( 0.00%) 97.24 ( 3.24%)
Percentage huge-12 94.95 ( 0.00%) 97.35 ( 2.53%)
Percentage huge-18 96.74 ( 0.00%) 97.30 ( 0.58%)
Percentage huge-24 97.07 ( 0.00%) 97.55 ( 0.50%)
Percentage huge-30 95.69 ( 0.00%) 98.50 ( 2.95%)
Percentage huge-32 96.70 ( 0.00%) 99.27 ( 2.65%)
And scan rates are reduced as expected by 6% for the migration scanner
and 29% for the free scanner indicating that there is less redundant
work.
Compaction migrate scanned 20815362 19573286
Compaction free scanned 16352612 11510663
[mgorman@techsingularity.net: remove redundant check]
Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
b1b988a6a0 |
Merge branch 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull year 2038 updates from Thomas Gleixner:
"Another round of changes to make the kernel ready for 2038. After lots
of preparatory work this is the first set of syscalls which are 2038
safe:
403 clock_gettime64
404 clock_settime64
405 clock_adjtime64
406 clock_getres_time64
407 clock_nanosleep_time64
408 timer_gettime64
409 timer_settime64
410 timerfd_gettime64
411 timerfd_settime64
412 utimensat_time64
413 pselect6_time64
414 ppoll_time64
416 io_pgetevents_time64
417 recvmmsg_time64
418 mq_timedsend_time64
419 mq_timedreceiv_time64
420 semtimedop_time64
421 rt_sigtimedwait_time64
422 futex_time64
423 sched_rr_get_interval_time64
The syscall numbers are identical all over the architectures"
* 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
riscv: Use latest system call ABI
checksyscalls: fix up mq_timedreceive and stat exceptions
unicore32: Fix __ARCH_WANT_STAT64 definition
asm-generic: Make time32 syscall numbers optional
asm-generic: Drop getrlimit and setrlimit syscalls from default list
32-bit userspace ABI: introduce ARCH_32BIT_OFF_T config option
compat ABI: use non-compat openat and open_by_handle_at variants
y2038: add 64-bit time_t syscalls to all 32-bit architectures
y2038: rename old time and utime syscalls
y2038: remove struct definition redirects
y2038: use time32 syscall names on 32-bit
syscalls: remove obsolete __IGNORE_ macros
y2038: syscalls: rename y2038 compat syscalls
x86/x32: use time64 versions of sigtimedwait and recvmmsg
timex: change syscalls to use struct __kernel_timex
timex: use __kernel_timex internally
sparc64: add custom adjtimex/clock_adjtime functions
time: fix sys_timer_settime prototype
time: Add struct __kernel_timex
time: make adjtime compat handling available for 32 bit
...
|
||
|
|
568f196756 |
bpf: check that BPF programs run with preemption disabled
Introduce cant_sleep() macro for annotation of functions that cannot sleep. Use it in BPF_PROG_RUN to catch execution of BPF programs in preemptable context. Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> |
||
|
|
d0fe0b9c45 |
sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
Since commit:
|
||
|
|
c9ba7560c5 |
Linux 5.0-rc6
-----BEGIN PGP SIGNATURE----- iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru 8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1 PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38 JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr 8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR A6Ktjw== =scY4 -----END PGP SIGNATURE----- Merge tag 'v5.0-rc6' into sched/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
41ea39101d |
y2038: Add time64 system calls
This series finally gets us to the point of having system calls with 64-bit time_t on all architectures, after a long time of incremental preparation patches. There was actually one conversion that I missed during the summer, i.e. Deepa's timex series, which I now updated based the 5.0-rc1 changes and review comments. The following system calls are now added on all 32-bit architectures using the same system call numbers: 403 clock_gettime64 404 clock_settime64 405 clock_adjtime64 406 clock_getres_time64 407 clock_nanosleep_time64 408 timer_gettime64 409 timer_settime64 410 timerfd_gettime64 411 timerfd_settime64 412 utimensat_time64 413 pselect6_time64 414 ppoll_time64 416 io_pgetevents_time64 417 recvmmsg_time64 418 mq_timedsend_time64 419 mq_timedreceiv_time64 420 semtimedop_time64 421 rt_sigtimedwait_time64 422 futex_time64 423 sched_rr_get_interval_time64 Each one of these corresponds directly to an existing system call that includes a 'struct timespec' argument, or a structure containing a timespec or (in case of clock_adjtime) timeval. Not included here are new versions of getitimer/setitimer and getrusage/waitid, which are planned for the future but only needed to make a consistent API rather than for correct operation beyond y2038. These four system calls are based on 'timeval', and it has not been finally decided what the replacement kernel interface will use instead. So far, I have done a lot of build testing across most architectures, which has found a number of bugs. Runtime testing so far included testing LTP on 32-bit ARM with the existing system calls, to ensure we do not regress for existing binaries, and a test with a 32-bit x86 build of LTP against a modified version of the musl C library that has been adapted to the new system call interface [3]. This library can be used for testing on all architectures supported by musl-1.1.21, but it is not how the support is getting integrated into the official musl release. Official musl support is planned but will require more invasive changes to the library. Link: https://lore.kernel.org/lkml/20190110162435.309262-1-arnd@arndb.de/T/ Link: https://lore.kernel.org/lkml/20190118161835.2259170-1-arnd@arndb.de/ Link: https://git.linaro.org/people/arnd/musl-y2038.git/ [2] Signed-off-by: Arnd Bergmann <arnd@arndb.de> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJcXf7/AAoJEGCrR//JCVInPSUP/RhsQSCKMGtONB/vVICQhwep PybhzBSpHWFxszzTi6BEPN1zS9B069G9mDollRBYZCckyPqL/Bv6sI/vzQZdNk01 Q6Nw92OnNE1QP8owZ5TjrZhpbtopWdqIXjsbGZlloUemvuJP2JwvKovQUcn5CPTQ jbnqU04CVyFFJYVxAnGJ+VSeWNrjW/cm/m+rhLFjUcwW7Y3aodxsPqPP6+K9hY9P yIWfcH42WBeEWGm1RSBOZOScQl4SGCPUAhFydl/TqyEQagyegJMIyMOv9wZ5AuTT xK644bDVmNsrtJDZDpx+J8hytXCk1LrnKzkHR/uK80iUIraF/8D7PlaPgTmEEjko XcrywEkvkXTVU3owCm2/sbV+8fyFKzSPipnNfN1JNxEX71A98kvMRtPjDueQq/GA Yh81rr2YLF2sUiArkc2fNpENT7EGhrh1q6gviK3FB8YDgj1kSgPK5wC/X0uolC35 E7iC2kg4NaNEIjhKP/WKluCaTvjRbvV+0IrlJLlhLTnsqbA57ZKCCteiBrlm7wQN 4csUtCyxchR9Ac2o/lj+Mf53z68Zv74haIROp18K2dL7ZpVcOPnA3XHeauSAdoyp wy2Ek6ilNvlNB+4x+mRntPoOsyuOUGv7JXzB9JvweLWUd9G7tvYeDJQp/0YpDppb K4UWcKnhtEom0DgK08vY =IZVb -----END PGP SIGNATURE----- Merge tag 'y2038-new-syscalls' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground into timers/2038 Pull y2038 - time64 system calls from Arnd Bergmann: This series finally gets us to the point of having system calls with 64-bit time_t on all architectures, after a long time of incremental preparation patches. There was actually one conversion that I missed during the summer, i.e. Deepa's timex series, which I now updated based the 5.0-rc1 changes and review comments. The following system calls are now added on all 32-bit architectures using the same system call numbers: 403 clock_gettime64 404 clock_settime64 405 clock_adjtime64 406 clock_getres_time64 407 clock_nanosleep_time64 408 timer_gettime64 409 timer_settime64 410 timerfd_gettime64 411 timerfd_settime64 412 utimensat_time64 413 pselect6_time64 414 ppoll_time64 416 io_pgetevents_time64 417 recvmmsg_time64 418 mq_timedsend_time64 419 mq_timedreceiv_time64 420 semtimedop_time64 421 rt_sigtimedwait_time64 422 futex_time64 423 sched_rr_get_interval_time64 Each one of these corresponds directly to an existing system call that includes a 'struct timespec' argument, or a structure containing a timespec or (in case of clock_adjtime) timeval. Not included here are new versions of getitimer/setitimer and getrusage/waitid, which are planned for the future but only needed to make a consistent API rather than for correct operation beyond y2038. These four system calls are based on 'timeval', and it has not been finally decided what the replacement kernel interface will use instead. So far, I have done a lot of build testing across most architectures, which has found a number of bugs. Runtime testing so far included testing LTP on 32-bit ARM with the existing system calls, to ensure we do not regress for existing binaries, and a test with a 32-bit x86 build of LTP against a modified version of the musl C library that has been adapted to the new system call interface [3]. This library can be used for testing on all architectures supported by musl-1.1.21, but it is not how the support is getting integrated into the official musl release. Official musl support is planned but will require more invasive changes to the library. Link: https://lore.kernel.org/lkml/20190110162435.309262-1-arnd@arndb.de/T/ Link: https://lore.kernel.org/lkml/20190118161835.2259170-1-arnd@arndb.de/ Link: https://git.linaro.org/people/arnd/musl-y2038.git/ [2] |
||
|
|
8dabe7245b |
y2038: syscalls: rename y2038 compat syscalls
A lot of system calls that pass a time_t somewhere have an implementation using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have been reworked so that this implementation can now be used on 32-bit architectures as well. The missing step is to redefine them using the regular SYSCALL_DEFINEx() to get them out of the compat namespace and make it possible to build them on 32-bit architectures. Any system call that ends in 'time' gets a '32' suffix on its name for that version, while the others get a '_time32' suffix, to distinguish them from the normal version, which takes a 64-bit time argument in the future. In this step, only 64-bit architectures are changed, doing this rename first lets us avoid touching the 32-bit architectures twice. Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> |
||
|
|
c546951d9c |
sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
move_queued_task() synchronizes with task_rq_lock() as follows: move_queued_task() task_rq_lock() [S] ->on_rq = MIGRATING [L] rq = task_rq() WMB (__set_task_cpu()) ACQUIRE (rq->lock); [S] ->cpu = new_cpu [L] ->on_rq where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before "[L] ->on_rq" by the ACQUIRE itself. Use READ_ONCE() to load ->cpu in task_rq() (c.f., task_cpu()) to honor this address dependency. Also, mark the accesses to ->cpu and ->on_rq with READ_ONCE()/WRITE_ONCE() to comply with the LKMM. Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
2312729688 |
sched/fair: Update scale invariance of PELT
The current implementation of load tracking invariance scales the
contribution with current frequency and uarch performance (only for
utilization) of the CPU. One main result of this formula is that the
figures are capped by current capacity of CPU. Another one is that the
load_avg is not invariant because not scaled with uarch.
The util_avg of a periodic task that runs r time slots every p time slots
varies in the range :
U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
with U is the max util_avg value = SCHED_CAPACITY_SCALE
At a lower capacity, the range becomes:
U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p)
with C reflecting the compute capacity ratio between current capacity and
max capacity.
so C tries to compensate changes in (1-y^r') but it can't be accurate.
Instead of scaling the contribution value of PELT algo, we should scale the
running time. The PELT signal aims to track the amount of computation of
tasks and/or rq so it seems more correct to scale the running time to
reflect the effective amount of computation done since the last update.
In order to be fully invariant, we need to apply the same amount of
running time and idle time whatever the current capacity. Because running
at lower capacity implies that the task will run longer, we have to ensure
that the same amount of idle time will be applied when system becomes idle
and no idle time has been "stolen". But reaching the maximum utilization
value (SCHED_CAPACITY_SCALE) means that the task is seen as an
always-running task whatever the capacity of the CPU (even at max compute
capacity). In this case, we can discard this "stolen" idle times which
becomes meaningless.
In order to achieve this time scaling, a new clock_pelt is created per rq.
The increase of this clock scales with current capacity when something
is running on rq and synchronizes with clock_task when rq is idle. With
this mechanism, we ensure the same running and idle time whatever the
current capacity. This also enables to simplify the pelt algorithm by
removing all references of uarch and frequency and applying the same
contribution to utilization and loads. Furthermore, the scaling is done
only once per update of clock (update_rq_clock_task()) instead of during
each update of sched_entities and cfs/rt/dl_rq of the rq like the current
implementation. This is interesting when cgroup are involved as shown in
the results below:
On a hikey (octo Arm64 platform).
Performance cpufreq governor and only shallowest c-state to remove variance
generated by those power features so we only track the impact of pelt algo.
each test runs 16 times:
./perf bench sched pipe
(higher is better)
kernel tip/sched/core + patch
ops/seconds ops/seconds diff
cgroup
root 59652(+/- 0.18%) 59876(+/- 0.24%) +0.38%
level1 55608(+/- 0.27%) 55923(+/- 0.24%) +0.57%
level2 52115(+/- 0.29%) 52564(+/- 0.22%) +0.86%
hackbench -l 1000
(lower is better)
kernel tip/sched/core + patch
duration(sec) duration(sec) diff
cgroup
root 4.453(+/- 2.37%) 4.383(+/- 2.88%) -1.57%
level1 4.859(+/- 8.50%) 4.830(+/- 7.07%) -0.60%
level2 5.063(+/- 9.83%) 4.928(+/- 9.66%) -2.66%
Then, the responsiveness of PELT is improved when CPU is not running at max
capacity with this new algorithm. I have put below some examples of
duration to reach some typical load values according to the capacity of the
CPU with current implementation and with this patch. These values has been
computed based on the geometric series and the half period value:
Util (%) max capacity half capacity(mainline) half capacity(w/ patch)
972 (95%) 138ms not reachable 276ms
486 (47.5%) 30ms 138ms 60ms
256 (25%) 13ms 32ms 26ms
On my hikey (octo Arm64 platform) with schedutil governor, the time to
reach max OPP when starting from a null utilization, decreases from 223ms
with current scale invariance down to 121ms with the new algorithm.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: patrick.bellasi@arm.com
Cc: pjt@google.com
Cc: pkondeti@codeaurora.org
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
07879c6a37 |
sched/wake_q: Reduce reference counting for special users
Some users, specifically futexes and rwsems, required fixes that allowed the callers to be safe when wakeups occur before they are expected by wake_up_q(). Such scenarios also play games and rely on reference counting, and until now were pivoting on wake_q doing it. With the wake_q_add() call being moved down, this can no longer be the case. As such we end up with a a double task refcounting overhead; and these callers care enough about this (being rather core-ish). This patch introduces a wake_q_add_safe() call that serves for callers that have already done refcounting and therefore the task is 'safe' from wake_q point of view (int that it requires reference throughout the entire queue/>wakeup cycle). In the one case it has internal reference counting, in the other case it consumes the reference counting. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Xie Yongji <xieyongji@baidu.com> Cc: Yongji Xie <elohimes@gmail.com> Cc: andrea.parri@amarulasolutions.com Cc: lilin24@baidu.com Cc: liuqi16@baidu.com Cc: nixun@baidu.com Cc: yuanlinsi01@baidu.com Cc: zhangyu31@baidu.com Link: https://lkml.kernel.org/r/20181218195352.7orq3upiwfdbrdne@linux-r8p5 Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
b5a4e2bb0f |
Revert "sched/core: Take the hotplug lock in sched_init_smp()"
This reverts commit
|
||
|
|
87ff19cb2f |
sched/wake_q: Add branch prediction hint to wake_q_add() cmpxchg
The cmpxchg() will fail when the task is already in the process of waking up, and as such is an extremely rare occurrence. Micro-optimize the call and put an unlikely() around it. To no surprise, when using CONFIG_PROFILE_ANNOTATED_BRANCHES under a number of workloads the incorrect rate was a mere 1-2%. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: Yongji Xie <elohimes@gmail.com> Cc: andrea.parri@amarulasolutions.com Cc: lilin24@baidu.com Cc: liuqi16@baidu.com Cc: nixun@baidu.com Cc: xieyongji@baidu.com Cc: yuanlinsi01@baidu.com Cc: zhangyu31@baidu.com Link: https://lkml.kernel.org/r/20181203053130.gwkw6kg72azt2npb@linux-r8p5 Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
4c4e373156 |
sched/wake_q: Fix wakeup ordering for wake_q
Notable cmpxchg() does not provide ordering when it fails, however
wake_q_add() requires ordering in this specific case too. Without this
it would be possible for the concurrent wakeup to not observe our
prior state.
Andrea Parri provided:
C wake_up_q-wake_q_add
{
int next = 0;
int y = 0;
}
P0(int *next, int *y)
{
int r0;
/* in wake_up_q() */
WRITE_ONCE(*next, 1); /* node->next = NULL */
smp_mb(); /* implied by wake_up_process() */
r0 = READ_ONCE(*y);
}
P1(int *next, int *y)
{
int r1;
/* in wake_q_add() */
WRITE_ONCE(*y, 1); /* wake_cond = true */
smp_mb__before_atomic();
r1 = cmpxchg_relaxed(next, 1, 2);
}
exists (0:r0=0 /\ 1:r1=0)
This "exists" clause cannot be satisfied according to the LKMM:
Test wake_up_q-wake_q_add Allowed
States 3
0:r0=0; 1:r1=1;
0:r0=1; 1:r1=0;
0:r0=1; 1:r1=1;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (0:r0=0 /\ 1:r1=0)
Observation wake_up_q-wake_q_add Never 0 3
Reported-by: Yongji Xie <elohimes@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
e6018c0f5c |
sched/wake_q: Document wake_q_add()
The only guarantee provided by wake_q_add() is that a wakeup will happen after it, it does _NOT_ guarantee the wakeup will be delayed until the matching wake_up_q(). If wake_q_add() fails the cmpxchg() a concurrent wakeup is pending and that can happen at any time after the cmpxchg(). This means we should not rely on the wakeup happening at wake_q_up(), but should be ready for wake_q_add() to issue the wakeup. The delay; if provided (most likely); should only result in more efficient behaviour. Reported-by: Yongji Xie <elohimes@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
e9666d10a5 |
jump_label: move 'asm goto' support test to Kconfig
Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label". The jump label is controlled by HAVE_JUMP_LABEL, which is defined like this: #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL) # define HAVE_JUMP_LABEL #endif We can improve this by testing 'asm goto' support in Kconfig, then make JUMP_LABEL depend on CC_HAS_ASM_GOTO. Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will match to the real kernel capability. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Tested-by: Sedat Dilek <sedat.dilek@gmail.com> |
||
|
|
a65981109f |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton: - procfs updates - various misc bits - lib/ updates - epoll updates - autofs - fatfs - a few more MM bits * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (58 commits) mm/page_io.c: fix polled swap page in checkpatch: add Co-developed-by to signature tags docs: fix Co-Developed-by docs drivers/base/platform.c: kmemleak ignore a known leak fs: don't open code lru_to_page() fs/: remove caller signal_pending branch predictions mm/: remove caller signal_pending branch predictions arch/arc/mm/fault.c: remove caller signal_pending_branch predictions kernel/sched/: remove caller signal_pending branch predictions kernel/locking/mutex.c: remove caller signal_pending branch predictions mm: select HAVE_MOVE_PMD on x86 for faster mremap mm: speed up mremap by 20x on large regions mm: treewide: remove unused address argument from pte_alloc functions initramfs: cleanup incomplete rootfs scripts/gdb: fix lx-version string output kernel/kcov.c: mark write_comp_data() as notrace kernel/sysctl: add panic_print into sysctl panic: add options to print system info when panic happens bfs: extra sanity checking and static inode bitmap exec: separate MM_ANONPAGES and RLIMIT_STACK accounting ... |
||
|
|
34ec35ad8f |
kernel/sched/: remove caller signal_pending branch predictions
This is already done for us internally by the signal machinery. Link: http://lkml.kernel.org/r/20181116002713.8474-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
96d4f267e4 |
Remove 'type' argument from access_ok() function
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
17bf423a1f |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Introduce "Energy Aware Scheduling" - by Quentin Perret.
This is a coherent topology description of CPUs in cooperation with
the PM subsystem, with the goal to schedule more energy-efficiently
on asymetric SMP platform - such as waking up tasks to the more
energy-efficient CPUs first, as long as the system isn't
oversubscribed.
For details of the design, see:
https://lore.kernel.org/lkml/20180724122521.22109-1-quentin.perret@arm.com/
- Misc cleanups and smaller enhancements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
sched/fair: Select an energy-efficient CPU on task wake-up
sched/fair: Introduce an energy estimation helper function
sched/fair: Add over-utilization/tipping point indicator
sched/fair: Clean-up update_sg_lb_stats parameters
sched/toplogy: Introduce the 'sched_energy_present' static key
sched/topology: Make Energy Aware Scheduling depend on schedutil
sched/topology: Disable EAS on inappropriate platforms
sched/topology: Add lowest CPU asymmetry sched_domain level pointer
sched/topology: Reference the Energy Model of CPUs when available
PM: Introduce an Energy Model management framework
sched/cpufreq: Prepare schedutil for Energy Aware Scheduling
sched/topology: Relocate arch_scale_cpu_capacity() to the internal header
sched/core: Remove unnecessary unlikely() in push_*_task()
sched/topology: Remove the ::smt_gain field from 'struct sched_domain'
sched: Fix various typos in comments
sched/core: Clean up the #ifdef block in add_nr_running()
sched/fair: Make some variables static
sched/core: Create task_has_idle_policy() helper
sched/fair: Add lsub_positive() and use it consistently
sched/fair: Mask UTIL_AVG_UNCHANGED usages
...
|