mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-09-04 20:19:47 +08:00
28d74fc36a
124 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
026b98a93b |
sched/numa: Adjust imb_numa_nr to a better approximation of memory channels
For a single LLC per node, a NUMA imbalance is allowed up until 25% of CPUs sharing a node could be active. One intent of the cut-off is to avoid an imbalance of memory channels but there is no topological information based on active memory channels. Furthermore, there can be differences between nodes depending on the number of populated DIMMs. A cut-off of 25% was arbitrary but generally worked. It does have a severe corner cases though when an parallel workload is using 25% of all available CPUs over-saturates memory channels. This can happen due to the initial forking of tasks that get pulled more to one node after early wakeups (e.g. a barrier synchronisation) that is not quickly corrected by the load balancer. The LB may fail to act quickly as the parallel tasks are considered to be poor migrate candidates due to locality or cache hotness. On a range of modern Intel CPUs, 12.5% appears to be a better cut-off assuming all memory channels are populated and is used as the new cut-off point. A minimum of 1 is specified to allow a communicating pair to remain local even for CPUs with low numbers of cores. For modern AMDs, there are multiple LLCs and are not affected. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220520103519.1863-5-mgorman@techsingularity.net |
||
|
|
8a0441415b |
sched: Move energy_aware sysctls to topology.c
move energy_aware sysctls to topology.c and use the new register_sysctl_init() to register the sysctl interface. Signed-off-by: Zhen Ni <nizhen@uniontech.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> |
||
|
|
ccacfe56d7 |
Merge branch 'sched/fast-headers' into sched/core
Merge the scheduler build speedup of the fast-headers tree.
Cumulative scheduler (kernel/sched/) build time speedup on a
Linux distribution's config, which enables all scheduler features,
compared to the vanilla kernel:
_____________________________________________________________________________
|
| Vanilla kernel (v5.13-rc7):
|_____________________________________________________________________________
|
| Performance counter stats for 'make -j96 kernel/sched/' (3 runs):
|
| 126,975,564,374 instructions # 1.45 insn per cycle ( +- 0.00% )
| 87,637,847,671 cycles # 3.959 GHz ( +- 0.30% )
| 22,136.96 msec cpu-clock # 7.499 CPUs utilized ( +- 0.29% )
|
| 2.9520 +- 0.0169 seconds time elapsed ( +- 0.57% )
|_____________________________________________________________________________
|
| Patched kernel:
|_____________________________________________________________________________
|
| Performance counter stats for 'make -j96 kernel/sched/' (3 runs):
|
| 50,420,496,914 instructions # 1.47 insn per cycle ( +- 0.00% )
| 34,234,322,038 cycles # 3.946 GHz ( +- 0.31% )
| 8,675.81 msec cpu-clock # 3.053 CPUs utilized ( +- 0.45% )
|
| 2.8420 +- 0.0181 seconds time elapsed ( +- 0.64% )
|_____________________________________________________________________________
Summary:
- CPU time used to build the scheduler dropped by -60.9%, a reduction
from 22.1 clock-seconds to 8.7 clock-seconds.
- Wall-clock time to build the scheduler dropped by -3.9%, a reduction
from 2.95 seconds to 2.84 seconds.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
7f434dff76 |
sched/topology: Remove redundant variable and fix incorrect type in build_sched_domains
While investigating the sparse warning reported by the LKP bot [1], observed that we have a redundant variable "top" in the function build_sched_domains that was introduced in the recent commit |
||
|
|
801c141955 |
sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there
Collect all utility functionality source code files into a single kernel/sched/build_utility.c file,
via #include-ing the .c files:
kernel/sched/clock.c
kernel/sched/completion.c
kernel/sched/loadavg.c
kernel/sched/swait.c
kernel/sched/wait_bit.c
kernel/sched/wait.c
CONFIG_CPU_FREQ:
kernel/sched/cpufreq.c
CONFIG_CPU_FREQ_GOV_SCHEDUTIL:
kernel/sched/cpufreq_schedutil.c
CONFIG_CGROUP_CPUACCT:
kernel/sched/cpuacct.c
CONFIG_SCHED_DEBUG:
kernel/sched/debug.c
CONFIG_SCHEDSTATS:
kernel/sched/stats.c
CONFIG_SMP:
kernel/sched/cpupri.c
kernel/sched/stop_task.c
kernel/sched/topology.c
CONFIG_SCHED_CORE:
kernel/sched/core_sched.c
CONFIG_PSI:
kernel/sched/psi.c
CONFIG_MEMBARRIER:
kernel/sched/membarrier.c
CONFIG_CPU_ISOLATION:
kernel/sched/isolation.c
CONFIG_SCHED_AUTOGROUP:
kernel/sched/autogroup.c
The goal is to amortize the 60+ KLOC header bloat from over a dozen build units into
a single build unit.
The build time of build_utility.c also roughly matches the build time of core.c and
fair.c - allowing better load-balancing of scheduler-only rebuilds.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
|
||
|
|
04d4e665a6 |
sched/isolation: Use single feature type while referring to housekeeping cpumask
Refer to housekeeping APIs using single feature types instead of flags. This prevents from passing multiple isolation features at once to housekeeping interfaces, which soon won't be possible anymore as each isolation features will have their own cpumask. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org |
||
|
|
0fb3978b0a |
sched/numa: Fix NUMA topology for systems with CPU-less nodes
The NUMA topology parameters (sched_numa_topology_type, sched_domains_numa_levels, and sched_max_numa_distance, etc.) identified by scheduler may be wrong for systems with CPU-less nodes. For example, the ACPI SLIT of a system with CPU-less persistent memory (Intel Optane DCPMM) nodes is as follows, [000h 0000 4] Signature : "SLIT" [System Locality Information Table] [004h 0004 4] Table Length : 0000042C [008h 0008 1] Revision : 01 [009h 0009 1] Checksum : 59 [00Ah 0010 6] Oem ID : "XXXX" [010h 0016 8] Oem Table ID : "XXXXXXX" [018h 0024 4] Oem Revision : 00000001 [01Ch 0028 4] Asl Compiler ID : "INTL" [020h 0032 4] Asl Compiler Revision : 20091013 [024h 0036 8] Localities : 0000000000000004 [02Ch 0044 4] Locality 0 : 0A 15 11 1C [030h 0048 4] Locality 1 : 15 0A 1C 11 [034h 0052 4] Locality 2 : 11 1C 0A 1C [038h 0056 4] Locality 3 : 1C 11 1C 0A While the `numactl -H` output is as follows, available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 0 size: 64136 MB node 0 free: 5981 MB node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 1 size: 64466 MB node 1 free: 10415 MB node 2 cpus: node 2 size: 253952 MB node 2 free: 253920 MB node 3 cpus: node 3 size: 253952 MB node 3 free: 253951 MB node distances: node 0 1 2 3 0: 10 21 17 28 1: 21 10 28 17 2: 17 28 10 28 3: 28 17 28 10 In this system, there are only 2 sockets. In each memory controller, both DRAM and PMEM DIMMs are installed. Although the physical NUMA topology is simple, the logical NUMA topology becomes a little complex. Because both the distance(0, 1) and distance (1, 3) are less than the distance (0, 3), it appears that node 1 sits between node 0 and node 3. And the whole system appears to be a glueless mesh NUMA topology type. But it's definitely not, there is even no CPU in node 3. This isn't a practical problem now yet. Because the PMEM nodes (node 2 and node 3 in example system) are offlined by default during system boot. So init_numa_topology_type() called during system boot will ignore them and set sched_numa_topology_type to NUMA_DIRECT. And init_numa_topology_type() is only called at runtime when a CPU of a never-onlined-before node gets plugged in. And there's no CPU in the PMEM nodes. But it appears better to fix this to make the code more robust. To test the potential problem. We have used a debug patch to call init_numa_topology_type() when the PMEM node is onlined (in __set_migration_target_nodes()). With that, the NUMA parameters identified by scheduler is as follows, sched_numa_topology_type: NUMA_GLUELESS_MESH sched_domains_numa_levels: 4 sched_max_numa_distance: 28 To fix the issue, the CPU-less nodes are ignored when the NUMA topology parameters are identified. Because a node may become CPU-less or not at run time because of CPU hotplug, the NUMA topology parameters need to be re-initialized at runtime for CPU hotplug too. With the patch, the NUMA parameters identified for the example system above is as follows, sched_numa_topology_type: NUMA_DIRECT sched_domains_numa_levels: 2 sched_max_numa_distance: 21 Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com |
||
|
|
1087ad4e3f |
sched: replace cpumask_weight with cpumask_empty where appropriate
In some places, kernel/sched code calls cpumask_weight() to check if any bit of a given cpumask is set. We can do it more efficiently with cpumask_empty() because cpumask_empty() stops traversing the cpumask as soon as it finds first set bit, while cpumask_weight() counts all bits unconditionally. Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220210224933.379149-23-yury.norov@gmail.com |
||
|
|
e496132ebe |
sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
Commit
|
||
|
|
512b7931ad |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "257 patches. Subsystems affected by this patch series: scripts, ocfs2, vfs, and mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache, gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools, memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm, vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram, cleanups, kfence, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits) mm/damon: remove return value from before_terminate callback mm/damon: fix a few spelling mistakes in comments and a pr_debug message mm/damon: simplify stop mechanism Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Docs/admin-guide/mm/damon/start: simplify the content Docs/admin-guide/mm/damon/start: fix a wrong link Docs/admin-guide/mm/damon/start: fix wrong example commands mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on mm/damon: remove unnecessary variable initialization Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) selftests/damon: support watermarks mm/damon/dbgfs: support watermarks mm/damon/schemes: activate schemes based on a watermarks mechanism tools/selftests/damon: update for regions prioritization of schemes mm/damon/dbgfs: support prioritization weights mm/damon/vaddr,paddr: support pageout prioritization mm/damon/schemes: prioritize regions within the quotas mm/damon/selftests: support schemes quotas mm/damon/dbgfs: support quotas of schemes ... |
||
|
|
61bb6cd2f7 |
mm: move node_reclaim_distance to fix NUMA without SMP
Patch series "Fix NUMA without SMP".
SuperH is the only architecture which still supports NUMA without SMP,
for good reasons (various memories scattered around the address space,
each with varying latencies).
This series fixes two build errors due to variables and functions used
by the NUMA code being provided by SMP-only source files or sections.
This patch (of 2):
If CONFIG_NUMA=y, but CONFIG_SMP=n (e.g. sh/migor_defconfig):
sh4-linux-gnu-ld: mm/page_alloc.o: in function `get_page_from_freelist':
page_alloc.c:(.text+0x2c24): undefined reference to `node_reclaim_distance'
Fix this by moving the declaration of node_reclaim_distance from an
SMP-only to a generic file.
Link: https://lkml.kernel.org/r/cover.1631781495.git.geert+renesas@glider.be
Link: https://lkml.kernel.org/r/6432666a648dde85635341e6c918cee97c97d264.1631781495.git.geert+renesas@glider.be
Fixes:
|
||
|
|
e60b56e46b |
sched/fair: Wait before decaying max_newidle_lb_cost
Decay max_newidle_lb_cost only when it has not been updated for a while and ensure to not decay a recently changed value. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20211019123537.17146-4-vincent.guittot@linaro.org |
||
|
|
da6ff09943 |
sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
The push-IPI logic for RT tasks expects to be invoked from hardirq context. One reason is that a RT task on the remote CPU would block the softirq processing on PREEMPT_RT and so avoid pulling / balancing the RT tasks as intended. Annotate root_domain::rto_push_work as IRQ_WORK_HARD_IRQ. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211006111852.1514359-2-bigeasy@linutronix.de |
||
|
|
778c558f49 |
sched: Add cluster scheduler level in core and related Kconfig for ARM64
This patch adds scheduler level for clusters and automatically enables
the load balance among clusters. It will directly benefit a lot of
workload which loves more resources such as memory bandwidth, caches.
Testing has widely been done in two different hardware configurations of
Kunpeng920:
24 cores in one NUMA(6 clusters in each NUMA node);
32 cores in one NUMA(8 clusters in each NUMA node)
Workload is running on either one NUMA node or four NUMA nodes, thus,
this can estimate the effect of cluster spreading w/ and w/o NUMA load
balance.
* Stream benchmark:
4threads stream (on 1NUMA * 24cores = 24cores)
stream stream
w/o patch w/ patch
MB/sec copy 29929.64 ( 0.00%) 32932.68 ( 10.03%)
MB/sec scale 29861.10 ( 0.00%) 32710.58 ( 9.54%)
MB/sec add 27034.42 ( 0.00%) 32400.68 ( 19.85%)
MB/sec triad 27225.26 ( 0.00%) 31965.36 ( 17.41%)
6threads stream (on 1NUMA * 24cores = 24cores)
stream stream
w/o patch w/ patch
MB/sec copy 40330.24 ( 0.00%) 42377.68 ( 5.08%)
MB/sec scale 40196.42 ( 0.00%) 42197.90 ( 4.98%)
MB/sec add 37427.00 ( 0.00%) 41960.78 ( 12.11%)
MB/sec triad 37841.36 ( 0.00%) 42513.64 ( 12.35%)
12threads stream (on 1NUMA * 24cores = 24cores)
stream stream
w/o patch w/ patch
MB/sec copy 52639.82 ( 0.00%) 53818.04 ( 2.24%)
MB/sec scale 52350.30 ( 0.00%) 53253.38 ( 1.73%)
MB/sec add 53607.68 ( 0.00%) 55198.82 ( 2.97%)
MB/sec triad 54776.66 ( 0.00%) 56360.40 ( 2.89%)
Thus, it could help memory-bound workload especially under medium load.
Similar improvement is also seen in lkp-pbzip2:
* lkp-pbzip2 benchmark
2-96 threads (on 4NUMA * 24cores = 96cores)
lkp-pbzip2 lkp-pbzip2
w/o patch w/ patch
Hmean tput-2 11062841.57 ( 0.00%) 11341817.51 * 2.52%*
Hmean tput-5 26815503.70 ( 0.00%) 27412872.65 * 2.23%*
Hmean tput-8 41873782.21 ( 0.00%) 43326212.92 * 3.47%*
Hmean tput-12 61875980.48 ( 0.00%) 64578337.51 * 4.37%*
Hmean tput-21 105814963.07 ( 0.00%) 111381851.01 * 5.26%*
Hmean tput-30 150349470.98 ( 0.00%) 156507070.73 * 4.10%*
Hmean tput-48 237195937.69 ( 0.00%) 242353597.17 * 2.17%*
Hmean tput-79 360252509.37 ( 0.00%) 362635169.23 * 0.66%*
Hmean tput-96 394571737.90 ( 0.00%) 400952978.48 * 1.62%*
2-24 threads (on 1NUMA * 24cores = 24cores)
lkp-pbzip2 lkp-pbzip2
w/o patch w/ patch
Hmean tput-2 11071705.49 ( 0.00%) 11296869.10 * 2.03%*
Hmean tput-4 20782165.19 ( 0.00%) 21949232.15 * 5.62%*
Hmean tput-6 30489565.14 ( 0.00%) 33023026.96 * 8.31%*
Hmean tput-8 40376495.80 ( 0.00%) 42779286.27 * 5.95%*
Hmean tput-12 61264033.85 ( 0.00%) 62995632.78 * 2.83%*
Hmean tput-18 86697139.39 ( 0.00%) 86461545.74 ( -0.27%)
Hmean tput-24 104854637.04 ( 0.00%) 104522649.46 * -0.32%*
In the case of 6 threads and 8 threads, we see the greatest performance
improvement.
Similar improvement can be seen on lkp-pixz though the improvement is
smaller:
* lkp-pixz benchmark
2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores)
lkp-pixz lkp-pixz
w/o patch w/ patch
Hmean tput-2 6486981.16 ( 0.00%) 6561515.98 * 1.15%*
Hmean tput-4 11645766.38 ( 0.00%) 11614628.43 ( -0.27%)
Hmean tput-6 15429943.96 ( 0.00%) 15957350.76 * 3.42%*
Hmean tput-8 19974087.63 ( 0.00%) 20413746.98 * 2.20%*
Hmean tput-12 28172068.18 ( 0.00%) 28751997.06 * 2.06%*
Hmean tput-18 39413409.54 ( 0.00%) 39896830.55 * 1.23%*
Hmean tput-24 49101815.85 ( 0.00%) 49418141.47 * 0.64%*
* SPECrate benchmark
4,8,16 copies mcf_r(on 1NUMA * 32cores = 32cores)
Base Base
Run Time Rate
------- ---------
4 Copies w/o 580 (w/ 570) w/o 11.1 (w/ 11.3)
8 Copies w/o 647 (w/ 605) w/o 20.0 (w/ 21.4, +7%)
16 Copies w/o 844 (w/ 844) w/o 30.6 (w/ 30.6)
32 Copies(on 4NUMA * 32 cores = 128cores)
[w/o patch]
Base Base Base
Benchmarks Copies Run Time Rate
--------------- ------- --------- ---------
500.perlbench_r 32 584 87.2 *
502.gcc_r 32 503 90.2 *
505.mcf_r 32 745 69.4 *
520.omnetpp_r 32 1031 40.7 *
523.xalancbmk_r 32 597 56.6 *
525.x264_r 1 -- CE
531.deepsjeng_r 32 336 109 *
541.leela_r 32 556 95.4 *
548.exchange2_r 32 513 163 *
557.xz_r 32 530 65.2 *
Est. SPECrate2017_int_base 80.3
[w/ patch]
Base Base Base
Benchmarks Copies Run Time Rate
--------------- ------- --------- ---------
500.perlbench_r 32 580 87.8 (+0.688%) *
502.gcc_r 32 477 95.1 (+5.432%) *
505.mcf_r 32 644 80.3 (+13.574%) *
520.omnetpp_r 32 942 44.6 (+9.58%) *
523.xalancbmk_r 32 560 60.4 (+6.714%%) *
525.x264_r 1 -- CE
531.deepsjeng_r 32 337 109 (+0.000%) *
541.leela_r 32 554 95.6 (+0.210%) *
548.exchange2_r 32 515 163 (+0.000%) *
557.xz_r 32 524 66.0 (+1.227%) *
Est. SPECrate2017_int_base 83.7 (+4.062%)
On the other hand, it is slightly helpful to CPU-bound tasks like
kernbench:
* 24-96 threads kernbench (on 4NUMA * 24cores = 96cores)
kernbench kernbench
w/o cluster w/ cluster
Min user-24 12054.67 ( 0.00%) 12024.19 ( 0.25%)
Min syst-24 1751.51 ( 0.00%) 1731.68 ( 1.13%)
Min elsp-24 600.46 ( 0.00%) 598.64 ( 0.30%)
Min user-48 12361.93 ( 0.00%) 12315.32 ( 0.38%)
Min syst-48 1917.66 ( 0.00%) 1892.73 ( 1.30%)
Min elsp-48 333.96 ( 0.00%) 332.57 ( 0.42%)
Min user-96 12922.40 ( 0.00%) 12921.17 ( 0.01%)
Min syst-96 2143.94 ( 0.00%) 2110.39 ( 1.56%)
Min elsp-96 211.22 ( 0.00%) 210.47 ( 0.36%)
Amean user-24 12063.99 ( 0.00%) 12030.78 * 0.28%*
Amean syst-24 1755.20 ( 0.00%) 1735.53 * 1.12%*
Amean elsp-24 601.60 ( 0.00%) 600.19 ( 0.23%)
Amean user-48 12362.62 ( 0.00%) 12315.56 * 0.38%*
Amean syst-48 1921.59 ( 0.00%) 1894.95 * 1.39%*
Amean elsp-48 334.10 ( 0.00%) 332.82 * 0.38%*
Amean user-96 12925.27 ( 0.00%) 12922.63 ( 0.02%)
Amean syst-96 2146.66 ( 0.00%) 2122.20 * 1.14%*
Amean elsp-96 211.96 ( 0.00%) 211.79 ( 0.08%)
Note this patch isn't an universal win, it might hurt those workload
which can benefit from packing. Though tasks which want to take
advantages of lower communication latency of one cluster won't
necessarily been packed in one cluster while kernel is not aware of
clusters, they have some chance to be randomly packed. But this
patch will make them more likely spread.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
||
|
|
f9ec6fea20 |
sched/topology: Remove unused numa_distance in cpu_attach_domain()
numa_distance in cpu_attach_domain() is introduced in commit |
||
|
|
16d364ba6e |
sched/topology: Introduce sched_group::flags
There exist situations in which the load balance needs to know the properties of the CPUs in a scheduling group. When using asymmetric packing, for instance, the load balancer needs to know not only the state of dst_cpu but also of its SMT siblings, if any. Use the flags of the child scheduling domains to initialize scheduling group flags. This will reflect the properties of the CPUs in the group. A subsequent changeset will make use of these new flags. No functional changes are introduced. Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Len Brown <len.brown@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210911011819.12184-3-ricardo.neri-calderon@linux.intel.com |
||
|
|
0083242c93 |
sched/topology: Skip updating masks for non-online nodes
The scheduler currently expects NUMA node distances to be stable from init onwards, and as a consequence builds the related data structures once-and-for-all at init (see sched_init_numa()). Unfortunately, on some architectures node distance is unreliable for offline nodes and may very well change upon onlining. Skip over offline nodes during sched_init_numa(). Track nodes that have been onlined at least once, and trigger a build of a node's NUMA masks when it is first onlined post-init. Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210818074333.48645-1-srikar@linux.vnet.ibm.com |
||
|
|
c744dc4ab5 |
sched/topology: Rework CPU capacity asymmetry detection
Currently the CPU capacity asymmetry detection, performed through asym_cpu_capacity_level, tries to identify the lowest topology level at which the highest CPU capacity is being observed, not necessarily finding the level at which all possible capacity values are visible to all CPUs, which might be bit problematic for some possible/valid asymmetric topologies i.e.: DIE [ ] MC [ ][ ] CPU [0] [1] [2] [3] [4] [5] [6] [7] Capacity |.....| |.....| |.....| |.....| L M B B Where: arch_scale_cpu_capacity(L) = 512 arch_scale_cpu_capacity(M) = 871 arch_scale_cpu_capacity(B) = 1024 In this particular case, the asymmetric topology level will point at MC, as all possible CPU masks for that level do cover the CPU with the highest capacity. It will work just fine for the first cluster, not so much for the second one though (consider the find_energy_efficient_cpu which might end up attempting the energy aware wake-up for a domain that does not see any asymmetry at all) Rework the way the capacity asymmetry levels are being detected, allowing to point to the lowest topology level (for a given CPU), where full set of available CPU capacities is visible to all CPUs within given domain. As a result, the per-cpu sd_asym_cpucapacity might differ across the domains. This will have an impact on EAS wake-up placement in a way that it might see different range of CPUs to be considered, depending on the given current and target CPUs. Additionally, those levels, where any range of asymmetry (not necessarily full) is being detected will get identified as well. The selected asymmetric topology level will be denoted by SD_ASYM_CPUCAPACITY_FULL sched domain flag whereas the 'sub-levels' would receive the already used SD_ASYM_CPUCAPACITY flag. This allows maintaining the current behaviour for asymmetric topologies, with misfit migration operating correctly on lower levels, if applicable, as any asymmetry is enough to trigger the misfit migration. The logic there relies on the SD_ASYM_CPUCAPACITY flag and does not relate to the full asymmetry level denoted by the sd_asym_cpucapacity pointer. Detecting the CPU capacity asymmetry is being based on a set of available CPU capacities for all possible CPUs. This data is being generated upon init and updated once CPU topology changes are being detected (through arch_update_cpu_topology). As such, any changes to identified CPU capacities (like initializing cpufreq) need to be explicitly advertised by corresponding archs to trigger rebuilding the data. Additional -dflags- parameter, used when building sched domains, has been removed as well, as the asymmetry flags are now being set directly in sd_init. Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Beata Michalska <beata.michalska@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20210603140627.8409-3-beata.michalska@arm.com |
||
|
|
5cb9eaa3d2 |
sched: Wrap rq::lock access
In preparation of playing games with rq->lock, abstract the thing using an accessor. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.136465446@infradead.org |
||
|
|
9406415f46 |
sched/debug: Rename the sched_debug parameter to sched_verbose
CONFIG_SCHED_DEBUG is the build-time Kconfig knob, the boot param sched_debug and the /debug/sched/debug_enabled knobs control the sched_debug_enabled variable, but what they really do is make SCHED_DEBUG more verbose, so rename the lot. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> |
||
|
|
3b87f136f8 |
sched,debug: Convert sysctl sched_domains to debugfs
Stop polluting sysctl, move to debugfs for SCHED_DEBUG stuff. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/YHgB/s4KCBQ1ifdm@hirez.programming.kicks-ass.net |
||
|
|
0a2b65c03e |
sched/topology: Remove redundant cpumask_and() in init_overlap_sched_group()
mask is built in build_balance_mask() by for_each_cpu(i, sg_span), so it must be a subset of sched_group_span(sg). So the cpumask_and() call is redundant - remove it. [ mingo: Adjusted the changelog a bit. ] Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Link: https://lore.kernel.org/r/20210325023140.23456-1-song.bao.hua@hisilicon.com |
||
|
|
3b03706fa6 |
sched: Fix various typos
Fix ~42 single-word typos in scheduler code comments. We have accumulated a few fun ones over the years. :-) Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: linux-kernel@vger.kernel.org |
||
|
|
585b6d2723 |
sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
As long as NUMA diameter > 2, building sched_domain by sibling's child
domain will definitely create a sched_domain with sched_group which will
span out of the sched_domain:
+------+ +------+ +-------+ +------+
| node | 12 |node | 20 | node | 12 |node |
| 0 +---------+1 +--------+ 2 +-------+3 |
+------+ +------+ +-------+ +------+
domain0 node0 node1 node2 node3
domain1 node0+1 node0+1 node2+3 node2+3
+
domain2 node0+1+2 |
group: node0+1 |
group:node2+3 <-------------------+
when node2 is added into the domain2 of node0, kernel is using the child
domain of node2's domain2, which is domain1(node2+3). Node 3 is outside
the span of the domain including node0+1+2.
This will make load_balance() run based on screwed avg_load and group_type
in the sched_group spanning out of the sched_domain, and it also makes
select_task_rq_fair() pick an idle CPU outside the sched_domain.
Real servers which suffer from this problem include Kunpeng920 and 8-node
Sun Fire X4600-M2, at least.
Here we move to use the *child* domain of the *child* domain of node2's
domain2 as the new added sched_group. At the same, we re-use the lower
level sgc directly.
+------+ +------+ +-------+ +------+
| node | 12 |node | 20 | node | 12 |node |
| 0 +---------+1 +--------+ 2 +-------+3 |
+------+ +------+ +-------+ +------+
domain0 node0 node1 +- node2 node3
|
domain1 node0+1 node0+1 | node2+3 node2+3
|
domain2 node0+1+2 |
group: node0+1 |
group:node2 <-------------------+
While the lower level sgc is re-used, this patch only changes the remote
sched_groups for those sched_domains playing grandchild trick, therefore,
sgc->next_update is still safe since it's only touched by CPUs that have
the group span as local group. And sgc->imbalance is also safe because
sd_parent remains the same in load_balance and LB only tries other CPUs
from the local group.
Moreover, since local groups are not touched, they are still getting
roughly equal size in a TL. And should_we_balance() only matters with
local groups, so the pull probability of those groups are still roughly
equal.
Tested by the below topology:
qemu-system-aarch64 -M virt -nographic \
-smp cpus=8 \
-numa node,cpus=0-1,nodeid=0 \
-numa node,cpus=2-3,nodeid=1 \
-numa node,cpus=4-5,nodeid=2 \
-numa node,cpus=6-7,nodeid=3 \
-numa dist,src=0,dst=1,val=12 \
-numa dist,src=0,dst=2,val=20 \
-numa dist,src=0,dst=3,val=22 \
-numa dist,src=1,dst=2,val=22 \
-numa dist,src=2,dst=3,val=12 \
-numa dist,src=1,dst=3,val=24 \
-m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image
w/o patch, we get lots of "groups don't span domain->span":
[ 0.802139] CPU0 attaching sched-domain(s):
[ 0.802193] domain-0: span=0-1 level=MC
[ 0.802443] groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[ 0.802693] domain-1: span=0-3 level=NUMA
[ 0.802731] groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[ 0.802811] domain-2: span=0-5 level=NUMA
[ 0.802829] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[ 0.802881] ERROR: groups don't span domain->span
[ 0.803058] domain-3: span=0-7 level=NUMA
[ 0.803080] groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[ 0.804055] CPU1 attaching sched-domain(s):
[ 0.804072] domain-0: span=0-1 level=MC
[ 0.804096] groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
[ 0.804152] domain-1: span=0-3 level=NUMA
[ 0.804170] groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[ 0.804219] domain-2: span=0-5 level=NUMA
[ 0.804236] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[ 0.804302] ERROR: groups don't span domain->span
[ 0.804520] domain-3: span=0-7 level=NUMA
[ 0.804546] groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[ 0.804677] CPU2 attaching sched-domain(s):
[ 0.804687] domain-0: span=2-3 level=MC
[ 0.804705] groups: 2:{ span=2 cap=934 }, 3:{ span=3 cap=1009 }
[ 0.804754] domain-1: span=0-3 level=NUMA
[ 0.804772] groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 }
[ 0.804820] domain-2: span=0-5 level=NUMA
[ 0.804836] groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 }
[ 0.804944] ERROR: groups don't span domain->span
[ 0.805108] domain-3: span=0-7 level=NUMA
[ 0.805134] groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 }
[ 0.805223] CPU3 attaching sched-domain(s):
[ 0.805232] domain-0: span=2-3 level=MC
[ 0.805249] groups: 3:{ span=3 cap=1009 }, 2:{ span=2 cap=934 }
[ 0.805319] domain-1: span=0-3 level=NUMA
[ 0.805336] groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 }
[ 0.805383] domain-2: span=0-5 level=NUMA
[ 0.805399] groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 }
[ 0.805458] ERROR: groups don't span domain->span
[ 0.805605] domain-3: span=0-7 level=NUMA
[ 0.805626] groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 }
[ 0.805712] CPU4 attaching sched-domain(s):
[ 0.805721] domain-0: span=4-5 level=MC
[ 0.805738] groups: 4:{ span=4 cap=984 }, 5:{ span=5 cap=924 }
[ 0.805787] domain-1: span=4-7 level=NUMA
[ 0.805803] groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 }
[ 0.805851] domain-2: span=0-1,4-7 level=NUMA
[ 0.805867] groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 }
[ 0.805915] ERROR: groups don't span domain->span
[ 0.806108] domain-3: span=0-7 level=NUMA
[ 0.806130] groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 }
[ 0.806214] CPU5 attaching sched-domain(s):
[ 0.806222] domain-0: span=4-5 level=MC
[ 0.806240] groups: 5:{ span=5 cap=924 }, 4:{ span=4 cap=984 }
[ 0.806841] domain-1: span=4-7 level=NUMA
[ 0.806866] groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 }
[ 0.806934] domain-2: span=0-1,4-7 level=NUMA
[ 0.806953] groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 }
[ 0.807004] ERROR: groups don't span domain->span
[ 0.807312] domain-3: span=0-7 level=NUMA
[ 0.807386] groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 }
[ 0.807686] CPU6 attaching sched-domain(s):
[ 0.807710] domain-0: span=6-7 level=MC
[ 0.807750] groups: 6:{ span=6 cap=1017 }, 7:{ span=7 cap=1012 }
[ 0.807840] domain-1: span=4-7 level=NUMA
[ 0.807870] groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 }
[ 0.807952] domain-2: span=0-1,4-7 level=NUMA
[ 0.807985] groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 }
[ 0.808045] ERROR: groups don't span domain->span
[ 0.808257] domain-3: span=0-7 level=NUMA
[ 0.808571] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6125 }, 2:{ span=0-5 mask=2-3 cap=5899 }
[ 0.808848] CPU7 attaching sched-domain(s):
[ 0.808860] domain-0: span=6-7 level=MC
[ 0.808880] groups: 7:{ span=7 cap=1012 }, 6:{ span=6 cap=1017 }
[ 0.808953] domain-1: span=4-7 level=NUMA
[ 0.808974] groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 }
[ 0.809034] domain-2: span=0-1,4-7 level=NUMA
[ 0.809055] groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 }
[ 0.809128] ERROR: groups don't span domain->span
[ 0.810361] domain-3: span=0-7 level=NUMA
[ 0.810400] groups: 6:{ span=0-1,4-7 mask=6-7 cap=5961 }, 2:{ span=0-5 mask=2-3 cap=5903 }
w/ patch, we don't get "groups don't span domain->span" any more:
[ 1.486271] CPU0 attaching sched-domain(s):
[ 1.486820] domain-0: span=0-1 level=MC
[ 1.500924] groups: 0:{ span=0 cap=980 }, 1:{ span=1 cap=994 }
[ 1.515717] domain-1: span=0-3 level=NUMA
[ 1.515903] groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
[ 1.516989] domain-2: span=0-5 level=NUMA
[ 1.517124] groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }
[ 1.517369] domain-3: span=0-7 level=NUMA
[ 1.517423] groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }
[ 1.520027] CPU1 attaching sched-domain(s):
[ 1.520097] domain-0: span=0-1 level=MC
[ 1.520184] groups: 1:{ span=1 cap=994 }, 0:{ span=0 cap=980 }
[ 1.520429] domain-1: span=0-3 level=NUMA
[ 1.520487] groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
[ 1.520687] domain-2: span=0-5 level=NUMA
[ 1.520744] groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }
[ 1.520948] domain-3: span=0-7 level=NUMA
[ 1.521038] groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }
[ 1.522068] CPU2 attaching sched-domain(s):
[ 1.522348] domain-0: span=2-3 level=MC
[ 1.522606] groups: 2:{ span=2 cap=1003 }, 3:{ span=3 cap=986 }
[ 1.522832] domain-1: span=0-3 level=NUMA
[ 1.522885] groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 }
[ 1.523043] domain-2: span=0-5 level=NUMA
[ 1.523092] groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 }
[ 1.523302] domain-3: span=0-7 level=NUMA
[ 1.523352] groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 }
[ 1.523748] CPU3 attaching sched-domain(s):
[ 1.523774] domain-0: span=2-3 level=MC
[ 1.523825] groups: 3:{ span=3 cap=986 }, 2:{ span=2 cap=1003 }
[ 1.524009] domain-1: span=0-3 level=NUMA
[ 1.524086] groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 }
[ 1.524281] domain-2: span=0-5 level=NUMA
[ 1.524331] groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 }
[ 1.524534] domain-3: span=0-7 level=NUMA
[ 1.524586] groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 }
[ 1.524847] CPU4 attaching sched-domain(s):
[ 1.524873] domain-0: span=4-5 level=MC
[ 1.524954] groups: 4:{ span=4 cap=958 }, 5:{ span=5 cap=991 }
[ 1.525105] domain-1: span=4-7 level=NUMA
[ 1.525153] groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 }
[ 1.525368] domain-2: span=0-1,4-7 level=NUMA
[ 1.525428] groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 }
[ 1.532726] domain-3: span=0-7 level=NUMA
[ 1.532811] groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=4037 }
[ 1.534125] CPU5 attaching sched-domain(s):
[ 1.534159] domain-0: span=4-5 level=MC
[ 1.534303] groups: 5:{ span=5 cap=991 }, 4:{ span=4 cap=958 }
[ 1.534490] domain-1: span=4-7 level=NUMA
[ 1.534572] groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 }
[ 1.534734] domain-2: span=0-1,4-7 level=NUMA
[ 1.534783] groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 }
[ 1.536057] domain-3: span=0-7 level=NUMA
[ 1.536430] groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=3896 }
[ 1.536815] CPU6 attaching sched-domain(s):
[ 1.536846] domain-0: span=6-7 level=MC
[ 1.536934] groups: 6:{ span=6 cap=1005 }, 7:{ span=7 cap=1001 }
[ 1.537144] domain-1: span=4-7 level=NUMA
[ 1.537262] groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 }
[ 1.537553] domain-2: span=0-1,4-7 level=NUMA
[ 1.537613] groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 }
[ 1.537872] domain-3: span=0-7 level=NUMA
[ 1.537998] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 }
[ 1.538448] CPU7 attaching sched-domain(s):
[ 1.538505] domain-0: span=6-7 level=MC
[ 1.538586] groups: 7:{ span=7 cap=1001 }, 6:{ span=6 cap=1005 }
[ 1.538746] domain-1: span=4-7 level=NUMA
[ 1.538798] groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 }
[ 1.539048] domain-2: span=0-1,4-7 level=NUMA
[ 1.539111] groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 }
[ 1.539571] domain-3: span=0-7 level=NUMA
[ 1.539610] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 }
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Meelis Roos <mroos@linux.ee>
Link: https://lkml.kernel.org/r/20210224030944.15232-1-song.bao.hua@hisilicon.com
|
||
|
|
71e5f6644f |
sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa()
Commit "sched/topology: Make sched_init_numa() use a set for the
deduplicating sort" allocates 'i + nr_levels (level)' instead of
'i + nr_levels + 1' sched_domain_topology_level.
This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):
sched_init_domains
build_sched_domains()
__free_domain_allocs()
__sdt_free() {
...
for_each_sd_topology(tl)
...
sd = *per_cpu_ptr(sdd->sd, j); <--
...
}
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Barry Song <song.bao.hua@hisilicon.com>
Link: https://lkml.kernel.org/r/6000e39e-7d28-c360-9cd6-8798fd22a9bf@arm.com
|
||
|
|
620a6dc407 |
sched/topology: Make sched_init_numa() use a set for the deduplicating sort
The deduplicating sort in sched_init_numa() assumes that the first line in
the distance table contains all unique values in the entire table. I've
been trying to pen what this exactly means for the topology, but it's not
straightforward. For instance, topology.c uses this example:
node 0 1 2 3
0: 10 20 20 30
1: 20 10 20 20
2: 20 20 10 20
3: 30 20 20 10
0 ----- 1
| / |
| / |
| / |
2 ----- 3
Which works out just fine. However, if we swap nodes 0 and 1:
1 ----- 0
| / |
| / |
| / |
2 ----- 3
we get this distance table:
node 0 1 2 3
0: 10 20 20 20
1: 20 10 20 30
2: 20 20 10 20
3: 20 30 20 10
Which breaks the deduplicating sort (non-representative first line). In
this case this would just be a renumbering exercise, but it so happens that
we can have a deduplicating sort that goes through the whole table in O(n²)
at the extra cost of a temporary memory allocation (i.e. any form of set).
The ACPI spec (SLIT) mentions distances are encoded on 8 bits. Following
this, implement the set as a 256-bits bitmap. Should this not be
satisfactory (i.e. we want to support 32-bit values), then we'll have to go
for some other sparse set implementation.
This has the added benefit of letting us allocate just the right amount of
memory for sched_domains_numa_distance[], rather than an arbitrary
(nr_node_ids + 1).
Note: DT binding equivalent (distance-map) decodes distances as 32-bit
values.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210122123943.1217-2-valentin.schneider@arm.com
|
||
|
|
fa50e2b452 |
sched/topology: Condition EAS enablement on FIE support
In order to make accurate predictions across CPUs and for all performance states, Energy Aware Scheduling (EAS) needs frequency-invariant load tracking signals. EAS task placement aims to minimize energy consumption, and does so in part by limiting the search space to only CPUs with the highest spare capacity (CPU capacity - CPU utilization) in their performance domain. Those candidates are the placement choices that will keep frequency at its lowest possible and therefore save the most energy. But without frequency invariance, a CPU's utilization is relative to the CPU's current performance level, and not relative to its maximum performance level, which determines its capacity. As a result, it will fail to correctly indicate any potential spare capacity obtained by an increase in a CPU's performance level. Therefore, a non-invariant utilization signal would render the EAS task placement logic invalid. Now that we properly report support for the Frequency Invariance Engine (FIE) through arch_scale_freq_invariant() for arm and arm64 systems, while also ensuring a re-evaluation of the EAS use conditions for possible invariance status change, we can assert this is the case when initializing EAS. Warn and bail out otherwise. Suggested-by: Quentin Perret <qperret@google.com> Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201027180713.7642-4-ionela.voinescu@arm.com |
||
|
|
31f6a8c0a4 |
sched/topology,schedutil: Wrap sched domains rebuild
Add the rebuild_sched_domains_energy() function to wrap the functionality that rebuilds the scheduling domains if any of the Energy Aware Scheduling (EAS) initialisation conditions change. This functionality is used when schedutil is added or removed or when EAS is enabled or disabled through the sched_energy_aware sysctl. Therefore, create a single function that is used in both these cases and that can be later reused. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Quentin Perret <qperret@google.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/20201027180713.7642-2-ionela.voinescu@arm.com |
||
|
|
b5b217346d |
sched/topology: Warn when NUMA diameter > 2
NUMA topologies where the shortest path between some two nodes requires
three or more hops (i.e. diameter > 2) end up being misrepresented in the
scheduler topology structures.
This is currently detected when booting a kernel with CONFIG_SCHED_DEBUG=y
+ sched_debug on the cmdline, although this will only yield a warning about
sched_group spans not matching sched_domain spans:
ERROR: groups don't span domain->span
Add an explicit warning for that case, triggered regardless of
CONFIG_SCHED_DEBUG, and decorate it with an appropriate comment.
The topology described in the comment can be booted up on QEMU by appending
the following to your usual QEMU incantation:
-smp cores=4 \
-numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
-numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
-numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
-numa dist,src=0,dst=3,val=40, -numa dist,src=1,dst=2,val=20, \
-numa dist,src=1,dst=3,val=30, -numa dist,src=2,dst=3,val=20
A somewhat more realistic topology (6-node mesh) with the same affliction
can be conjured with:
-smp cores=6 \
-numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
-numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
-numa node,cpus=4,nodeid=4, -numa node,cpus=5,nodeid=5, \
-numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
-numa dist,src=0,dst=3,val=40, -numa dist,src=0,dst=4,val=30, \
-numa dist,src=0,dst=5,val=20, \
-numa dist,src=1,dst=2,val=20, -numa dist,src=1,dst=3,val=30, \
-numa dist,src=1,dst=4,val=20, -numa dist,src=1,dst=5,val=30, \
-numa dist,src=2,dst=3,val=20, -numa dist,src=2,dst=4,val=30, \
-numa dist,src=2,dst=5,val=40, \
-numa dist,src=3,dst=4,val=20, -numa dist,src=3,dst=5,val=30, \
-numa dist,src=4,dst=5,val=20
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com
|
||
|
|
26762423a2 |
sched/deadline: Optimize sched_dl_global_validate()
Under CONFIG_SMP, dl_bw is per root domain, but not per CPU. When checking or updating dl_bw, currently iterating every CPU is overdoing, just need iterate each root domain once. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Peng Liu <iwtbavbm@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/78d21ee792cc48ff79e8cd62a5f26208463684d6.1602171061.git.iwtbavbm@gmail.com |
||
|
|
edaa5ddf38 |
Scheduler changes for v5.10:
- Reorganize & clean up the SD* flags definitions and add a bunch
of sanity checks. These new checks caught quite a few bugs or at
least inconsistencies, resulting in another set of patches.
- Rseq updates, add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
- Add a new tracepoint to improve CPU capacity tracking
- Improve overloaded SMP system load-balancing behavior
- Tweak SMT balancing
- Energy-aware scheduling updates
- NUMA balancing improvements
- Deadline scheduler fixes and improvements
- CPU isolation fixes
- Misc cleanups, simplifications and smaller optimizations.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+EWRERHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hV8A/7BB0nt/zYVZ8Z3Di8V0b9hMtr0d1xtRM5
ZAvg4hcZl/fVgobFndxBw6KdlK8lSce9Mcq+bTTWeD46CS13cK5Vrpiaf7x7Q00P
m8YHeYEH13ME0pbBrhDoRCR4XzfXukzjkUl7LiyrTekAvRUtFikJ/uKl8MeJtYGZ
gANEkadqforxUW0v45iUEGepmCWAl8hSlSMb2mDKsVhw4DFMD+px0EBmmA0VDqjE
e0rkh6dEoUVNqlic2KoaXULld1rLg1xiaOcLUbTAXnucfhmuv5p/H11AC4ABuf+s
7d0zLrLEfZrcLJkthYxfMHs7DYMtARiQM9Db/a5hAq9Af4Z2bvvVAaHt3gCGvkV1
llB6BB2yWCki9Qv7oiGOAhANnyJHG/cU4r6WwMuHdlYi4dFT/iN5qkOMUL1IrDgi
a6ZzvECChXBeisQXHSlMd8Y5O+j0gRvDR7E18z2q0/PlmO8PGJq4w34mEWveWIg3
LaVF16bmvaARuNFJTQH/zaHhjqVQANSMx5OIv9swp0OkwvQkw21ICYHG0YxfzWCr
oa/FESEpOL9XdYp8UwMPI0bmVIsEfx79pmDMF3zInYTpJpwMUhV2yjHE8uYVMqEf
7U8rZv7gdbZ2us38Gjf2l73hY+recp/GrgZKnk0R98OUeMk1l/iVP6dwco6ITUV5
czGmKlIB1ec=
=bXy6
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- reorganize & clean up the SD* flags definitions and add a bunch of
sanity checks. These new checks caught quite a few bugs or at least
inconsistencies, resulting in another set of patches.
- rseq updates, add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
- add a new tracepoint to improve CPU capacity tracking
- improve overloaded SMP system load-balancing behavior
- tweak SMT balancing
- energy-aware scheduling updates
- NUMA balancing improvements
- deadline scheduler fixes and improvements
- CPU isolation fixes
- misc cleanups, simplifications and smaller optimizations
* tag 'sched-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
sched/deadline: Unthrottle PI boosted threads while enqueuing
sched/debug: Add new tracepoint to track cpu_capacity
sched/fair: Tweak pick_next_entity()
rseq/selftests: Test MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
rseq/selftests,x86_64: Add rseq_offset_deref_addv()
rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
sched/fair: Use dst group while checking imbalance for NUMA balancer
sched/fair: Reduce busy load balance interval
sched/fair: Minimize concurrent LBs between domain level
sched/fair: Reduce minimal imbalance threshold
sched/fair: Relax constraint on task's load during load balance
sched/fair: Remove the force parameter of update_tg_load_avg()
sched/fair: Fix wrong cpu selecting from isolated domain
sched: Remove unused inline function uclamp_bucket_base_value()
sched/rt: Disable RT_RUNTIME_SHARE by default
sched/deadline: Fix stale throttling on de-/boosted tasks
sched/numa: Use runnable_avg to classify node
sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL
MAINTAINERS: Add myself as SCHED_DEADLINE reviewer
sched/topology: Move SD_DEGENERATE_GROUPS_MASK out of linux/sched/topology.h
...
|
||
|
|
6e7499135d |
sched/fair: Reduce busy load balance interval
The busy_factor, which increases load balance interval when a cpu is busy, is set to 32 by default. This value generates some huge LB interval on large system like the THX2 made of 2 node x 28 cores x 4 threads. For such system, the interval increases from 112ms to 3584ms at MC level. And from 228ms to 7168ms at NUMA level. Even on smaller system, a lower busy factor has shown improvement on the fair distribution of the running time so let reduce it for all. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/20200921072424.14813-5-vincent.guittot@linaro.org |
||
|
|
2208cdaa56 |
sched/fair: Reduce minimal imbalance threshold
The 25% default imbalance threshold for DIE and NUMA domain is large enough to generate significant unfairness between threads. A typical example is the case of 11 threads running on 2x4 CPUs. The imbalance of 20% between the 2 groups of 4 cores is just low enough to not trigger the load balance between the 2 groups. We will have always the same 6 threads on one group of 4 CPUs and the other 5 threads on the other group of CPUS. With a fair time sharing in each group, we ends up with +20% running time for the group of 5 threads. Consider decreasing the imbalance threshold for overloaded case where we use the load to balance task and to ensure fair time sharing. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Acked-by: Hillf Danton <hdanton@sina.com> Link: https://lkml.kernel.org/r/20200921072424.14813-3-vincent.guittot@linaro.org |
||
|
|
848785df48 |
sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL
The last sd_flag_debug shuffle inadvertently moved its definition within
an #ifdef CONFIG_SYSCTL region. While CONFIG_SYSCTL is indeed required to
produce the sched domain ctl interface (which uses sd_flag_debug to output
flag names), it isn't required to run any assertion on the sched_domain
hierarchy itself.
Move the definition of sd_flag_debug to a CONFIG_SCHED_DEBUG region of
topology.c.
Now at long last we have:
- sd_flag_debug declared in include/linux/sched/topology.h iff
CONFIG_SCHED_DEBUG=y
- sd_flag_debug defined in kernel/sched/topology.c, conditioned by:
- CONFIG_SCHED_DEBUG, with an explicit #ifdef block
- CONFIG_SMP, as a requirement to compile topology.c
With this change, all symbols pertaining to SD flag metadata (with the
exception of __SD_FLAG_CNT) are now defined exclusively within topology.c
Fixes:
|
||
|
|
4fc472f121 |
sched/topology: Move SD_DEGENERATE_GROUPS_MASK out of linux/sched/topology.h
SD_DEGENERATE_GROUPS_MASK is only useful for sched/topology.c, but still
gets defined for anyone who imports topology.h, leading to a flurry of
unused variable warnings.
Move it out of the header and place it next to the SD degeneration
functions in sched/topology.c.
Fixes:
|
||
|
|
df561f6688 |
treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> |
||
|
|
3a6712c768 |
sched/topology: Mark SD_PREFER_SIBLING as SDF_NEEDS_GROUPS
SD_PREFER_SIBLING is currently considered in sd_parent_degenerate() but not in sd_degenerate(). It too hinges on load balancing, and thus won't have any effect when set on a domain with a single group. Add it to SD_DEGENERATE_GROUPS_MASK. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20200817113003.20802-12-valentin.schneider@arm.com |
||
|
|
c200191d4c |
sched/topology: Propagate SD_ASYM_CPUCAPACITY upwards
We currently set this flag *only* on domains whose topology level exactly
match the level where we detect asymmetry (as returned by
asym_cpu_capacity_level()). This is rather problematic.
Say there are two clusters in the system, one with a lone big CPU and the
other with a mix of big and LITTLE CPUs (as is allowed by DynamIQ):
DIE [ ]
MC [ ][ ]
0 1 2 3 4
L L B B B
asym_cpu_capacity_level() will figure out that the MC level is the one
where all CPUs can see a CPU of max capacity, and we will thus set
SD_ASYM_CPUCAPACITY at MC level for all CPUs.
That lone big CPU will degenerate its MC domain, since it would be alone in
there, and will end up with just a DIE domain. Since the flag was only set
at MC, this CPU ends up not seeing any SD with the flag set, which is
broken.
Rather than clearing dflags at every topology level, clear it before
entering the topology level loop. This will properly propagate upwards
flags that are set starting from a certain level.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-11-valentin.schneider@arm.com
|
||
|
|
ab65afb094 |
sched/topology: Remove SD_SERIALIZE degeneration special case
If there is only a single NUMA node in the system, the only NUMA topology level that will be generated will be NODE (identity distance), which doesn't have SD_SERIALIZE. This means we don't need this special case in sd_parent_degenerate(), as having the NODE level "naturally" covers it. Thus, remove it. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20200817113003.20802-10-valentin.schneider@arm.com |
||
|
|
6f34981862 |
sched/topology: Use prebuilt SD flag degeneration mask
Leverage SD_DEGENERATE_GROUPS_MASK in sd_degenerate() and sd_parent_degenerate(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20200817113003.20802-9-valentin.schneider@arm.com |
||
|
|
65c5e25316 |
sched/topology: Verify SD_* flags setup when sched_debug is on
Now that we have some description of what we expect the flags layout to be, we can use that to assert at runtime that the actual layout is sane. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20200817113003.20802-6-valentin.schneider@arm.com |
||
|
|
cfe7ddcbd7 |
ARM, sched/topology: Remove SD_SHARE_POWERDOMAIN
This flag was introduced in 2014 by commit:
|
||
|
|
0408497800 |
Power management updates for 5.9-rc1
- Make the Energy Model cover non-CPU devices (Lukasz Luba).
- Add Ice Lake server idle states table to the intel_idle driver
and eliminate a redundant static variable from it (Chen Yu,
Rafael Wysocki).
- Eliminate all W=1 build warnings from cpufreq (Lee Jones).
- Add support for Sapphire Rapids and for Power Limit 4 to the
Intel RAPL power capping driver (Sumeet Pawnikar, Zhang Rui).
- Fix function name in kerneldoc comments in the idle_inject power
capping driver (Yangtao Li).
- Fix locking issues with cpufreq governors and drop a redundant
"weak" function definition from cpufreq (Viresh Kumar).
- Rearrange cpufreq to register non-modular governors at the
core_initcall level and allow the default cpufreq governor to
be specified in the kernel command line (Quentin Perret).
- Extend, fix and clean up the intel_pstate driver (Srinivas
Pandruvada, Rafael Wysocki):
* Add a new sysfs attribute for disabling/enabling CPU
energy-efficiency optimizations in the processor.
* Make the driver avoid enabling HWP if EPP is not supported.
* Allow the driver to handle numeric EPP values in the sysfs
interface and fix the setting of EPP via sysfs in the active
mode.
* Eliminate a static checker warning and clean up a kerneldoc
comment.
- Clean up some variable declarations in the powernv cpufreq
driver (Wei Yongjun).
- Fix up the ->enter_s2idle callback definition to cover the case
when it points to the same function as ->idle correctly (Neal
Liu).
- Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson).
- Make the PM core emit "changed" uevent when adding/removing the
"wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi).
- Add a helper macro for declaring PM callbacks and use it in the
MMC jz4740 driver (Paul Cercueil).
- Fix white space in some places in the hibernate code and make the
system-wide PM code use "const char *" where appropriate (Xiang
Chen, Alexey Dobriyan).
- Add one more "unsafe" helper macro to the freezer to cover the NFS
use case (He Zhe).
- Change the language in the generic PM domains framework to use
parent/child terminology and clean up a typo and some comment
fromatting in that code (Kees Cook, Geert Uytterhoeven).
- Update the operating performance points OPP framework (Lukasz
Luba, Andrew-sh.Cheng, Valdis Kletnieks):
* Refactor dev_pm_opp_of_register_em() and update related drivers.
* Add a missing function export.
* Allow disabled OPPs in dev_pm_opp_get_freq().
- Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric
Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier):
* Add support for delayed timers to the devfreq core and make the
Samsung exynos5422-dmc driver use it.
* Unify sysfs interface to use "df-" as a prefix in instance names
consistently.
* Fix devfreq_summary debugfs node indentation.
* Add the rockchip,pmu phandle to the rk3399_dmc driver DT
bindings.
* List Dmitry Osipenko as the Tegra devfreq driver maintainer.
* Fix typos in the core devfreq code.
- Update the pm-graph utility to version 5.7 including a number of
fixes related to suspend-to-idle (Todd Brandt).
- Fix coccicheck errors and warnings in the cpupower utility (Shuah
Khan).
- Replace HTTP links with HTTPs ones in multiple places (Alexander
A. Klimov).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl8oO24SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRx7ZQP/0lQ0yABnASnwomdOH6+K/m7rvc+e9FE
zx5pTDQswhU5tM7SQAIKqe0uSI+okF2UrBrT5onA16F+JUbnrbexJLazBPfVTTGF
AKpKEQ7Wh69Wz+Y6cQZjm1dTuRL+dlBJuBrzR2tLSnONPMMHuFcO3xd7lgE9UAxC
oGEf393taA6OqcUNRQIa2gqbq+k1qhKjeDucGkbOaoJ6CL0ZyWI+Tfw1WWaBBGv0
/2wBd6V513OH8WtQCW6H3YpHmhYW6OwL8w19KyGcjPRGJaeaIP4W/Ng7mkvgL5ZB
vZqg3XiufFV9uTe8W1NQaVv/NjlN256OteuK809aosTVjD0dhFkhBYg5TLu6HbQq
C/NciZ+78oLedWLT73EUfw3NyS+V0jk6X2EIlBUwNi0Qw1B1pCifGOCKzWFFe5cr
ci4xr4FG7dBkxScOxwFAU2s5TdPHLOkGkQtg4jZr0OYDrzkyLEdsnZEUjLPORo+0
6EBXGfTOSy2CBHcYswRtzJr/1pUTzj7oejhTAMCCuYW2r3VyQtnYcVjlehtp20if
6BfmGisk8nmtxlSm+/Y2FqKa4bNnSTMmr0UJQ+Rjp0tHs47QeucI0ORfZ5nPaBac
+ptvIjWmn3xejT/+oAehpH9066Iuy66vzHdnj7x5+WAsmYS8n8OFtlBFkYELmLJB
3xI5hIl7WtGo
=8cUO
-----END PGP SIGNATURE-----
Merge tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"The most significant change here is the extension of the Energy Model
to cover non-CPU devices (as well as CPUs) from Lukasz Luba.
There is also some new hardware support (Ice Lake server idle states
table for intel_idle, Sapphire Rapids and Power Limit 4 support in the
RAPL driver), some new functionality in the existing drivers (eg. a
new switch to disable/enable CPU energy-efficiency optimizations in
intel_pstate, delayed timers in devfreq), some assorted fixes (cpufreq
core, intel_pstate, intel_idle) and cleanups (eg. cpuidle-psci,
devfreq), including the elimination of W=1 build warnings from cpufreq
done by Lee Jones.
Specifics:
- Make the Energy Model cover non-CPU devices (Lukasz Luba).
- Add Ice Lake server idle states table to the intel_idle driver and
eliminate a redundant static variable from it (Chen Yu, Rafael
Wysocki).
- Eliminate all W=1 build warnings from cpufreq (Lee Jones).
- Add support for Sapphire Rapids and for Power Limit 4 to the Intel
RAPL power capping driver (Sumeet Pawnikar, Zhang Rui).
- Fix function name in kerneldoc comments in the idle_inject power
capping driver (Yangtao Li).
- Fix locking issues with cpufreq governors and drop a redundant
"weak" function definition from cpufreq (Viresh Kumar).
- Rearrange cpufreq to register non-modular governors at the
core_initcall level and allow the default cpufreq governor to be
specified in the kernel command line (Quentin Perret).
- Extend, fix and clean up the intel_pstate driver (Srinivas
Pandruvada, Rafael Wysocki):
* Add a new sysfs attribute for disabling/enabling CPU
energy-efficiency optimizations in the processor.
* Make the driver avoid enabling HWP if EPP is not supported.
* Allow the driver to handle numeric EPP values in the sysfs
interface and fix the setting of EPP via sysfs in the active
mode.
* Eliminate a static checker warning and clean up a kerneldoc
comment.
- Clean up some variable declarations in the powernv cpufreq driver
(Wei Yongjun).
- Fix up the ->enter_s2idle callback definition to cover the case
when it points to the same function as ->idle correctly (Neal Liu).
- Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson).
- Make the PM core emit "changed" uevent when adding/removing the
"wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi).
- Add a helper macro for declaring PM callbacks and use it in the MMC
jz4740 driver (Paul Cercueil).
- Fix white space in some places in the hibernate code and make the
system-wide PM code use "const char *" where appropriate (Xiang
Chen, Alexey Dobriyan).
- Add one more "unsafe" helper macro to the freezer to cover the NFS
use case (He Zhe).
- Change the language in the generic PM domains framework to use
parent/child terminology and clean up a typo and some comment
fromatting in that code (Kees Cook, Geert Uytterhoeven).
- Update the operating performance points OPP framework (Lukasz Luba,
Andrew-sh.Cheng, Valdis Kletnieks):
* Refactor dev_pm_opp_of_register_em() and update related drivers.
* Add a missing function export.
* Allow disabled OPPs in dev_pm_opp_get_freq().
- Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric
Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier):
* Add support for delayed timers to the devfreq core and make the
Samsung exynos5422-dmc driver use it.
* Unify sysfs interface to use "df-" as a prefix in instance
names consistently.
* Fix devfreq_summary debugfs node indentation.
* Add the rockchip,pmu phandle to the rk3399_dmc driver DT
bindings.
* List Dmitry Osipenko as the Tegra devfreq driver maintainer.
* Fix typos in the core devfreq code.
- Update the pm-graph utility to version 5.7 including a number of
fixes related to suspend-to-idle (Todd Brandt).
- Fix coccicheck errors and warnings in the cpupower utility (Shuah
Khan).
- Replace HTTP links with HTTPs ones in multiple places (Alexander A.
Klimov)"
* tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (71 commits)
cpuidle: ACPI: fix 'return' with no value build warning
cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode
cpufreq: intel_pstate: Rearrange the storing of new EPP values
intel_idle: Customize IceLake server support
PM / devfreq: Fix the wrong end with semicolon
PM / devfreq: Fix indentaion of devfreq_summary debugfs node
PM / devfreq: Clean up the devfreq instance name in sysfs attr
memory: samsung: exynos5422-dmc: Add module param to control IRQ mode
memory: samsung: exynos5422-dmc: Adjust polling interval and uptreshold
memory: samsung: exynos5422-dmc: Use delayed timer as default
PM / devfreq: Add support delayed timer for polling mode
dt-bindings: devfreq: rk3399_dmc: Add rockchip,pmu phandle
PM / devfreq: tegra: Add Dmitry as a maintainer
PM / devfreq: event: Fix trivial spelling
PM / devfreq: rk3399_dmc: Fix kernel oops when rockchip,pmu is absent
cpuidle: change enter_s2idle() prototype
cpuidle: psci: Prevent domain idlestates until consumers are ready
cpuidle: psci: Convert PM domain to platform driver
cpuidle: psci: Fix error path via converting to a platform driver
cpuidle: psci: Fail cpuidle registration if set OSI mode failed
...
|
||
|
|
521b512b15 |
PM / EM: change naming convention from 'capacity' to 'performance'
The Energy Model uses concept of performance domain and capacity states in order to calculate power used by CPUs. Change naming convention from capacity to performance state would enable wider usage in future, e.g. upcoming support for other devices other than CPUs. Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Quentin Perret <qperret@google.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
9b1b234bb8 |
sched: correct SD_flags returned by tl->sd_flags()
During sched domain init, we check whether non-topological SD_flags are
returned by tl->sd_flags(), if found, fire a waning and correct the
violation, but the code failed to correct the violation. Correct this.
Fixes:
|
||
|
|
cb8e59cc87 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:
1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
Augusto von Dentz.
2) Add GSO partial support to igc, from Sasha Neftin.
3) Several cleanups and improvements to r8169 from Heiner Kallweit.
4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
device self-test. From Andrew Lunn.
5) Start moving away from custom driver versions, use the globally
defined kernel version instead, from Leon Romanovsky.
6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.
7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.
8) Add sriov and vf support to hinic, from Luo bin.
9) Support Media Redundancy Protocol (MRP) in the bridging code, from
Horatiu Vultur.
10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.
11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
Dubroca. Also add ipv6 support for espintcp.
12) Lots of ReST conversions of the networking documentation, from Mauro
Carvalho Chehab.
13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
from Doug Berger.
14) Allow to dump cgroup id and filter by it in inet_diag code, from
Dmitry Yakunin.
15) Add infrastructure to export netlink attribute policies to
userspace, from Johannes Berg.
16) Several optimizations to sch_fq scheduler, from Eric Dumazet.
17) Fallback to the default qdisc if qdisc init fails because otherwise
a packet scheduler init failure will make a device inoperative. From
Jesper Dangaard Brouer.
18) Several RISCV bpf jit optimizations, from Luke Nelson.
19) Correct the return type of the ->ndo_start_xmit() method in several
drivers, it's netdev_tx_t but many drivers were using
'int'. From Yunjian Wang.
20) Add an ethtool interface for PHY master/slave config, from Oleksij
Rempel.
21) Add BPF iterators, from Yonghang Song.
22) Add cable test infrastructure, including ethool interfaces, from
Andrew Lunn. Marvell PHY driver is the first to support this
facility.
23) Remove zero-length arrays all over, from Gustavo A. R. Silva.
24) Calculate and maintain an explicit frame size in XDP, from Jesper
Dangaard Brouer.
25) Add CAP_BPF, from Alexei Starovoitov.
26) Support terse dumps in the packet scheduler, from Vlad Buslov.
27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.
28) Add devm_register_netdev(), from Bartosz Golaszewski.
29) Minimize qdisc resets, from Cong Wang.
30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
eliminate set_fs/get_fs calls. From Christoph Hellwig.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
selftests: net: ip_defrag: ignore EPERM
net_failover: fixed rollback in net_failover_open()
Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
vmxnet3: allow rx flow hash ops only when rss is enabled
hinic: add set_channels ethtool_ops support
selftests/bpf: Add a default $(CXX) value
tools/bpf: Don't use $(COMPILE.c)
bpf, selftests: Use bpf_probe_read_kernel
s390/bpf: Use bcr 0,%0 as tail call nop filler
s390/bpf: Maintain 8-byte stack alignment
selftests/bpf: Fix verifier test
selftests/bpf: Fix sample_cnt shared between two threads
bpf, selftests: Adapt cls_redirect to call csum_level helper
bpf: Add csum_level helper for fixing up csum levels
bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
crypto/chtls: IPv6 support for inline TLS
Crypto/chcr: Fixes a coccinile check error
Crypto/chcr: Fixes compilations warnings
...
|
||
|
|
36c5bdc438 |
sched/topology: Kill SD_LOAD_BALANCE
That flag is set unconditionally in sd_init(), and no one checks for it anymore. Remove it. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200415210512.805-5-valentin.schneider@arm.com |
||
|
|
e669ac8ab9 |
sched: Remove checks against SD_LOAD_BALANCE
The SD_LOAD_BALANCE flag is set unconditionally for all domains in sd_init(). By making the sched_domain->flags syctl interface read-only, we have removed the last piece of code that could clear that flag - as such, it will now be always present. Rather than to keep carrying it along, we can work towards getting rid of it entirely. cpusets don't need it because they can make CPUs be attached to the NULL domain (e.g. cpuset with sched_load_balance=0), or to a partitioned root_domain, i.e. a sched_domain hierarchy that doesn't span the entire system (e.g. root cpuset with sched_load_balance=0 and sibling cpusets with sched_load_balance=1). isolcpus apply the same "trick": isolated CPUs are explicitly taken out of the sched_domain rebuild (using housekeeping_cpumask()), so they get the NULL domain treatment as well. Remove the checks against SD_LOAD_BALANCE. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200415210512.805-4-valentin.schneider@arm.com |
||
|
|
32927393dc |
sysctl: pass kernel pointers to ->proc_handler
Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer. As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
38502ab4bf |
sched/topology: Don't enable EAS on SMT systems
EAS already requires asymmetric CPU capacities to be enabled, and mixing this with SMT is an aberration, but better be safe than sorry. Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Quentin Perret <qperret@google.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200227191433.31994-2-valentin.schneider@arm.com |