Merge tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - cpuset changes:

    - Continue separating v1 and v2 implementations by moving more
      v1-specific logic into cpuset-v1.c

    - Improve partition handling. Sibling partitions are no longer
      invalidated on cpuset.cpus conflict, cpuset.cpus changes no longer
      fail in v2, and effective_xcpus computation is made consistent

    - Fix partition effective CPUs overlap that caused a warning on
      cpuset removal when sibling partitions shared CPUs

 - Increase the maximum cgroup subsystem count from 16 to 32 to
   accommodate future subsystem additions

 - Misc cleanups and selftest improvements including switching to
   css_is_online() helper, removing dead code and stale documentation
   references, using lockdep_assert_cpuset_lock_held() consistently, and
   adding polling helpers for asynchronously updated cgroup statistics

* tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
  cpuset: fix overlap of partition effective CPUs
  cgroup: increase maximum subsystem count from 16 to 32
  cgroup: Remove stale cpu.rt.max reference from documentation
  cpuset: replace direct lockdep_assert_held() with lockdep_assert_cpuset_lock_held()
  cgroup/cpuset: Move the v1 empty cpus/mems check to cpuset1_validate_change()
  cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict
  cgroup/cpuset: Don't fail cpuset.cpus change in v2
  cgroup/cpuset: Consistently compute effective_xcpus in update_cpumasks_hier()
  cgroup/cpuset: Streamline rm_siblings_excl_cpus()
  cpuset: remove dead code in cpuset-v1.c
  cpuset: remove v1-specific code from generate_sched_domains
  cpuset: separate generate_sched_domains for v1 and v2
  cpuset: move update_domain_attr_tree to cpuset_v1.c
  cpuset: add cpuset1_init helper for v1 initialization
  cpuset: add cpuset1_online_css helper for v1-specific operations
  cpuset: add lockdep_assert_cpuset_lock_held helper
  cpuset: Remove unnecessary checks in rebuild_sched_domains_locked
  cgroup: switch to css_is_online() helper
  selftests: cgroup: Replace sleep with cg_read_key_long_poll() for waiting on nr_dying_descendants
  selftests: cgroup: make test_memcg_sock robust against delayed sock stats
  ...
Committed by Linus Torvalds on 2026-02-11 13:20:50 -08:00
20 changed files with 595 additions and 477 deletions


@@ -737,9 +737,6 @@ combinations are invalid and should be rejected. Also, if the
resource is mandatory for execution of processes, process migrations
may be rejected.
"cpu.rt.max" hard-allocates realtime slices and is an example of this
type.
Interface Files
===============
@@ -2561,10 +2558,10 @@ Cpuset Interface Files
Users can manually set it to a value that is different from
"cpuset.cpus". One constraint in setting it is that the list of
CPUs must be exclusive with respect to "cpuset.cpus.exclusive"
of its sibling. If "cpuset.cpus.exclusive" of a sibling cgroup
isn't set, its "cpuset.cpus" value, if set, cannot be a subset
of it to leave at least one CPU available when the exclusive
CPUs are taken away.
and "cpuset.cpus.exclusive.effective" of its siblings. Another
constraint is that it cannot be a superset of "cpuset.cpus"
of its sibling in order to leave at least one CPU available to
that sibling when the exclusive CPUs are taken away.
For a parent cgroup, any one of its exclusive CPUs can only
be distributed to at most one of its child cgroups. Having an
@@ -2584,9 +2581,9 @@ Cpuset Interface Files
of this file will always be a subset of its parent's
"cpuset.cpus.exclusive.effective" if its parent is not the root
cgroup. It will also be a subset of "cpuset.cpus.exclusive"
if it is set. If "cpuset.cpus.exclusive" is not set, it is
treated to have an implicit value of "cpuset.cpus" in the
formation of local partition.
if it is set. This file should only be non-empty if either
"cpuset.cpus.exclusive" is set or when the current cpuset is
a valid partition root.
cpuset.cpus.isolated
A read-only and root cgroup only multiple values file.
@@ -2618,13 +2615,22 @@ Cpuset Interface Files
There are two types of partitions - local and remote. A local
partition is one whose parent cgroup is also a valid partition
root. A remote partition is one whose parent cgroup is not a
valid partition root itself. Writing to "cpuset.cpus.exclusive"
is optional for the creation of a local partition as its
"cpuset.cpus.exclusive" file will assume an implicit value that
is the same as "cpuset.cpus" if it is not set. Writing the
proper "cpuset.cpus.exclusive" values down the cgroup hierarchy
before the target partition root is mandatory for the creation
of a remote partition.
valid partition root itself.
Writing to "cpuset.cpus.exclusive" is optional for the creation
of a local partition as its "cpuset.cpus.exclusive" file will
assume an implicit value that is the same as "cpuset.cpus" if it
is not set. Writing the proper "cpuset.cpus.exclusive" values
down the cgroup hierarchy before the target partition root is
mandatory for the creation of a remote partition.
Not all the CPUs requested in "cpuset.cpus.exclusive" can be
used to form a new partition. Only those that were present
in its parent's "cpuset.cpus.exclusive.effective" control
file can be used. For partitions created without setting
"cpuset.cpus.exclusive", exclusive CPUs specified in sibling's
"cpuset.cpus.exclusive" or "cpuset.cpus.exclusive.effective"
also cannot be used.
Currently, a remote partition cannot be created under a local
partition. All the ancestors of a remote partition root except
@@ -2632,6 +2638,10 @@ Cpuset Interface Files
The root cgroup is always a partition root and its state cannot
be changed. All other non-root cgroups start out as "member".
Even though the "cpuset.cpus.exclusive*" and "cpuset.cpus"
control files are not present in the root cgroup, they are
implicitly the same as the "/sys/devices/system/cpu/possible"
sysfs file.
When set to "root", the current cgroup is the root of a new
partition or scheduling domain. The set of exclusive CPUs is


@@ -981,7 +981,7 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct folio *folio
css = mem_cgroup_css_from_folio(folio);
/* dead cgroups shouldn't contribute to inode ownership arbitration */
if (!(css->flags & CSS_ONLINE))
if (!css_is_online(css))
return;
id = css->id;


@@ -535,10 +535,10 @@ struct cgroup {
* one which may have more subsystems enabled. Controller knobs
* are made available iff it's enabled in ->subtree_control.
*/
u16 subtree_control;
u16 subtree_ss_mask;
u16 old_subtree_control;
u16 old_subtree_ss_mask;
u32 subtree_control;
u32 subtree_ss_mask;
u32 old_subtree_control;
u32 old_subtree_ss_mask;
/* Private pointers for each registered subsystem */
struct cgroup_subsys_state __rcu *subsys[CGROUP_SUBSYS_COUNT];


@@ -76,6 +76,7 @@ extern void inc_dl_tasks_cs(struct task_struct *task);
extern void dec_dl_tasks_cs(struct task_struct *task);
extern void cpuset_lock(void);
extern void cpuset_unlock(void);
extern void lockdep_assert_cpuset_lock_held(void);
extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
@@ -196,6 +197,7 @@ static inline void inc_dl_tasks_cs(struct task_struct *task) { }
static inline void dec_dl_tasks_cs(struct task_struct *task) { }
static inline void cpuset_lock(void) { }
static inline void cpuset_unlock(void) { }
static inline void lockdep_assert_cpuset_lock_held(void) { }
static inline void cpuset_cpus_allowed_locked(struct task_struct *p,
struct cpumask *mask)


@@ -893,7 +893,7 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
{
if (mem_cgroup_disabled())
return true;
return !!(memcg->css.flags & CSS_ONLINE);
return css_is_online(&memcg->css);
}
void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,


@@ -16,7 +16,7 @@ DECLARE_EVENT_CLASS(cgroup_root,
TP_STRUCT__entry(
__field( int, root )
__field( u16, ss_mask )
__field( u32, ss_mask )
__string( name, root->name )
),


@@ -52,7 +52,7 @@ struct cgroup_fs_context {
bool cpuset_clone_children;
bool none; /* User explicitly requested empty subsystem */
bool all_ss; /* Seen 'all' option */
u16 subsys_mask; /* Selected subsystems */
u32 subsys_mask; /* Selected subsystems */
char *name; /* Hierarchy name */
char *release_agent; /* Path for release notifications */
};
@@ -146,7 +146,7 @@ struct cgroup_mgctx {
struct cgroup_taskset tset;
/* subsystems affected by migration */
u16 ss_mask;
u32 ss_mask;
};
#define CGROUP_TASKSET_INIT(tset) \
@@ -235,8 +235,8 @@ int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
void cgroup_favor_dynmods(struct cgroup_root *root, bool favor);
void cgroup_free_root(struct cgroup_root *root);
void init_cgroup_root(struct cgroup_fs_context *ctx);
int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask);
int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
int cgroup_setup_root(struct cgroup_root *root, u32 ss_mask);
int rebind_subsystems(struct cgroup_root *dst_root, u32 ss_mask);
int cgroup_do_get_tree(struct fs_context *fc);
int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp);


@@ -28,7 +28,7 @@
#define CGROUP_PIDLIST_DESTROY_DELAY HZ
/* Controllers blocked by the commandline in v1 */
static u16 cgroup_no_v1_mask;
static u32 cgroup_no_v1_mask;
/* disable named v1 mounts */
static bool cgroup_no_v1_named;
@@ -1037,13 +1037,13 @@ int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param)
static int check_cgroupfs_options(struct fs_context *fc)
{
struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
u16 mask = U16_MAX;
u16 enabled = 0;
u32 mask = U32_MAX;
u32 enabled = 0;
struct cgroup_subsys *ss;
int i;
#ifdef CONFIG_CPUSETS
mask = ~((u16)1 << cpuset_cgrp_id);
mask = ~((u32)1 << cpuset_cgrp_id);
#endif
for_each_subsys(ss, i)
if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i) &&
@@ -1095,7 +1095,7 @@ int cgroup1_reconfigure(struct fs_context *fc)
struct kernfs_root *kf_root = kernfs_root_from_sb(fc->root->d_sb);
struct cgroup_root *root = cgroup_root_from_kf(kf_root);
int ret = 0;
u16 added_mask, removed_mask;
u32 added_mask, removed_mask;
cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
@@ -1343,7 +1343,7 @@ static int __init cgroup_no_v1(char *str)
continue;
if (!strcmp(token, "all")) {
cgroup_no_v1_mask = U16_MAX;
cgroup_no_v1_mask = U32_MAX;
continue;
}


@@ -203,13 +203,13 @@ EXPORT_SYMBOL_GPL(cgrp_dfl_root);
bool cgrp_dfl_visible;
/* some controllers are not supported in the default hierarchy */
static u16 cgrp_dfl_inhibit_ss_mask;
static u32 cgrp_dfl_inhibit_ss_mask;
/* some controllers are implicitly enabled on the default hierarchy */
static u16 cgrp_dfl_implicit_ss_mask;
static u32 cgrp_dfl_implicit_ss_mask;
/* some controllers can be threaded on the default hierarchy */
static u16 cgrp_dfl_threaded_ss_mask;
static u32 cgrp_dfl_threaded_ss_mask;
/* The list of hierarchy roots */
LIST_HEAD(cgroup_roots);
@@ -231,10 +231,10 @@ static u64 css_serial_nr_next = 1;
* These bitmasks identify subsystems with specific features to avoid
* having to do iterative checks repeatedly.
*/
static u16 have_fork_callback __read_mostly;
static u16 have_exit_callback __read_mostly;
static u16 have_release_callback __read_mostly;
static u16 have_canfork_callback __read_mostly;
static u32 have_fork_callback __read_mostly;
static u32 have_exit_callback __read_mostly;
static u32 have_release_callback __read_mostly;
static u32 have_canfork_callback __read_mostly;
static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
@@ -472,13 +472,13 @@ static bool cgroup_is_valid_domain(struct cgroup *cgrp)
}
/* subsystems visibly enabled on a cgroup */
static u16 cgroup_control(struct cgroup *cgrp)
static u32 cgroup_control(struct cgroup *cgrp)
{
struct cgroup *parent = cgroup_parent(cgrp);
u16 root_ss_mask = cgrp->root->subsys_mask;
u32 root_ss_mask = cgrp->root->subsys_mask;
if (parent) {
u16 ss_mask = parent->subtree_control;
u32 ss_mask = parent->subtree_control;
/* threaded cgroups can only have threaded controllers */
if (cgroup_is_threaded(cgrp))
@@ -493,12 +493,12 @@ static u16 cgroup_control(struct cgroup *cgrp)
}
/* subsystems enabled on a cgroup */
static u16 cgroup_ss_mask(struct cgroup *cgrp)
static u32 cgroup_ss_mask(struct cgroup *cgrp)
{
struct cgroup *parent = cgroup_parent(cgrp);
if (parent) {
u16 ss_mask = parent->subtree_ss_mask;
u32 ss_mask = parent->subtree_ss_mask;
/* threaded cgroups can only have threaded controllers */
if (cgroup_is_threaded(cgrp))
@@ -1633,9 +1633,9 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
* This function calculates which subsystems need to be enabled if
* @subtree_control is to be applied while restricted to @this_ss_mask.
*/
static u16 cgroup_calc_subtree_ss_mask(u16 subtree_control, u16 this_ss_mask)
static u32 cgroup_calc_subtree_ss_mask(u32 subtree_control, u32 this_ss_mask)
{
u16 cur_ss_mask = subtree_control;
u32 cur_ss_mask = subtree_control;
struct cgroup_subsys *ss;
int ssid;
@@ -1644,7 +1644,7 @@ static u16 cgroup_calc_subtree_ss_mask(u16 subtree_control, u16 this_ss_mask)
cur_ss_mask |= cgrp_dfl_implicit_ss_mask;
while (true) {
u16 new_ss_mask = cur_ss_mask;
u32 new_ss_mask = cur_ss_mask;
do_each_subsys_mask(ss, ssid, cur_ss_mask) {
new_ss_mask |= ss->depends_on;
@@ -1848,12 +1848,12 @@ err:
return ret;
}
int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
int rebind_subsystems(struct cgroup_root *dst_root, u32 ss_mask)
{
struct cgroup *dcgrp = &dst_root->cgrp;
struct cgroup_subsys *ss;
int ssid, ret;
u16 dfl_disable_ss_mask = 0;
u32 dfl_disable_ss_mask = 0;
lockdep_assert_held(&cgroup_mutex);
@@ -2149,7 +2149,7 @@ void init_cgroup_root(struct cgroup_fs_context *ctx)
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
}
int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
int cgroup_setup_root(struct cgroup_root *root, u32 ss_mask)
{
LIST_HEAD(tmp_links);
struct cgroup *root_cgrp = &root->cgrp;
@@ -3131,7 +3131,7 @@ void cgroup_procs_write_finish(struct task_struct *task,
put_task_struct(task);
}
static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)
static void cgroup_print_ss_mask(struct seq_file *seq, u32 ss_mask)
{
struct cgroup_subsys *ss;
bool printed = false;
@@ -3496,9 +3496,9 @@ static void cgroup_finalize_control(struct cgroup *cgrp, int ret)
cgroup_apply_control_disable(cgrp);
}
static int cgroup_vet_subtree_control_enable(struct cgroup *cgrp, u16 enable)
static int cgroup_vet_subtree_control_enable(struct cgroup *cgrp, u32 enable)
{
u16 domain_enable = enable & ~cgrp_dfl_threaded_ss_mask;
u32 domain_enable = enable & ~cgrp_dfl_threaded_ss_mask;
/* if nothing is getting enabled, nothing to worry about */
if (!enable)
@@ -3541,7 +3541,7 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
u16 enable = 0, disable = 0;
u32 enable = 0, disable = 0;
struct cgroup *cgrp, *child;
struct cgroup_subsys *ss;
char *tok;
@@ -4945,7 +4945,7 @@ bool css_has_online_children(struct cgroup_subsys_state *css)
rcu_read_lock();
css_for_each_child(child, css) {
if (child->flags & CSS_ONLINE) {
if (css_is_online(child)) {
ret = true;
break;
}
@@ -5750,7 +5750,7 @@ static void offline_css(struct cgroup_subsys_state *css)
lockdep_assert_held(&cgroup_mutex);
if (!(css->flags & CSS_ONLINE))
if (!css_is_online(css))
return;
if (ss->css_offline)
@@ -6347,7 +6347,7 @@ int __init cgroup_init(void)
struct cgroup_subsys *ss;
int ssid;
BUILD_BUG_ON(CGROUP_SUBSYS_COUNT > 16);
BUILD_BUG_ON(CGROUP_SUBSYS_COUNT > 32);
BUG_ON(cgroup_init_cftypes(NULL, cgroup_base_files));
BUG_ON(cgroup_init_cftypes(NULL, cgroup_psi_files));
BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));


@@ -9,6 +9,7 @@
#include <linux/cpuset.h>
#include <linux/spinlock.h>
#include <linux/union_find.h>
#include <linux/sched/isolation.h>
/* See "Frequency meter" comments, below. */
@@ -144,17 +145,12 @@ struct cpuset {
*/
nodemask_t old_mems_allowed;
struct fmeter fmeter; /* memory_pressure filter */
/*
* Tasks are being attached to this cpuset. Used to prevent
* zeroing cpus/mems_allowed between ->can_attach() and ->attach().
*/
int attach_in_progress;
/* for custom sched domain */
int relax_domain_level;
/* partition root state */
int partition_root_state;
@@ -179,10 +175,19 @@ struct cpuset {
/* Handle for cpuset.cpus.partition */
struct cgroup_file partition_file;
#ifdef CONFIG_CPUSETS_V1
struct fmeter fmeter; /* memory_pressure filter */
/* for custom sched domain */
int relax_domain_level;
/* Used to merge intersecting subsets for generate_sched_domains */
struct uf_node node;
#endif
};
extern struct cpuset top_cpuset;
static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
{
return css ? container_of(css, struct cpuset, css) : NULL;
@@ -240,6 +245,30 @@ static inline int is_spread_slab(const struct cpuset *cs)
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}
/*
* Helper routine for generate_sched_domains().
* Do cpusets a, b have overlapping effective cpus_allowed masks?
*/
static inline int cpusets_overlap(struct cpuset *a, struct cpuset *b)
{
return cpumask_intersects(a->effective_cpus, b->effective_cpus);
}
static inline int nr_cpusets(void)
{
/* jump label reference count + the top-level cpuset */
return static_key_count(&cpusets_enabled_key.key) + 1;
}
static inline bool cpuset_is_populated(struct cpuset *cs)
{
lockdep_assert_cpuset_lock_held();
/* Cpusets in the process of attaching should be considered as populated */
return cgroup_is_populated(cs->css.cgroup) ||
cs->attach_in_progress;
}
/**
* cpuset_for_each_child - traverse online children of a cpuset
* @child_cs: loop cursor pointing to the current child
@@ -285,7 +314,6 @@ void cpuset_full_unlock(void);
*/
#ifdef CONFIG_CPUSETS_V1
extern struct cftype cpuset1_files[];
void fmeter_init(struct fmeter *fmp);
void cpuset1_update_task_spread_flags(struct cpuset *cs,
struct task_struct *tsk);
void cpuset1_update_tasks_flags(struct cpuset *cs);
@@ -293,8 +321,13 @@ void cpuset1_hotplug_update_tasks(struct cpuset *cs,
struct cpumask *new_cpus, nodemask_t *new_mems,
bool cpus_updated, bool mems_updated);
int cpuset1_validate_change(struct cpuset *cur, struct cpuset *trial);
bool cpuset1_cpus_excl_conflict(struct cpuset *cs1, struct cpuset *cs2);
void cpuset1_init(struct cpuset *cs);
void cpuset1_online_css(struct cgroup_subsys_state *css);
int cpuset1_generate_sched_domains(cpumask_var_t **domains,
struct sched_domain_attr **attributes);
#else
static inline void fmeter_init(struct fmeter *fmp) {}
static inline void cpuset1_update_task_spread_flags(struct cpuset *cs,
struct task_struct *tsk) {}
static inline void cpuset1_update_tasks_flags(struct cpuset *cs) {}
@@ -303,6 +336,13 @@ static inline void cpuset1_hotplug_update_tasks(struct cpuset *cs,
bool cpus_updated, bool mems_updated) {}
static inline int cpuset1_validate_change(struct cpuset *cur,
struct cpuset *trial) { return 0; }
static inline bool cpuset1_cpus_excl_conflict(struct cpuset *cs1,
struct cpuset *cs2) { return false; }
static inline void cpuset1_init(struct cpuset *cs) {}
static inline void cpuset1_online_css(struct cgroup_subsys_state *css) {}
static inline int cpuset1_generate_sched_domains(cpumask_var_t **domains,
struct sched_domain_attr **attributes) { return 0; };
#endif /* CONFIG_CPUSETS_V1 */
#endif /* __CPUSET_INTERNAL_H */


@@ -62,7 +62,7 @@ struct cpuset_remove_tasks_struct {
#define FM_SCALE 1000 /* faux fixed point scale */
/* Initialize a frequency meter */
void fmeter_init(struct fmeter *fmp)
static void fmeter_init(struct fmeter *fmp)
{
fmp->cnt = 0;
fmp->val = 0;
@@ -368,11 +368,44 @@ int cpuset1_validate_change(struct cpuset *cur, struct cpuset *trial)
if (par && !is_cpuset_subset(trial, par))
goto out;
/*
* Cpusets with tasks - existing or newly being attached - can't
* be changed to have empty cpus_allowed or mems_allowed.
*/
ret = -ENOSPC;
if (cpuset_is_populated(cur)) {
if (!cpumask_empty(cur->cpus_allowed) &&
cpumask_empty(trial->cpus_allowed))
goto out;
if (!nodes_empty(cur->mems_allowed) &&
nodes_empty(trial->mems_allowed))
goto out;
}
ret = 0;
out:
return ret;
}
/*
* cpuset1_cpus_excl_conflict() - Check if two cpusets have exclusive CPU conflicts
* according to legacy (v1) rules
* @cs1: first cpuset to check
* @cs2: second cpuset to check
*
* Returns: true if CPU exclusivity conflict exists, false otherwise
*
* If either cpuset is CPU exclusive, their allowed CPUs cannot intersect.
*/
bool cpuset1_cpus_excl_conflict(struct cpuset *cs1, struct cpuset *cs2)
{
if (is_cpu_exclusive(cs1) || is_cpu_exclusive(cs2))
return cpumask_intersects(cs1->cpus_allowed,
cs2->cpus_allowed);
return false;
}
#ifdef CONFIG_PROC_PID_CPUSET
/*
* proc_cpuset_show()
@@ -499,6 +532,242 @@ out_unlock:
return retval;
}
void cpuset1_init(struct cpuset *cs)
{
fmeter_init(&cs->fmeter);
cs->relax_domain_level = -1;
}
void cpuset1_online_css(struct cgroup_subsys_state *css)
{
struct cpuset *tmp_cs;
struct cgroup_subsys_state *pos_css;
struct cpuset *cs = css_cs(css);
struct cpuset *parent = parent_cs(cs);
lockdep_assert_cpus_held();
lockdep_assert_cpuset_lock_held();
if (is_spread_page(parent))
set_bit(CS_SPREAD_PAGE, &cs->flags);
if (is_spread_slab(parent))
set_bit(CS_SPREAD_SLAB, &cs->flags);
if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
return;
/*
* Clone @parent's configuration if CGRP_CPUSET_CLONE_CHILDREN is
* set. This flag handling is implemented in cgroup core for
* historical reasons - the flag may be specified during mount.
*
* Currently, if any sibling cpusets have exclusive cpus or mem, we
* refuse to clone the configuration - thereby refusing the task to
* be entered, and as a result refusing the sys_unshare() or
* clone() which initiated it. If this becomes a problem for some
* users who wish to allow that scenario, then this could be
* changed to grant parent->cpus_allowed-sibling_cpus_exclusive
* (and likewise for mems) to the new cgroup.
*/
rcu_read_lock();
cpuset_for_each_child(tmp_cs, pos_css, parent) {
if (is_mem_exclusive(tmp_cs) || is_cpu_exclusive(tmp_cs)) {
rcu_read_unlock();
return;
}
}
rcu_read_unlock();
cpuset_callback_lock_irq();
cs->mems_allowed = parent->mems_allowed;
cs->effective_mems = parent->mems_allowed;
cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
cpuset_callback_unlock_irq();
}
static void
update_domain_attr(struct sched_domain_attr *dattr, struct cpuset *c)
{
if (dattr->relax_domain_level < c->relax_domain_level)
dattr->relax_domain_level = c->relax_domain_level;
}
static void update_domain_attr_tree(struct sched_domain_attr *dattr,
struct cpuset *root_cs)
{
struct cpuset *cp;
struct cgroup_subsys_state *pos_css;
rcu_read_lock();
cpuset_for_each_descendant_pre(cp, pos_css, root_cs) {
/* skip the whole subtree if @cp doesn't have any CPU */
if (cpumask_empty(cp->cpus_allowed)) {
pos_css = css_rightmost_descendant(pos_css);
continue;
}
if (is_sched_load_balance(cp))
update_domain_attr(dattr, cp);
}
rcu_read_unlock();
}
/*
* cpuset1_generate_sched_domains()
*
* Finding the best partition (set of domains):
* The double nested loops below over i, j scan over the load
* balanced cpusets (using the array of cpuset pointers in csa[])
* looking for pairs of cpusets that have overlapping cpus_allowed
* and merging them using a union-find algorithm.
*
* The union of the cpus_allowed masks from the set of all cpusets
* having the same root then form the one element of the partition
* (one sched domain) to be passed to partition_sched_domains().
*/
int cpuset1_generate_sched_domains(cpumask_var_t **domains,
struct sched_domain_attr **attributes)
{
struct cpuset *cp; /* top-down scan of cpusets */
struct cpuset **csa; /* array of all cpuset ptrs */
int csn; /* how many cpuset ptrs in csa so far */
int i, j; /* indices for partition finding loops */
cpumask_var_t *doms; /* resulting partition; i.e. sched domains */
struct sched_domain_attr *dattr; /* attributes for custom domains */
int ndoms = 0; /* number of sched domains in result */
int nslot; /* next empty doms[] struct cpumask slot */
struct cgroup_subsys_state *pos_css;
int nslot_update;
lockdep_assert_cpuset_lock_held();
doms = NULL;
dattr = NULL;
csa = NULL;
/* Special case for the 99% of systems with one, full, sched domain */
if (is_sched_load_balance(&top_cpuset)) {
ndoms = 1;
doms = alloc_sched_domains(ndoms);
if (!doms)
goto done;
dattr = kmalloc(sizeof(struct sched_domain_attr), GFP_KERNEL);
if (dattr) {
*dattr = SD_ATTR_INIT;
update_domain_attr_tree(dattr, &top_cpuset);
}
cpumask_and(doms[0], top_cpuset.effective_cpus,
housekeeping_cpumask(HK_TYPE_DOMAIN));
goto done;
}
csa = kmalloc_array(nr_cpusets(), sizeof(cp), GFP_KERNEL);
if (!csa)
goto done;
csn = 0;
rcu_read_lock();
cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
if (cp == &top_cpuset)
continue;
/*
* Continue traversing beyond @cp iff @cp has some CPUs and
* isn't load balancing. The former is obvious. The
* latter: All child cpusets contain a subset of the
* parent's cpus, so just skip them, and then we call
* update_domain_attr_tree() to calc relax_domain_level of
* the corresponding sched domain.
*/
if (!cpumask_empty(cp->cpus_allowed) &&
!(is_sched_load_balance(cp) &&
cpumask_intersects(cp->cpus_allowed,
housekeeping_cpumask(HK_TYPE_DOMAIN))))
continue;
if (is_sched_load_balance(cp) &&
!cpumask_empty(cp->effective_cpus))
csa[csn++] = cp;
/* skip @cp's subtree */
pos_css = css_rightmost_descendant(pos_css);
continue;
}
rcu_read_unlock();
for (i = 0; i < csn; i++)
uf_node_init(&csa[i]->node);
/* Merge overlapping cpusets */
for (i = 0; i < csn; i++) {
for (j = i + 1; j < csn; j++) {
if (cpusets_overlap(csa[i], csa[j]))
uf_union(&csa[i]->node, &csa[j]->node);
}
}
/* Count the total number of domains */
for (i = 0; i < csn; i++) {
if (uf_find(&csa[i]->node) == &csa[i]->node)
ndoms++;
}
/*
* Now we know how many domains to create.
* Convert <csn, csa> to <ndoms, doms> and populate cpu masks.
*/
doms = alloc_sched_domains(ndoms);
if (!doms)
goto done;
/*
* The rest of the code, including the scheduler, can deal with
* dattr==NULL case. No need to abort if alloc fails.
*/
dattr = kmalloc_array(ndoms, sizeof(struct sched_domain_attr),
GFP_KERNEL);
for (nslot = 0, i = 0; i < csn; i++) {
nslot_update = 0;
for (j = i; j < csn; j++) {
if (uf_find(&csa[j]->node) == &csa[i]->node) {
struct cpumask *dp = doms[nslot];
if (i == j) {
nslot_update = 1;
cpumask_clear(dp);
if (dattr)
*(dattr + nslot) = SD_ATTR_INIT;
}
cpumask_or(dp, dp, csa[j]->effective_cpus);
cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN));
if (dattr)
update_domain_attr_tree(dattr + nslot, csa[j]);
}
}
if (nslot_update)
nslot++;
}
BUG_ON(nslot != ndoms);
done:
kfree(csa);
/*
* Fallback to the default domain if kmalloc() failed.
* See comments in partition_sched_domains().
*/
if (doms == NULL)
ndoms = 1;
*domains = doms;
*attributes = dattr;
return ndoms;
}
/*
* for the common functions, 'private' gives the type of file
*/


@@ -119,6 +119,17 @@ static bool force_sd_rebuild;
* For simplicity, a local partition can be created under a local or remote
* partition but a remote partition cannot have any partition root in its
* ancestor chain except the cgroup root.
*
* A valid partition can be formed by setting exclusive_cpus or cpus_allowed
* if exclusive_cpus is not set. In the case of partition with empty
* exclusive_cpus, all the conflicting exclusive CPUs specified in the
* following cpumasks of sibling cpusets will be removed from its
* cpus_allowed in determining its effective_xcpus.
* - effective_xcpus
* - exclusive_cpus
*
* The "cpuset.cpus.exclusive" control file should be used for setting up
* partition if the users want to get as many CPUs as possible.
*/
#define PRS_MEMBER 0
#define PRS_ROOT 1
@@ -201,12 +212,10 @@ static inline void notify_partition_change(struct cpuset *cs, int old_prs)
* If cpu_online_mask is used while a hotunplug operation is happening in
* parallel, we may leave an offline CPU in cpu_allowed or some other masks.
*/
static struct cpuset top_cpuset = {
struct cpuset top_cpuset = {
.flags = BIT(CS_CPU_EXCLUSIVE) |
BIT(CS_MEM_EXCLUSIVE) | BIT(CS_SCHED_LOAD_BALANCE),
.partition_root_state = PRS_ROOT,
.relax_domain_level = -1,
.remote_partition = false,
};
/*
@@ -261,6 +270,11 @@ void cpuset_unlock(void)
mutex_unlock(&cpuset_mutex);
}
void lockdep_assert_cpuset_lock_held(void)
{
lockdep_assert_held(&cpuset_mutex);
}
/**
* cpuset_full_lock - Acquire full protection for cpuset modification
*
@@ -319,7 +333,7 @@ static inline void check_insane_mems_config(nodemask_t *nodes)
*/
static inline void dec_attach_in_progress_locked(struct cpuset *cs)
{
lockdep_assert_held(&cpuset_mutex);
lockdep_assert_cpuset_lock_held();
cs->attach_in_progress--;
if (!cs->attach_in_progress)
@@ -353,15 +367,6 @@ static inline bool is_in_v2_mode(void)
(cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
}
static inline bool cpuset_is_populated(struct cpuset *cs)
{
lockdep_assert_held(&cpuset_mutex);
/* Cpusets in the process of attaching should be considered as populated */
return cgroup_is_populated(cs->css.cgroup) ||
cs->attach_in_progress;
}
/**
* partition_is_populated - check if partition has tasks
* @cs: partition root to be checked
@@ -603,36 +608,32 @@ static inline bool cpusets_are_exclusive(struct cpuset *cs1, struct cpuset *cs2)
/**
* cpus_excl_conflict - Check if two cpusets have exclusive CPU conflicts
* @cs1: first cpuset to check
* @cs2: second cpuset to check
* @trial: the trial cpuset to be checked
* @sibling: a sibling cpuset to be checked against
* @xcpus_changed: set if exclusive_cpus has been set
*
* Returns: true if CPU exclusivity conflict exists, false otherwise
*
* Conflict detection rules:
* 1. If either cpuset is CPU exclusive, they must be mutually exclusive
* 2. exclusive_cpus masks cannot intersect between cpusets
* 3. The allowed CPUs of one cpuset cannot be a subset of another's exclusive CPUs
* o cgroup v1
* See cpuset1_cpus_excl_conflict()
* o cgroup v2
* - The exclusive_cpus values cannot overlap.
* - New exclusive_cpus cannot be a superset of a sibling's cpus_allowed.
*/
static inline bool cpus_excl_conflict(struct cpuset *cs1, struct cpuset *cs2)
static inline bool cpus_excl_conflict(struct cpuset *trial, struct cpuset *sibling,
bool xcpus_changed)
{
/* If either cpuset is exclusive, check if they are mutually exclusive */
if (is_cpu_exclusive(cs1) || is_cpu_exclusive(cs2))
return !cpusets_are_exclusive(cs1, cs2);
if (!cpuset_v2())
return cpuset1_cpus_excl_conflict(trial, sibling);
/* The cpus_allowed of a sibling cpuset cannot be a subset of the new exclusive_cpus */
if (xcpus_changed && !cpumask_empty(sibling->cpus_allowed) &&
cpumask_subset(sibling->cpus_allowed, trial->exclusive_cpus))
return true;
/* Exclusive_cpus cannot intersect */
if (cpumask_intersects(cs1->exclusive_cpus, cs2->exclusive_cpus))
return true;
/* The cpus_allowed of one cpuset cannot be a subset of another cpuset's exclusive_cpus */
if (!cpumask_empty(cs1->cpus_allowed) &&
cpumask_subset(cs1->cpus_allowed, cs2->exclusive_cpus))
return true;
if (!cpumask_empty(cs2->cpus_allowed) &&
cpumask_subset(cs2->cpus_allowed, cs1->exclusive_cpus))
return true;
return false;
return cpumask_intersects(trial->exclusive_cpus, sibling->exclusive_cpus);
}
static inline bool mems_excl_conflict(struct cpuset *cs1, struct cpuset *cs2)
@@ -666,6 +667,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
{
struct cgroup_subsys_state *css;
struct cpuset *c, *par;
bool xcpus_changed;
int ret = 0;
rcu_read_lock();
@@ -681,20 +683,6 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
par = parent_cs(cur);
/*
* Cpusets with tasks - existing or newly being attached - can't
* be changed to have empty cpus_allowed or mems_allowed.
*/
ret = -ENOSPC;
if (cpuset_is_populated(cur)) {
if (!cpumask_empty(cur->cpus_allowed) &&
cpumask_empty(trial->cpus_allowed))
goto out;
if (!nodes_empty(cur->mems_allowed) &&
nodes_empty(trial->mems_allowed))
goto out;
}
/*
* We can't shrink if we won't have enough room for SCHED_DEADLINE
* tasks. This check is not done when scheduling is disabled as the
@@ -722,10 +710,11 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
* overlap. exclusive_cpus cannot overlap with each other if set.
*/
ret = -EINVAL;
xcpus_changed = !cpumask_equal(cur->exclusive_cpus, trial->exclusive_cpus);
cpuset_for_each_child(c, css, par) {
if (c == cur)
continue;
-if (cpus_excl_conflict(trial, c))
+if (cpus_excl_conflict(trial, c, xcpus_changed))
goto out;
if (mems_excl_conflict(trial, c))
goto out;
@@ -738,49 +727,6 @@ out:
}
#ifdef CONFIG_SMP
/*
* Helper routine for generate_sched_domains().
* Do cpusets a, b have overlapping effective cpus_allowed masks?
*/
static int cpusets_overlap(struct cpuset *a, struct cpuset *b)
{
return cpumask_intersects(a->effective_cpus, b->effective_cpus);
}
static void
update_domain_attr(struct sched_domain_attr *dattr, struct cpuset *c)
{
if (dattr->relax_domain_level < c->relax_domain_level)
dattr->relax_domain_level = c->relax_domain_level;
return;
}
static void update_domain_attr_tree(struct sched_domain_attr *dattr,
struct cpuset *root_cs)
{
struct cpuset *cp;
struct cgroup_subsys_state *pos_css;
rcu_read_lock();
cpuset_for_each_descendant_pre(cp, pos_css, root_cs) {
/* skip the whole subtree if @cp doesn't have any CPU */
if (cpumask_empty(cp->cpus_allowed)) {
pos_css = css_rightmost_descendant(pos_css);
continue;
}
if (is_sched_load_balance(cp))
update_domain_attr(dattr, cp);
}
rcu_read_unlock();
}
/* Must be called with cpuset_mutex held. */
static inline int nr_cpusets(void)
{
/* jump label reference count + the top-level cpuset */
return static_key_count(&cpusets_enabled_key.key) + 1;
}
/*
* generate_sched_domains()
@@ -820,103 +766,46 @@ static inline int nr_cpusets(void)
* convenient format, that can be easily compared to the prior
* value to determine what partition elements (sched domains)
* were changed (added or removed.)
*
* Finding the best partition (set of domains):
* The double nested loops below over i, j scan over the load
* balanced cpusets (using the array of cpuset pointers in csa[])
* looking for pairs of cpusets that have overlapping cpus_allowed
* and merging them using a union-find algorithm.
*
* The union of the cpus_allowed masks from the set of all cpusets
* having the same root then form the one element of the partition
* (one sched domain) to be passed to partition_sched_domains().
*
*/
static int generate_sched_domains(cpumask_var_t **domains,
struct sched_domain_attr **attributes)
{
struct cpuset *cp; /* top-down scan of cpusets */
struct cpuset **csa; /* array of all cpuset ptrs */
int csn; /* how many cpuset ptrs in csa so far */
int i, j; /* indices for partition finding loops */
cpumask_var_t *doms; /* resulting partition; i.e. sched domains */
struct sched_domain_attr *dattr; /* attributes for custom domains */
int ndoms = 0; /* number of sched domains in result */
int nslot; /* next empty doms[] struct cpumask slot */
struct cgroup_subsys_state *pos_css;
bool root_load_balance = is_sched_load_balance(&top_cpuset);
bool cgrpv2 = cpuset_v2();
int nslot_update;
if (!cpuset_v2())
return cpuset1_generate_sched_domains(domains, attributes);
doms = NULL;
dattr = NULL;
csa = NULL;
/* Special case for the 99% of systems with one, full, sched domain */
-if (root_load_balance && cpumask_empty(subpartitions_cpus)) {
-single_root_domain:
+if (cpumask_empty(subpartitions_cpus)) {
ndoms = 1;
-doms = alloc_sched_domains(ndoms);
-if (!doms)
-goto done;
-dattr = kmalloc(sizeof(struct sched_domain_attr), GFP_KERNEL);
-if (dattr) {
-*dattr = SD_ATTR_INIT;
-update_domain_attr_tree(dattr, &top_cpuset);
-}
-cpumask_and(doms[0], top_cpuset.effective_cpus,
-housekeeping_cpumask(HK_TYPE_DOMAIN));
-goto done;
+/* !csa will be checked and can be correctly handled */
+goto generate_doms;
}
csa = kmalloc_array(nr_cpusets(), sizeof(cp), GFP_KERNEL);
if (!csa)
goto done;
csn = 0;
/* Find how many partitions and cache them to csa[] */
rcu_read_lock();
if (root_load_balance)
csa[csn++] = &top_cpuset;
cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
if (cp == &top_cpuset)
continue;
if (cgrpv2)
goto v2;
/*
* v1:
* Continue traversing beyond @cp iff @cp has some CPUs and
* isn't load balancing. The former is obvious. The
* latter: All child cpusets contain a subset of the
* parent's cpus, so just skip them, and then we call
* update_domain_attr_tree() to calc relax_domain_level of
* the corresponding sched domain.
*/
if (!cpumask_empty(cp->cpus_allowed) &&
!(is_sched_load_balance(cp) &&
cpumask_intersects(cp->cpus_allowed,
housekeeping_cpumask(HK_TYPE_DOMAIN))))
continue;
if (is_sched_load_balance(cp) &&
!cpumask_empty(cp->effective_cpus))
csa[csn++] = cp;
/* skip @cp's subtree */
pos_css = css_rightmost_descendant(pos_css);
continue;
v2:
/*
* Only valid partition roots that are not isolated and with
-* non-empty effective_cpus will be saved into csn[].
+* non-empty effective_cpus will be saved into csa[].
*/
if ((cp->partition_root_state == PRS_ROOT) &&
!cpumask_empty(cp->effective_cpus))
-csa[csn++] = cp;
+csa[ndoms++] = cp;
/*
* Skip @cp's subtree if not a partition root and has no
@@ -927,40 +816,18 @@ v2:
}
rcu_read_unlock();
/*
* If there are only isolated partitions underneath the cgroup root,
* we can optimize out unneeded sched domains scanning.
*/
if (root_load_balance && (csn == 1))
goto single_root_domain;
for (i = 0; i < csn; i++)
uf_node_init(&csa[i]->node);
/* Merge overlapping cpusets */
-for (i = 0; i < csn; i++) {
-for (j = i + 1; j < csn; j++) {
-if (cpusets_overlap(csa[i], csa[j])) {
+for (i = 0; i < ndoms; i++) {
+for (j = i + 1; j < ndoms; j++) {
+if (cpusets_overlap(csa[i], csa[j]))
/*
* Cgroup v2 shouldn't pass down overlapping
* partition root cpusets.
*/
-WARN_ON_ONCE(cgrpv2);
-uf_union(&csa[i]->node, &csa[j]->node);
-}
+WARN_ON_ONCE(1);
}
}
/* Count the total number of domains */
for (i = 0; i < csn; i++) {
if (uf_find(&csa[i]->node) == &csa[i]->node)
ndoms++;
}
/*
* Now we know how many domains to create.
* Convert <csn, csa> to <ndoms, doms> and populate cpu masks.
*/
generate_doms:
doms = alloc_sched_domains(ndoms);
if (!doms)
goto done;
@@ -977,46 +844,20 @@ v2:
* to SD_ATTR_INIT. Also non-isolating partition root CPUs are a
* subset of HK_TYPE_DOMAIN housekeeping CPUs.
*/
-if (cgrpv2) {
-for (i = 0; i < ndoms; i++) {
-/*
-* The top cpuset may contain some boot time isolated
-* CPUs that need to be excluded from the sched domain.
-*/
-if (csa[i] == &top_cpuset)
-cpumask_and(doms[i], csa[i]->effective_cpus,
-housekeeping_cpumask(HK_TYPE_DOMAIN));
-else
-cpumask_copy(doms[i], csa[i]->effective_cpus);
-if (dattr)
-dattr[i] = SD_ATTR_INIT;
-}
-goto done;
+for (i = 0; i < ndoms; i++) {
+/*
+* The top cpuset may contain some boot time isolated
+* CPUs that need to be excluded from the sched domain.
+*/
+if (!csa || csa[i] == &top_cpuset)
+cpumask_and(doms[i], top_cpuset.effective_cpus,
+housekeeping_cpumask(HK_TYPE_DOMAIN));
+else
+cpumask_copy(doms[i], csa[i]->effective_cpus);
+if (dattr)
+dattr[i] = SD_ATTR_INIT;
+}
for (nslot = 0, i = 0; i < csn; i++) {
nslot_update = 0;
for (j = i; j < csn; j++) {
if (uf_find(&csa[j]->node) == &csa[i]->node) {
struct cpumask *dp = doms[nslot];
if (i == j) {
nslot_update = 1;
cpumask_clear(dp);
if (dattr)
*(dattr + nslot) = SD_ATTR_INIT;
}
cpumask_or(dp, dp, csa[j]->effective_cpus);
cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN));
if (dattr)
update_domain_attr_tree(dattr + nslot, csa[j]);
}
}
if (nslot_update)
nslot++;
}
BUG_ON(nslot != ndoms);
done:
kfree(csa);
@@ -1055,7 +896,7 @@ void dl_rebuild_rd_accounting(void)
int cpu;
u64 cookie = ++dl_cookie;
-lockdep_assert_held(&cpuset_mutex);
+lockdep_assert_cpuset_lock_held();
lockdep_assert_cpus_held();
lockdep_assert_held(&sched_domains_mutex);
@@ -1100,53 +941,33 @@ void dl_rebuild_rd_accounting(void)
*/
void rebuild_sched_domains_locked(void)
{
struct cgroup_subsys_state *pos_css;
struct sched_domain_attr *attr;
cpumask_var_t *doms;
struct cpuset *cs;
int ndoms;
int i;
lockdep_assert_cpus_held();
-lockdep_assert_held(&cpuset_mutex);
+lockdep_assert_cpuset_lock_held();
force_sd_rebuild = false;
/*
* If we have raced with CPU hotplug, return early to avoid
* passing doms with offlined cpu to partition_sched_domains().
* Anyways, cpuset_handle_hotplug() will rebuild sched domains.
*
* With no CPUs in any subpartitions, top_cpuset's effective CPUs
* should be the same as the active CPUs, so checking only top_cpuset
* is enough to detect racing CPU offlines.
*/
if (cpumask_empty(subpartitions_cpus) &&
!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
return;
/*
* With subpartition CPUs, however, the effective CPUs of a partition
* root should be only a subset of the active CPUs. Since a CPU in any
* partition root could be offlined, all must be checked.
*/
if (!cpumask_empty(subpartitions_cpus)) {
rcu_read_lock();
cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
if (!is_partition_valid(cs)) {
pos_css = css_rightmost_descendant(pos_css);
continue;
}
if (!cpumask_subset(cs->effective_cpus,
cpu_active_mask)) {
rcu_read_unlock();
return;
}
}
rcu_read_unlock();
}
/* Generate domain masks and attrs */
ndoms = generate_sched_domains(&doms, &attr);
/*
* cpuset_hotplug_workfn is invoked synchronously now, thus this
* function should not race with CPU hotplug. And the effective CPUs
* must not include any offline CPUs. Passing an offline CPU in the
* doms to partition_sched_domains() will trigger a kernel panic.
*
* We perform a final check here: if the doms contains any
* offline CPUs, a warning is emitted and we return directly to
* prevent the panic.
*/
for (i = 0; i < ndoms; ++i) {
if (WARN_ON_ONCE(!cpumask_subset(doms[i], cpu_active_mask)))
return;
}
/* Have scheduler rebuild the domains */
partition_sched_domains(ndoms, doms, attr);
}
@@ -1501,23 +1322,29 @@ static int rm_siblings_excl_cpus(struct cpuset *parent, struct cpuset *cs,
int retval = 0;
if (cpumask_empty(excpus))
-return retval;
+return 0;
/*
-* Exclude exclusive CPUs from siblings
+* Remove exclusive CPUs from siblings
*/
rcu_read_lock();
cpuset_for_each_child(sibling, css, parent) {
+struct cpumask *sibling_xcpus;
if (sibling == cs)
continue;
-if (cpumask_intersects(excpus, sibling->exclusive_cpus)) {
-cpumask_andnot(excpus, excpus, sibling->exclusive_cpus);
-retval++;
-continue;
-}
-if (cpumask_intersects(excpus, sibling->effective_xcpus)) {
-cpumask_andnot(excpus, excpus, sibling->effective_xcpus);
+/*
+* If exclusive_cpus is defined, effective_xcpus will always
+* be a subset. Otherwise, effective_xcpus will only be set
+* in a valid partition root.
+*/
+sibling_xcpus = cpumask_empty(sibling->exclusive_cpus)
+? sibling->effective_xcpus
+: sibling->exclusive_cpus;
+if (cpumask_intersects(excpus, sibling_xcpus)) {
+cpumask_andnot(excpus, excpus, sibling_xcpus);
retval++;
}
}
@@ -1806,7 +1633,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
int parent_prs = parent->partition_root_state;
bool nocpu;
-lockdep_assert_held(&cpuset_mutex);
+lockdep_assert_cpuset_lock_held();
WARN_ON_ONCE(is_remote_partition(cs)); /* For local partition only */
/*
@@ -2315,17 +2142,13 @@ get_css:
spin_lock_irq(&callback_lock);
cpumask_copy(cp->effective_cpus, tmp->new_cpus);
cp->partition_root_state = new_prs;
-if (!cpumask_empty(cp->exclusive_cpus) && (cp != cs))
-compute_excpus(cp, cp->effective_xcpus);
/*
-* Make sure effective_xcpus is properly set for a valid
-* partition root.
+* Need to compute effective_xcpus if either exclusive_cpus
+* is non-empty or it is a valid partition root.
*/
-if ((new_prs > 0) && cpumask_empty(cp->exclusive_cpus))
-cpumask_and(cp->effective_xcpus,
-cp->cpus_allowed, parent->effective_xcpus);
-else if (new_prs < 0)
+if ((new_prs > 0) || !cpumask_empty(cp->exclusive_cpus))
+compute_excpus(cp, cp->effective_xcpus);
+if (new_prs <= 0)
reset_partition_data(cp);
spin_unlock_irq(&callback_lock);
@@ -2378,7 +2201,7 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
struct cpuset *sibling;
struct cgroup_subsys_state *pos_css;
-lockdep_assert_held(&cpuset_mutex);
+lockdep_assert_cpuset_lock_held();
/*
* Check all its siblings and call update_cpumasks_hier()
@@ -2387,27 +2210,20 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
* It is possible a change in parent's effective_cpus
* due to a change in a child partition's effective_xcpus will impact
* its siblings even if they do not inherit parent's effective_cpus
-* directly.
+* directly. It should not impact valid partition.
*
* The update_cpumasks_hier() function may sleep. So we have to
* release the RCU read lock before calling it.
*/
rcu_read_lock();
cpuset_for_each_child(sibling, pos_css, parent) {
-if (sibling == cs)
+if (sibling == cs || is_partition_valid(sibling))
continue;
-if (!is_partition_valid(sibling)) {
-compute_effective_cpumask(tmp->new_cpus, sibling,
-parent);
-if (cpumask_equal(tmp->new_cpus, sibling->effective_cpus))
-continue;
-} else if (is_remote_partition(sibling)) {
-/*
-* Change in a sibling cpuset won't affect a remote
-* partition root.
-*/
+compute_effective_cpumask(tmp->new_cpus, sibling,
+parent);
+if (cpumask_equal(tmp->new_cpus, sibling->effective_cpus))
continue;
-}
if (!css_tryget_online(&sibling->css))
continue;
@@ -2463,43 +2279,6 @@ static enum prs_errcode validate_partition(struct cpuset *cs, struct cpuset *tri
return PERR_NONE;
}
static int cpus_allowed_validate_change(struct cpuset *cs, struct cpuset *trialcs,
struct tmpmasks *tmp)
{
int retval;
struct cpuset *parent = parent_cs(cs);
retval = validate_change(cs, trialcs);
if ((retval == -EINVAL) && cpuset_v2()) {
struct cgroup_subsys_state *css;
struct cpuset *cp;
/*
* The -EINVAL error code indicates that partition sibling
* CPU exclusivity rule has been violated. We still allow
* the cpumask change to proceed while invalidating the
* partition. However, any conflicting sibling partitions
* have to be marked as invalid too.
*/
trialcs->prs_err = PERR_NOTEXCL;
rcu_read_lock();
cpuset_for_each_child(cp, css, parent) {
struct cpumask *xcpus = user_xcpus(trialcs);
if (is_partition_valid(cp) &&
cpumask_intersects(xcpus, cp->effective_xcpus)) {
rcu_read_unlock();
update_parent_effective_cpumask(cp, partcmd_invalidate, NULL, tmp);
rcu_read_lock();
}
}
rcu_read_unlock();
retval = 0;
}
return retval;
}
/**
* partition_cpus_change - Handle partition state changes due to CPU mask updates
* @cs: The target cpuset being modified
@@ -2559,15 +2338,15 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
if (cpumask_equal(cs->cpus_allowed, trialcs->cpus_allowed))
return 0;
-if (alloc_tmpmasks(&tmp))
-return -ENOMEM;
compute_trialcs_excpus(trialcs, cs);
trialcs->prs_err = PERR_NONE;
-retval = cpus_allowed_validate_change(cs, trialcs, &tmp);
+retval = validate_change(cs, trialcs);
if (retval < 0)
-goto out_free;
+return retval;
+if (alloc_tmpmasks(&tmp))
+return -ENOMEM;
/*
* Check all the descendants in update_cpumasks_hier() if
@@ -2590,7 +2369,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
/* Update CS_SCHED_LOAD_BALANCE and/or sched_domains, if necessary */
if (cs->partition_root_state)
update_partition_sd_lb(cs, old_prs);
out_free:
free_tmpmasks(&tmp);
return retval;
}
@@ -3249,7 +3028,7 @@ static nodemask_t cpuset_attach_nodemask_to;
static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
{
-lockdep_assert_held(&cpuset_mutex);
+lockdep_assert_cpuset_lock_held();
if (cs != &top_cpuset)
guarantee_active_cpus(task, cpus_attach);
@@ -3605,8 +3384,7 @@ cpuset_css_alloc(struct cgroup_subsys_state *parent_css)
return ERR_PTR(-ENOMEM);
__set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
fmeter_init(&cs->fmeter);
-cs->relax_domain_level = -1;
+cpuset1_init(cs);
/* Set CS_MEMORY_MIGRATE for default hierarchy */
if (cpuset_v2())
@@ -3619,17 +3397,11 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
{
struct cpuset *cs = css_cs(css);
struct cpuset *parent = parent_cs(cs);
struct cpuset *tmp_cs;
struct cgroup_subsys_state *pos_css;
if (!parent)
return 0;
cpuset_full_lock();
if (is_spread_page(parent))
set_bit(CS_SPREAD_PAGE, &cs->flags);
if (is_spread_slab(parent))
set_bit(CS_SPREAD_SLAB, &cs->flags);
/*
* For v2, clear CS_SCHED_LOAD_BALANCE if parent is isolated
*/
@@ -3644,39 +3416,8 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
cs->effective_mems = parent->effective_mems;
}
spin_unlock_irq(&callback_lock);
cpuset1_online_css(css);
if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
goto out_unlock;
/*
* Clone @parent's configuration if CGRP_CPUSET_CLONE_CHILDREN is
* set. This flag handling is implemented in cgroup core for
* historical reasons - the flag may be specified during mount.
*
* Currently, if any sibling cpusets have exclusive cpus or mem, we
* refuse to clone the configuration - thereby refusing the task to
* be entered, and as a result refusing the sys_unshare() or
* clone() which initiated it. If this becomes a problem for some
* users who wish to allow that scenario, then this could be
* changed to grant parent->cpus_allowed-sibling_cpus_exclusive
* (and likewise for mems) to the new cgroup.
*/
rcu_read_lock();
cpuset_for_each_child(tmp_cs, pos_css, parent) {
if (is_mem_exclusive(tmp_cs) || is_cpu_exclusive(tmp_cs)) {
rcu_read_unlock();
goto out_unlock;
}
}
rcu_read_unlock();
spin_lock_irq(&callback_lock);
cs->mems_allowed = parent->mems_allowed;
cs->effective_mems = parent->mems_allowed;
cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
spin_unlock_irq(&callback_lock);
out_unlock:
cpuset_full_unlock();
return 0;
}
@@ -3876,7 +3617,7 @@ int __init cpuset_init(void)
cpumask_setall(top_cpuset.exclusive_cpus);
nodes_setall(top_cpuset.effective_mems);
fmeter_init(&top_cpuset.fmeter);
cpuset1_init(&top_cpuset);
BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
@@ -4210,7 +3951,7 @@ static void __cpuset_cpus_allowed_locked(struct task_struct *tsk, struct cpumask
*/
void cpuset_cpus_allowed_locked(struct task_struct *tsk, struct cpumask *pmask)
{
-lockdep_assert_held(&cpuset_mutex);
+lockdep_assert_cpuset_lock_held();
__cpuset_cpus_allowed_locked(tsk, pmask);
}


@@ -230,7 +230,7 @@ static int cgroup_subsys_states_read(struct seq_file *seq, void *v)
}
static void cgroup_masks_read_one(struct seq_file *seq, const char *name,
-u16 mask)
+u32 mask)
{
struct cgroup_subsys *ss;
int ssid;


@@ -283,7 +283,7 @@ ino_t page_cgroup_ino(struct page *page)
/* page_folio() is racy here, but the entire function is racy anyway */
memcg = folio_memcg_check(page_folio(page));
-while (memcg && !(memcg->css.flags & CSS_ONLINE))
+while (memcg && !css_is_online(&memcg->css))
memcg = parent_mem_cgroup(memcg);
if (memcg)
ino = cgroup_ino(memcg->css.cgroup);


@@ -530,7 +530,7 @@ static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
if (!memcg)
goto out_unlock;
-online = (memcg->css.flags & CSS_ONLINE);
+online = css_is_online(&memcg->css);
cgroup_name(memcg->css.cgroup, name, sizeof(name));
ret += scnprintf(kbuf + ret, count - ret,
"Charged %sto %smemcg %s\n",


@@ -168,6 +168,27 @@ long cg_read_key_long(const char *cgroup, const char *control, const char *key)
return atol(ptr + strlen(key));
}
long cg_read_key_long_poll(const char *cgroup, const char *control,
const char *key, long expected, int retries,
useconds_t wait_interval_us)
{
long val = -1;
int i;
for (i = 0; i < retries; i++) {
val = cg_read_key_long(cgroup, control, key);
if (val < 0)
return val;
if (val == expected)
break;
usleep(wait_interval_us);
}
return val;
}
long cg_read_lc(const char *cgroup, const char *control)
{
char buf[PAGE_SIZE];


@@ -17,6 +17,8 @@
#define CG_NAMED_NAME "selftest"
#define CG_PATH_FORMAT (!cg_test_v1_named ? "0::%s" : (":name=" CG_NAMED_NAME ":%s"))
#define DEFAULT_WAIT_INTERVAL_US (100 * 1000) /* 100 ms */
/*
* Checks if two given values differ by less than err% of their sum.
*/
@@ -64,6 +66,9 @@ extern int cg_read_strstr(const char *cgroup, const char *control,
extern long cg_read_long(const char *cgroup, const char *control);
extern long cg_read_long_fd(int fd);
long cg_read_key_long(const char *cgroup, const char *control, const char *key);
long cg_read_key_long_poll(const char *cgroup, const char *control,
const char *key, long expected, int retries,
useconds_t wait_interval_us);
extern long cg_read_lc(const char *cgroup, const char *control);
extern int cg_write(const char *cgroup, const char *control, char *buf);
extern int cg_open(const char *cgroup, const char *control, int flags);


@@ -269,7 +269,7 @@ TEST_MATRIX=(
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X3:P2 . . 0 A1:0-2|A2:3|A3:3 A1:P0|A2:P2 3"
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2 . 0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:C3 . 0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
" C0-3:S+ C1-3:S+ C2-3 C2-3 . . . P2 0 A1:0-3|A2:1-3|A3:2-3|B1:2-3 A1:P0|A3:P0|B1:P-2"
" C0-3:S+ C1-3:S+ C2-3 C2-3 . . . P2 0 A1:0-1|A2:1|A3:1|B1:2-3 A1:P0|A3:P0|B1:P2"
" C0-3:S+ C1-3:S+ C2-3 C4-5 . . . P2 0 B1:4-5 B1:P2 4-5"
" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3 X2-3:P2 P2 0 A3:2-3|B1:4 A3:P2|B1:P2 2-4"
" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3 X2-3:P2:C1-3 P2 0 A3:2-3|B1:4 A3:P2|B1:P2 2-4"
@@ -318,7 +318,7 @@ TEST_MATRIX=(
# Invalid to valid local partition direct transition tests
" C1-3:S+:P2 X4:P2 . . . . . . 0 A1:1-3|XA1:1-3|A2:1-3:XA2: A1:P2|A2:P-2 1-3"
" C1-3:S+:P2 X4:P2 . . . X3:P2 . . 0 A1:1-2|XA1:1-3|A2:3:XA2:3 A1:P2|A2:P2 1-3"
" C0-3:P2 . . C4-6 C0-4 . . . 0 A1:0-4|B1:4-6 A1:P-2|B1:P0"
" C0-3:P2 . . C4-6 C0-4 . . . 0 A1:0-4|B1:5-6 A1:P2|B1:P0"
" C0-3:P2 . . C4-6 C0-4:C0-3 . . . 0 A1:0-3|B1:4-6 A1:P2|B1:P0 0-3"
# Local partition invalidation tests
@@ -388,10 +388,10 @@ TEST_MATRIX=(
" C0-1:S+ C1 . C2-3 . P2 . . 0 A1:0-1|A2:1 A1:P0|A2:P-2"
" C0-1:S+ C1:P2 . C2-3 P1 . . . 0 A1:0|A2:1 A1:P1|A2:P2 0-1|1"
-# A non-exclusive cpuset.cpus change will invalidate partition and its siblings
-" C0-1:P1 . . C2-3 C0-2 . . . 0 A1:0-2|B1:2-3 A1:P-1|B1:P0"
-" C0-1:P1 . . P1:C2-3 C0-2 . . . 0 A1:0-2|B1:2-3 A1:P-1|B1:P-1"
-" C0-1 . . P1:C2-3 C0-2 . . . 0 A1:0-2|B1:2-3 A1:P0|B1:P-1"
+# A non-exclusive cpuset.cpus change will not invalidate its siblings partition.
+" C0-1:P1 . . C2-3 C0-2 . . . 0 A1:0-2|B1:3 A1:P1|B1:P0"
+" C0-1:P1 . . P1:C2-3 C0-2 . . . 0 A1:0-1|XA1:0-1|B1:2-3 A1:P1|B1:P1"
+" C0-1 . . P1:C2-3 C0-2 . . . 0 A1:0-1|B1:2-3 A1:P0|B1:P1"
# cpuset.cpus can overlap with sibling cpuset.cpus.exclusive but not subsumed by it
" C0-3 . . C4-5 X5 . . . 0 A1:0-3|B1:4-5"
@@ -417,6 +417,17 @@ TEST_MATRIX=(
" CX1-4:S+ CX2-4:P2 . C5-6 . . . P1:C3-6 0 A1:1|A2:2-4|B1:5-6 \
A1:P0|A2:P2:B1:P-1 2-4"
# When multiple partitions with conflicting cpuset.cpus are created, the
# latter created ones will only get what are left of the available exclusive
# CPUs.
" C1-3:P1 . . . . . . C3-5:P1 0 A1:1-3|B1:4-5:XB1:4-5 A1:P1|B1:P1"
# cpuset.cpus can be set to a subset of sibling's cpuset.cpus.exclusive
" C1-3:X1-3 . . C4-5 . . . C1-2 0 A1:1-3|B1:1-2"
# cpuset.cpus can become empty with task in it as it inherits parent's effective CPUs
" C1-3:S+ C2 . . . T:C . . 0 A1:1-3|A2:1-3"
# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
# ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
# Failure cases:
@@ -427,7 +438,7 @@ TEST_MATRIX=(
# Changes to cpuset.cpus.exclusive that violate exclusivity rule is rejected
" C0-3 . . C4-5 X0-3 . . X3-5 1 A1:0-3|B1:4-5"
-# cpuset.cpus cannot be a subset of sibling cpuset.cpus.exclusive
+# cpuset.cpus.exclusive cannot be set to a superset of sibling's cpuset.cpus
" C0-3 . . C4-5 X3-5 . . . 1 A1:0-3|B1:4-5"
)
@@ -477,6 +488,10 @@ REMOTE_TEST_MATRIX=(
. . X1-2:P2 X4-5:P1 . X1-7:P2 p1:3|c11:1-2|c12:4:c22:5-6 \
p1:P0|p2:P1|c11:P2|c12:P1|c22:P2 \
1-2,4-6|1-2,5-6"
# c12 whose cpuset.cpus CPUs are all granted to c11 will become invalid partition
" C1-5:P1:S+ . C1-4:P1 C2-3 . . \
. . . P1 . . p1:5|c11:1-4|c12:5 \
p1:P1|c11:P1|c12:P-1"
)
#


@@ -26,6 +26,7 @@
*/
#define MAX_VMSTAT_ERROR (4096 * 64 * get_nprocs())
#define KMEM_DEAD_WAIT_RETRIES 80
static int alloc_dcache(const char *cgroup, void *arg)
{
@@ -306,9 +307,7 @@ static int test_kmem_dead_cgroups(const char *root)
{
int ret = KSFT_FAIL;
char *parent;
-long dead;
-int i;
-int max_time = 20;
+long dead = -1;
parent = cg_name(root, "kmem_dead_cgroups_test");
if (!parent)
@@ -323,21 +322,19 @@ static int test_kmem_dead_cgroups(const char *root)
if (cg_run_in_subcgroups(parent, alloc_dcache, (void *)100, 30))
goto cleanup;
-for (i = 0; i < max_time; i++) {
-dead = cg_read_key_long(parent, "cgroup.stat",
-"nr_dying_descendants ");
-if (dead == 0) {
-ret = KSFT_PASS;
-break;
-}
-/*
-* Reclaiming cgroups might take some time,
-* let's wait a bit and repeat.
-*/
-sleep(1);
-if (i > 5)
-printf("Waiting time longer than 5s; wait: %ds (dead: %ld)\n", i, dead);
-}
+/*
+* Allow up to ~8s for reclaim of dying descendants to complete.
+* This is a generous upper bound derived from stress testing, not
+* from a specific kernel constant, and can be adjusted if reclaim
+* behavior changes in the future.
+*/
+dead = cg_read_key_long_poll(parent, "cgroup.stat",
+"nr_dying_descendants ", 0, KMEM_DEAD_WAIT_RETRIES,
+DEFAULT_WAIT_INTERVAL_US);
+if (dead)
+goto cleanup;
+ret = KSFT_PASS;
cleanup:
cg_destroy(parent);


@@ -21,6 +21,8 @@
#include "kselftest.h"
#include "cgroup_util.h"
#define MEMCG_SOCKSTAT_WAIT_RETRIES 30
static bool has_localevents;
static bool has_recursiveprot;
@@ -1384,6 +1386,7 @@ static int test_memcg_sock(const char *root)
int bind_retries = 5, ret = KSFT_FAIL, pid, err;
unsigned short port;
char *memcg;
long sock_post = -1;
memcg = cg_name(root, "memcg_test");
if (!memcg)
@@ -1432,7 +1435,22 @@ static int test_memcg_sock(const char *root)
if (cg_read_long(memcg, "memory.current") < 0)
goto cleanup;
if (cg_read_key_long(memcg, "memory.stat", "sock "))
/*
* memory.stat is updated asynchronously via the memcg rstat
* flushing worker, which runs periodically (every 2 seconds,
* see FLUSH_TIME). On a busy system, the "sock " counter may
* stay non-zero for a short period of time after the TCP
* connection is closed and all socket memory has been
* uncharged.
*
* Poll memory.stat for up to 3 seconds (~FLUSH_TIME plus some
* scheduling slack) and require that the "sock " counter
* eventually drops to zero.
*/
sock_post = cg_read_key_long_poll(memcg, "memory.stat", "sock ", 0,
MEMCG_SOCKSTAT_WAIT_RETRIES,
DEFAULT_WAIT_INTERVAL_US);
if (sock_post)
goto cleanup;
ret = KSFT_PASS;