layered cpuset support #1747
Conversation
Force-pushed 612f186 to 64f4056
So AFAICT the changes are:
- Machinery for forwarding cpuset information to the BPF side
- The corresponding BPF code
- Logic that replicates allow_node_aligned but for cpusets.
If this fixes cpuset-related problems I think it is reasonable, but I'm not sure about the naming - container enable is a bit confusing because containers aren't really a thing at this level of abstraction. Maybe replace "container" with "cpuset-based workloads"? This way it's clear what the code does concretely.
Force-pushed 495915d to 33cb330
Force-pushed 33cb330 to ed4c527
I still have trouble understanding how this is supposed to work. Node aligned is easier because DSQs are LLC aligned and LLCs are node aligned. We don't have the guarantee that cpusets are LLC aligned. Is that okay? If so, why? Can you please document the theory of operation?
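For concreteness, here is a hedged sketch (not code from the PR; llc_cpumask() and nr_llcs are assumed helpers/globals) of what "a cpuset is LLC aligned" would mean: every LLC's CPUs are either fully inside or fully outside the cpuset mask.

/*
 * Hypothetical illustration only: a cpuset mask is "LLC aligned" when no LLC
 * is partially covered, i.e. each LLC's CPUs are either all in the cpuset or
 * all outside it. llc_cpumask() and nr_llcs are assumed helpers/globals.
 */
static bool cpuset_is_llc_aligned(const struct cpumask *cpuset_mask)
{
	const struct cpumask *llc_mask;
	u32 llc_id;

	bpf_for(llc_id, 0, nr_llcs) {
		if (!(llc_mask = llc_cpumask(llc_id)))
			return false;

		/* partial overlap: the cpuset covers some but not all of this LLC */
		if (bpf_cpumask_intersects(llc_mask, cpuset_mask) &&
		    !bpf_cpumask_subset(llc_mask, cpuset_mask))
			return false;
	}

	return true;
}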
@@ -1381,9 +1399,11 @@ void BPF_STRUCT_OPS(layered_enqueue, struct task_struct *p, u64 enq_flags)
	 * without making the whole scheduler node aware and should only be used
	 * with open layers on non-saturated machines to avoid possible stalls.
	 */
Please update the comment to explain cpus_cpuset_aligned.
@@ -2658,7 +2678,7 @@ static void refresh_cpus_flags(struct task_ctx *taskc,

		if (!(nodec = lookup_node_ctx(node_id)) ||
		    !(node_cpumask = cast_mask(nodec->cpumask)))
-			return;
+			break;
This is an scx_bpf_error() condition. There's no point in continuing.
made this an explicit error instead of just a return, thanks.
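Presumably something along these lines (a sketch of the shape of the fix, not the actual diff):

		/* Sketch only: error out explicitly instead of silently bailing. */
		if (!(nodec = lookup_node_ctx(node_id)) ||
		    !(node_cpumask = cast_mask(nodec->cpumask))) {
			scx_bpf_error("refresh_cpus_flags: node %u lookup failed", node_id);
			return;
		}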
@@ -2667,6 +2687,21 @@ static void refresh_cpus_flags(struct task_ctx *taskc,
			break;
		}
	}
	if (enable_cpuset) {
Maybe a blank line above?
@@ -2667,6 +2687,21 @@ static void refresh_cpus_flags(struct task_ctx *taskc,
			break;
		}
	}
	if (enable_cpuset) {
		bpf_for(cpuset_id, 0, nr_cpusets) {
			struct cpumask_wrapper* wrapper;
Blank line.
I think I figured out the whitespace convention (before if, before for, sometimes after for), but LMK if there are still places it's wrong.
Hmmm... yeah, on machines with single-LLC nodes, this is fine, but if there are multiple LLCs per node, the current allow_node_aligned condition can lead to unexpected behaviors as tasks can end up in a DSQ that a layer doesn't have CPUs on. But going back to this PR, if cpusets aren't aligned with LLCs, wouldn't we have a similar problem?
> Hmmm... yeah, on machines with single-LLC nodes, this is fine, but if there are multiple LLCs per node, the current allow_node_aligned condition can lead to unexpected behaviors as tasks can end up in a DSQ that a layer doesn't have CPUs on.

Will fix that in a separate PR, thx.

> But going back to this PR, if cpusets aren't aligned with LLCs, wouldn't we have a similar problem?

Yeah, I was just able to produce that kind of behavior w/ my test setup, gonna have to add something to deal with that.
			}
			if (bpf_cpumask_equal(cast_mask(wrapper->mask), cpumask)) {
				taskc->cpus_cpuset_aligned = true;
				return;
Use break; here so that it's consistent with the node aligned block and the function can be extended in the future? Note that this would require moving the false setting. BTW, why not use the same partial overlapping test used by the node alignment test instead of an equality test? Is that not sufficient for the forward progress guarantee? If not, it'd probably be worthwhile to explain why.
Ahh, correct me if I'm wrong, but isn't that partial test wrong / shouldn't the node code do what the cpuset code does?

For example:

task_mask   = 1000
system_mask = 1100

system_mask intersect task_mask == true
system_mask subset task_mask    == false

IIUC, allow_node_aligned is only supposed to avoid the lo fallback when a task's cpumask matches a node cpumask, so that the lo fallback is used for tasks with single-CPU affinities (i.e. the ones that are stall prone)?
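To make that example concrete, here is a tiny stand-alone (non-BPF) sketch of the two tests being compared; the helper names are made up for illustration:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* a and b are CPU bitmasks; bit i set means CPU i is allowed. */
static bool intersects(uint64_t a, uint64_t b) { return (a & b) != 0; }
static bool is_subset(uint64_t a, uint64_t b)  { return (a & ~b) == 0; } /* a subset of b */

int main(void)
{
	uint64_t task_mask   = 0x8; /* 1000 */
	uint64_t system_mask = 0xc; /* 1100 */

	/* The partial-overlap test passes... */
	assert(intersects(system_mask, task_mask));
	/* ...even though the system mask is not contained in the task mask. */
	assert(!is_subset(system_mask, task_mask));
	return 0;
}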
	if (enable_cpuset) {
		bpf_for(i, 0, nr_cpusets) {
			cpumask = bpf_cpumask_create();

It's also customary to not have a blank line between a variable assignment and the test on it. In scheduler BPF code, we've been doing if ((var = expression)) a lot, so maybe adopt the style?
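A minimal sketch of the style being suggested, using the names from the diff above:

	struct bpf_cpumask *cpumask;

	/* assign and test in one step, no blank line in between */
	if (!(cpumask = bpf_cpumask_create()))
		return -ENOMEM;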
Ahh, LMK if I got that right. I now do that wrt/ simple assignments/checks, but am not sure if it should be done w/ more complex ones (i.e. I see more places I could do that, not sure if I should though).
			if (!cpumask)
				return -ENOMEM;

			bpf_for(j, 0, MAX_CPUS/64) {
Maybe add comments explaining what each block is doing?
			}
		}

		// pay init cost once for faster lookups later.
Why do we need per-cpu copies? Can't this be a part of cpu_ctx? Note that percpu maps have a limited number of hot cache entries per task and there's a performance cliff beyond that.
> Why do we need per-cpu copies?

This is called via set_cpumask. IIUC, this can be called from any CPU, for any task, and would be called for all tasks on workloads utilizing cpusets. This is per-CPU to reduce lookup cost on high core count machines (where cpusets would more likely be used).

> Can't this be a part of cpu_ctx?

I don't think so. cpu_ctx is a map with one entry. This map has an entry per cpuset on the system (i.e. the per-CPU duplication is solely to make runtime lookups faster).

> Note that percpu maps have a limited number of hot cache entries per task and there's a performance cliff beyond that.

I can run an A/B vs EEVDF w/ this (which isn't the highest bar), sort of like how the regression detector does, but I'm not sure how to tell if I've crossed that threshold tbh.
> Why do we need per-cpu copies?

> This is called via set_cpumask. IIUC, this can be called from any CPU, for any task, and would be called for all tasks on workloads utilizing cpusets. This is per-CPU to reduce lookup cost on high core count machines (where cpusets would more likely be used).

I'm not sure how that matters. This is mostly read-only data. There's no synchronization involved in looking up an array map element. It doesn't matter how many CPUs are looking these up. The cachelines would just be shared across the caches.

> Note that percpu maps have a limited number of hot cache entries per task and there's a performance cliff beyond that.

> I can run an A/B vs EEVDF w/ this (which isn't the highest bar), sort of like how the regression detector does, but I'm not sure how to tell if I've crossed that threshold tbh.

All percpu map users share the same cache whether that's sched_ext or something else. The current cache entry limit is 16, and if the number of active percpu maps goes over that, lookups fall back to a slow path which can be significantly slower. This is a system-wide limit: the more per-cpu maps you use, the more likely this becomes a problem, so it's best to keep the number as low as possible. Besides, I just don't understand why a per-cpu map is used in the first place.
Thank you for the explanation!

> Besides, I just don't understand why a per-cpu map is used in the first place.

I was just trying to do everything per CPU for multi-node machines (which I now see was misguided and counterproductive because of that cache entry limit).

Will add documentation w/ the other updates. The TL;DR wrt/ theory of operation is that, if cpusets are being used, the onus is on whoever sets them up to ensure they are LLC aligned, or perf will be affected and the scheduler can't (or perhaps shouldn't) really fix that.
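A rough sketch of what that theory-of-operation comment might say (my wording, not the PR's):

/*
 * Cpuset support (rough draft of the promised documentation):
 *
 * When enable_cpuset is set, userspace forwards each cpuset's cpumask to BPF,
 * and tasks whose effective cpumask exactly matches one of those cpusets are
 * flagged (cpus_cpuset_aligned) so they don't get punted to the lo fallback
 * DSQ. The scheduler does not try to repair misconfigured cpusets: if the
 * cpusets are not LLC aligned, performance may suffer, and keeping them
 * aligned is left to whoever configures them.
 */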
Force-pushed 2b36f07 to 9544831
I think I covered everything, LMK if not! I noted the guidance I didn't quite follow in responses and ran the test cases to confirm things still work.
Force-pushed 4800007 to dbc279c
This isn't going to be true in a lot of cases. I'm not sure how this PR would behave when cpusets aren't necessarily LLC aligned. Maybe it's okay but we really should think it through.
I don't see anything immediately off when running the test program I've been using on a system where all the cpusets are intentionally misaligned, and I've had that test running for a few minutes, so I don't think misalignment would cause stalls (although in testing we're definitely gonna find some, probably around layer growth/shrinkage, I think). Also, I replaced percpu array use with regular array use.
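For reference, a minimal sketch of what that percpu-to-regular array switch looks like in principle (assumed names such as MAX_CPUSETS, cpuset_cpumasks, and task_matches_cpuset; not the PR's actual code): a single shared array of cpumask kptrs that is written once at init and only read afterwards.

struct cpumask_wrapper {
	struct bpf_cpumask __kptr *mask;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);	/* was BPF_MAP_TYPE_PERCPU_ARRAY */
	__uint(max_entries, MAX_CPUSETS);	/* MAX_CPUSETS: assumed constant */
	__type(key, u32);
	__type(value, struct cpumask_wrapper);
} cpuset_cpumasks SEC(".maps");

/* Read-side check: does the task's cpumask exactly match cpuset @cpuset_id? */
static bool task_matches_cpuset(const struct cpumask *task_mask, u32 cpuset_id)
{
	struct cpumask_wrapper *wrapper;
	const struct cpumask *mask;

	if (!(wrapper = bpf_map_lookup_elem(&cpuset_cpumasks, &cpuset_id)) ||
	    !(mask = cast_mask(wrapper->mask)))
		return false;

	return bpf_cpumask_equal(mask, task_mask);
}

Since the table is read-only after init, plain array lookups need no synchronization and avoid consuming the limited per-task percpu-map cache slots mentioned above.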
Force-pushed ecc8621 to 0e9e15e
Force-pushed 0e9e15e to 21ba408
Force-pushed 21ba408 to c3c6d5d
I need to debug this because it still has all tasks going to the lo fallback when I run:
https://github.com/likewhatevs/perfstuff/blob/main/noisy-workload.compose.yml
That being said, this does run/pass the verifier and has all the components needed to make this work, so LMK thoughts on the approach etc.