layered cpuset support #1747
Conversation
Force-pushed 612f186 to 64f4056
So AFAICT the changes are:
- Machinery for forwarding cpuset information to the BPF side
- The corresponding BPF code
- Logic that replicates allow_node_aligned but for cpusets.
If this fixes cpuset-related problems I think it is reasonable, but I'm not sure about the naming - container enable is a bit confusing because containers aren't really a thing at this level of abstraction. Maybe replace "container" with "cpuset-based workloads"? This way it's clear what the code does concretely.
Force-pushed 495915d to 33cb330
Force-pushed 33cb330 to ed4c527
I still have trouble understanding how this is supposed to work. Node aligned is easier because DSQs are LLC aligned and LLCs are node aligned. We don't have the guarantee that cpusets are LLC aligned. Is that okay? If so, why? Can you please document the theory of operation?
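For concreteness, here is a hedged sketch (not code from the PR; llc_cpumask() and nr_llcs are assumed helpers/globals) of what "a cpuset is LLC aligned" would mean: every LLC's CPUs are either fully inside or fully outside the cpuset mask.

/*
 * Hypothetical illustration only: a cpuset mask is "LLC aligned" when no LLC
 * is partially covered, i.e. each LLC's CPUs are either all in the cpuset or
 * all outside it. llc_cpumask() and nr_llcs are assumed helpers/globals.
 */
static bool cpuset_is_llc_aligned(const struct cpumask *cpuset_mask)
{
	const struct cpumask *llc_mask;
	u32 llc_id;

	bpf_for(llc_id, 0, nr_llcs) {
		if (!(llc_mask = llc_cpumask(llc_id)))
			return false;

		/* partial overlap: the cpuset covers some but not all of this LLC */
		if (bpf_cpumask_intersects(llc_mask, cpuset_mask) &&
		    !bpf_cpumask_subset(llc_mask, cpuset_mask))
			return false;
	}

	return true;
}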
@@ -1381,9 +1399,11 @@ void BPF_STRUCT_OPS(layered_enqueue, struct task_struct *p, u64 enq_flags)
	 * without making the whole scheduler node aware and should only be used
	 * with open layers on non-saturated machines to avoid possible stalls.
	 */
Please update the comment to explain cpus_cpuset_aligned.
@@ -2658,7 +2678,7 @@ static void refresh_cpus_flags(struct task_ctx *taskc,

		if (!(nodec = lookup_node_ctx(node_id)) ||
		    !(node_cpumask = cast_mask(nodec->cpumask)))
-			return;
+			break;
This is an scx_bpf_error() condition. There's no point in continuing.
made this an explicit error instead of just a return, thanks.
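Presumably something along these lines (a sketch of the shape of the fix, not the actual diff):

		/* Sketch only: error out explicitly instead of silently bailing. */
		if (!(nodec = lookup_node_ctx(node_id)) ||
		    !(node_cpumask = cast_mask(nodec->cpumask))) {
			scx_bpf_error("refresh_cpus_flags: node %u lookup failed", node_id);
			return;
		}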
@@ -2667,6 +2687,21 @@ static void refresh_cpus_flags(struct task_ctx *taskc,
			break;
		}
	}
	if (enable_cpuset) {
Maybe a blank line above?
@@ -2667,6 +2687,21 @@ static void refresh_cpus_flags(struct task_ctx *taskc,
			break;
		}
	}
	if (enable_cpuset) {
		bpf_for(cpuset_id, 0, nr_cpusets) {
			struct cpumask_wrapper* wrapper;
Blank line.
I think I figured out the whitespace convention (before if, before for, sometimes after for), but LMK if there are still places it's wrong.
Hmmm... yeah, on machines with single-LLC nodes, this is fine, but if there are multiple LLCs per node, the current allow_node_aligned condition can lead to unexpected behaviors as tasks can end up in a DSQ that a layer doesn't have CPUs on. But going back to this PR, if cpusets aren't aligned with LLCs, wouldn't we have a similar problem?
> Hmmm... yeah, on machines with single-LLC nodes, this is fine, but if there are multiple LLCs per node, the current allow_node_aligned condition can lead to unexpected behaviors as tasks can end up in a DSQ that a layer doesn't have CPUs on.

Will fix that in a separate PR, thx.

> But going back to this PR, if cpusets aren't aligned with LLCs, wouldn't we have a similar problem?

Yeah, I was just able to produce that kind of behavior w/ my test setup, gonna have to add something to deal with that.
			}
			if (bpf_cpumask_equal(cast_mask(wrapper->mask), cpumask)) {
				taskc->cpus_cpuset_aligned = true;
				return;
Use break; here so that it's consistent with the node aligned block and the function can be extended in the future? Note that this would require moving the false setting. BTW, why not use the same partial overlapping test used by the node alignment test instead of an equality test? Is that not sufficient for the forward progress guarantee? If not, it'd probably be worthwhile to explain why.
Ahh, correct me if I'm wrong, but isn't that partial test wrong / shouldn't the node code do what the cpuset code does?

For example:

task_mask   = 1000
system_mask = 1100

system_mask intersect task_mask == true
system_mask subset task_mask    == false

IIUC, allow_node_aligned is only supposed to avoid the lo fallback when a task's cpumask matches a node cpumask, so that the lo fallback is used for tasks with single-CPU affinities (i.e. the ones that are stall prone)?
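To make that example concrete, here is a tiny stand-alone (non-BPF) sketch of the two tests being compared; the helper names are made up for illustration:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* a and b are CPU bitmasks; bit i set means CPU i is allowed. */
static bool intersects(uint64_t a, uint64_t b) { return (a & b) != 0; }
static bool is_subset(uint64_t a, uint64_t b)  { return (a & ~b) == 0; } /* a subset of b */

int main(void)
{
	uint64_t task_mask   = 0x8; /* 1000 */
	uint64_t system_mask = 0xc; /* 1100 */

	/* The partial-overlap test passes... */
	assert(intersects(system_mask, task_mask));
	/* ...even though the system mask is not contained in the task mask. */
	assert(!is_subset(system_mask, task_mask));
	return 0;
}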
	if (enable_cpuset) {
		bpf_for(i, 0, nr_cpusets) {
			cpumask = bpf_cpumask_create();

It's also customary to not have a blank line between a variable assignment and the test on it. In scheduler BPF code, we've been doing if ((var = expression)) a lot, so maybe adopt the style?
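A minimal sketch of the style being suggested, using the names from the diff above:

	struct bpf_cpumask *cpumask;

	/* assign and test in one step, no blank line in between */
	if (!(cpumask = bpf_cpumask_create()))
		return -ENOMEM;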
Ahh, LMK if I got that right. I now do that wrt/ simple assignments/checks, but am not sure if it should be done w/ more complex ones (i.e. I see more places I could do that, not sure if I should though).
			if (!cpumask)
				return -ENOMEM;

			bpf_for(j, 0, MAX_CPUS/64) {
Maybe add comments explaining what each block is doing?
			}
		}

		// pay init cost once for faster lookups later.
Why do we need per-cpu copies? Can't this be a part of cpu_ctx? Note that percpu maps have a limited number of hot cache entries per task and there's a performance cliff beyond that.
> Why do we need per-cpu copies?

This is called via set_cpumask. IIUC, this can be called from any CPU, for any task, and would be called for all tasks on workloads utilizing cpusets. This is per-CPU to reduce lookup cost on high core count machines (where cpusets would more likely be used).

> Can't this be a part of cpu_ctx?

I don't think so. cpu_ctx is a map with one entry. This map has an entry per cpuset on the system (i.e. the per-CPU duplication is solely to make runtime lookups faster).

> Note that percpu maps have a limited number of hot cache entries per task and there's a performance cliff beyond that.

I can run an A/B vs EEVDF w/ this (which isn't the highest bar), sort of like how the regression detector does, but I'm not sure how to tell if I've crossed that threshold tbh.
> Why do we need per-cpu copies?

> This is called via set_cpumask. IIUC, this can be called from any CPU, for any task, and would be called for all tasks on workloads utilizing cpusets. This is per-CPU to reduce lookup cost on high core count machines (where cpusets would more likely be used).

I'm not sure how that matters. This is mostly read-only data. There's no synchronization involved in looking up an array map element. It doesn't matter how many CPUs are looking these up. The cachelines would just be shared across the caches.

> Note that percpu maps have a limited number of hot cache entries per task and there's a performance cliff beyond that.

> I can run an A/B vs EEVDF w/ this (which isn't the highest bar), sort of like how the regression detector does, but I'm not sure how to tell if I've crossed that threshold tbh.

All percpu map users share the same cache whether that's sched_ext or something else. The current cache entry limit is 16, and if the number of active percpu maps goes over that, lookups fall back to a slow path which can be significantly slower. This is a system-wide limit: the more per-cpu maps you use, the more likely this becomes a problem, so it's best to keep the number as low as possible. Besides, I just don't understand why a per-cpu map is used in the first place.
Thank you for the explanation!

> Besides, I just don't understand why a per-cpu map is used in the first place.

I was just trying to do everything per CPU for multi-node machines (which I now see was misguided and counterproductive because of that cache entry limit).

Will add documentation w/ the other updates. The TL;DR wrt/ theory of operation is that, if cpusets are being used, the onus is on whoever sets them up to ensure they are LLC aligned, or perf will be affected and the scheduler can't (or perhaps shouldn't) really fix that.
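A rough sketch of what that theory-of-operation comment might say (my wording, not the PR's):

/*
 * Cpuset support (rough draft of the promised documentation):
 *
 * When enable_cpuset is set, userspace forwards each cpuset's cpumask to BPF,
 * and tasks whose effective cpumask exactly matches one of those cpusets are
 * flagged (cpus_cpuset_aligned) so they don't get punted to the lo fallback
 * DSQ. The scheduler does not try to repair misconfigured cpusets: if the
 * cpusets are not LLC aligned, performance may suffer, and keeping them
 * aligned is left to whoever configures them.
 */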
Force-pushed 2b36f07 to 9544831
I think I covered everything, LMK if not! I noted the guidance I didn't quite follow in responses and ran the test cases to confirm things still work.
Force-pushed 4800007 to dbc279c
This isn't going to be true in a lot of cases. I'm not sure how this PR would behave when cpusets aren't necessarily LLC aligned. Maybe it's okay but we really should think it through.
I don't see anything immediately off when running the test program I've been using on a system where all the cpusets are intentionally misaligned, and I've had that test running for a few minutes, so I don't think misalignment would cause stalls (although in testing we're definitely gonna find some, probably around layer growth/shrinkage, I think). Also, I replaced percpu array use with regular array use.
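For reference, a minimal sketch of what that percpu-to-regular array switch looks like in principle (assumed names such as MAX_CPUSETS, cpuset_cpumasks, and task_matches_cpuset; not the PR's actual code): a single shared array of cpumask kptrs that is written once at init and only read afterwards.

struct cpumask_wrapper {
	struct bpf_cpumask __kptr *mask;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);	/* was BPF_MAP_TYPE_PERCPU_ARRAY */
	__uint(max_entries, MAX_CPUSETS);	/* MAX_CPUSETS: assumed constant */
	__type(key, u32);
	__type(value, struct cpumask_wrapper);
} cpuset_cpumasks SEC(".maps");

/* Read-side check: does the task's cpumask exactly match cpuset @cpuset_id? */
static bool task_matches_cpuset(const struct cpumask *task_mask, u32 cpuset_id)
{
	struct cpumask_wrapper *wrapper;
	const struct cpumask *mask;

	if (!(wrapper = bpf_map_lookup_elem(&cpuset_cpumasks, &cpuset_id)) ||
	    !(mask = cast_mask(wrapper->mask)))
		return false;

	return bpf_cpumask_equal(mask, task_mask);
}

Since the table is read-only after init, plain array lookups need no synchronization and avoid consuming the limited per-task percpu-map cache slots mentioned above.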
Force-pushed ecc8621 to 0e9e15e
Force-pushed 0e9e15e to 21ba408
Force-pushed 21ba408 to c3c6d5d
I need to debug this because it still has all tasks going to the lo fallback when I run:
https://github.com/likewhatevs/perfstuff/blob/main/noisy-workload.compose.yml
That being said, this does run/pass the verifier and has all the components needed to make this work, so LMK thoughts on the approach etc.