Why is the CPU overhead of tokio tasks significantly higher than that of goroutines? #6257
-
Someone else did a benchmark:
- Goroutine
- Tokio Task
- Another tokio task version with channels
I know that we're not supposed to use tokio for
while goroutine is designed as a "general-purpose" coroutine. I'm just curious: what was the tokio runtime doing that cost those extra CPU cycles?
Replies: 3 comments 6 replies
-
My guess would be that Go has a better implementation of
-
Is there any way to work around this in a benchmark? I had tried this, but it didn't help:
-
I made a flamegraph of the rust atomic one: `cargo flamegraph --release -c 'record --call-graph dwarf,16384 -e cpu-clock'`

TL;DR: All the tokio timers are kept behind a single RwLock (backed by a mutex) in the tokio global timer driver. The CPU usage is contention trying to lock that mutex: 10k timers are all trying to take the read lock, while the tokio runtime's `thread::park` is trying to get an exclusive lock so it can find the next timer to wait for. The timer's `poll_elapsed` calls `reregister`, which takes a read lock to access this linked list of timers. The lock is also written to every time a thread is parked by the tokio scheduler, so it can set the next wake-up.

I tried a few workarounds, including using timers from other crates, randomizing the wait time, and having a single real thread do a `thread::sleep()` and then notify all the workers over a broadcast channel, but they all seemed to end up with mutex contention.

I made a single-threaded version that runs ok (though it still uses 37% of a single core with 10k tasks):

```rust
use tokio::{
    io::{AsyncBufReadExt, BufReader},
    time::sleep,
};

// static NUM: AtomicI64 = AtomicI64::new(0);
static mut NUM: usize = 0;

async fn fff() {
    let t = tokio::time::Duration::from_millis(15);
    loop {
        sleep(t).await;
        // SAFETY: We're running a single threaded app
        unsafe { NUM += 1 };
    }
}

fn main() {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_time()
        .build()
        .unwrap();
    let _enter_guard = rt.enter();
    let mut handles = Vec::new();
    for _ in 0..10000 {
        println!("Spawning");
        let handle = tokio::spawn(fff());
        handles.push(handle);
    }
    println!("over");
    let end = rt.spawn(async {
        let stdin = tokio::io::stdin();
        let reader = BufReader::new(stdin);
        let mut lines = reader.lines();
        loop {
            lines.next_line().await.unwrap();
            // SAFETY: We're running in a single threaded app
            let num = unsafe { NUM };
            println!("{num} {}", chrono::Local::now().format("%Y-%m-%d %H:%M:%S"));
        }
    });
    rt.block_on(end).unwrap();
}
```

and the Cargo.toml:

```toml
[package]
name = "rust-benchmark"
version = "0.1.0"
edition = "2021"

[dependencies]
chrono = "0.4.39"
rand = "0.8.5"
tokio = { version = "1.42.0", features = ["io-std", "io-util", "macros", "rt", "time"] }
```

In the flamegraph of the single-threaded one, it still … On my machine, the go version takes about 110% of a cpu core. I'm guessing that the difference is that the go one is maybe multi-process single threaded and doesn't have a global list of timers that needs access control. (Feel free to correct my assumptions, go bros.)
A few thoughts: the `while let Ok(i)` part of your loop kills the workers when lagged errors occur, and they can occur in your program. In general, `if let Ok` and `while let Ok` are almost always wrong due to incorrect error handling (whereas `if let Some` / `while let Some` usually isn't wrong).