Why is the CPU overhead of tokio tasks significantly higher than that of goroutines? #6257
-
Someone else did a benchmark:
- Goroutine
- Tokio Task
- Another tokio task version with channels
I know that we're not supposed to use tokio for
while goroutine is designed as a "general-purpose" coroutine. I'm just curious: what was the tokio runtime doing that cost those extra CPU cycles?
Replies: 3 comments 6 replies
-
My guess would be that Go has a better implementation of
-
Is there any way to work around this in a benchmark? I had tried this, but it didn't help:
-
I made a flamegraph of the rust atomic one: `cargo flamegraph --release -c 'record --call-graph dwarf,16384 -e cpu-clock'`

TL;DR: All the tokio timers are kept behind a single RwLock (backed by a mutex) in the tokio global timer driver. The CPU usage is contention trying to lock that mutex: 10k timers are all trying to take the read lock, while the tokio runtime's `thread::park` is trying to get an exclusive lock so it can find the next timer to wait for. The timer's `poll_elapsed` calls `reregister`, which takes a read lock to access this linked list of timers. The lock is also written to every time a thread is parked by the tokio scheduler, so it can set the next wake-up.

I tried a few workarounds, including using timers from other crates, randomizing the wait time, and having a single real thread do a `thread::sleep()` and then notify all the workers over a broadcast channel, but they all seemed to end up with mutex contention.

I made a single-threaded version that runs ok (though it still uses 37% of a single core with 10k tasks):

```rust
use tokio::{
    io::{AsyncBufReadExt, BufReader},
    time::sleep,
};

// static NUM: AtomicI64 = AtomicI64::new(0);
static mut NUM: usize = 0;

async fn fff() {
    let t = tokio::time::Duration::from_millis(15);
    loop {
        sleep(t).await;
        // SAFETY: We're running a single threaded app
        unsafe { NUM += 1 };
    }
}

fn main() {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_time()
        .build()
        .unwrap();
    let _enter_guard = rt.enter();
    let mut handles = Vec::new();
    for _ in 0..10000 {
        println!("Spawning");
        let handle = tokio::spawn(fff());
        handles.push(handle);
    }
    println!("over");
    let end = rt.spawn(async {
        let stdin = tokio::io::stdin();
        let reader = BufReader::new(stdin);
        let mut lines = reader.lines();
        loop {
            lines.next_line().await.unwrap();
            // SAFETY: We're running in a single threaded app
            let num = unsafe { NUM };
            println!("{num} {}", chrono::Local::now().format("%Y-%m-%d %H:%M:%S"));
        }
    });
    rt.block_on(end).unwrap();
}
```

and the Cargo.toml:

```toml
[package]
name = "rust-benchmark"
version = "0.1.0"
edition = "2021"

[dependencies]
chrono = "0.4.39"
rand = "0.8.5"
tokio = { version = "1.42.0", features = ["io-std", "io-util", "macros", "rt", "time"] }
```

In the flamegraph of the single-threaded one, it still … On my machine, the go version takes about 110% of a cpu core. I'm guessing that the difference is that the go one is maybe multi-process single threaded and doesn't have a global list of timers that needs access control. (Feel free to correct my assumptions, go bros.)
A few thoughts: the `while let Ok(i)` part of your loop kills the workers when lagged errors occur, and they can occur in your program. In general, `if let Ok` and `while let Ok` are almost always wrong due to incorrect error handling (whereas `if let Some` / `while let Some` usually isn't wrong).