-
-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(jsonrpc): fix deadlock in get_all_accounts() #6726
Conversation
This is me trying to fix deltachat/deltachat-desktop#4842 (comment) |
e2c68ab
to
d4bbe46
Compare
I still have no idea why it fixes anything, but now we have a regression test that gets stuck reliably without a fix. |
893d851
to
9cac074
Compare
As someone who doesn't really know the details of how JSON-RPC works, I have a hand-wavy hypothesis (it could be complete gibberish). Maybe it has to do with sequential handling of JSON-RPC requests (I'm not sure if it's a thing though). Thanks for fixing it! |
Events are also requests, they are received by calling Also JSON-RPC requests can run in parallel, there is even a test for this running |
@@ -227,8 +227,9 @@ impl CommandApi { | |||
/// Get a list of all configured accounts. | |||
async fn get_all_accounts(&self) -> Result<Vec<Account>> { | |||
let mut accounts = Vec::new(); | |||
for id in self.accounts.read().await.get_all() { | |||
let context_option = self.accounts.read().await.get_account(id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added some debug logging and it turns out get_all_accounts
get stuck trying to get a lock here.
I think the problem is that the first read()
gets the lock, then add_account()
gets into the queue to get the write lock, then this line tries to get another read lock. But tokio RwLock is write-preferring and will not allow another reader when there is a writer in the queue: https://docs.rs/tokio/latest/tokio/sync/struct.RwLock.html
So writer waits forever, and first read lock is also not released by get_all_accounts
because it wants a second read lock.
It's not obvious, but I think first read()
lock is not released until the end of for
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right. https://docs.rs/tokio/latest/tokio/sync/struct.RwLock.html#method.read docs also mention this:
deadlock may occur if a read lock is held by the current task, a write lock attempt is made, and then a subsequent read lock attempt is made by the current task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this reference to the commit message.
9cac074
to
7f6d444
Compare
7f6d444
to
8098444
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great job with the debugging!
`self.accounts.read().await.get_all()` acquires a read lock and does not release it until the end of `for` loop. After that, a writer may get into the queue, e.g. because of the concurrent `add_account` call. In this case `let context_option = self.accounts.read().await.get_account(id);` tries to acquire another read lock and deadlocks because tokio RwLock is write-preferring and will not give another read lock while there is a writer in the queue. At the same time, writer never gets a write lock because the first read lock is not released. The fix is to get a single read lock for the whole `get_all_accounts()` call. This is described in <https://docs.rs/tokio/1.44.1/tokio/sync/struct.RwLock.html#method.read>: "Note that under the priority policy of RwLock, read locks are not granted until prior write locks, to prevent starvation. Therefore deadlock may occur if a read lock is held by the current task, a write lock attempt is made, and then a subsequent read lock attempt is made by the current task."
8098444
to
7056386
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Btw, we can use smth like https://docs.rs/timed-locks/latest/timed_locks/ just for safety, see Motivation paragraph there, it exactly tells about deadlocks. Of course we need errors, not panicking, this adds complexity, but if an alternative is hanging the whole program, it may make sense. Not a Rust invention of course, some C libs implementing their own locking have the same locks, looks usable as for me. |
Timeouts are not good for the tests because if CI runs in a VM that may be paused, these timeouts will fire from time to time randomly if the runner was suspended at the wrong moment.
There is an async lock https://docs.rs/async-lock/latest/async_lock/struct.RwLock.html, but it seems to be the same as tokio, "write-preferring". |
We can use different timeouts depending on configuration, there's |
It appears that the bug that this MR fixes is also the reason why our E2E CI has been failing from time to time https://github.com/deltachat/deltachat-desktop/actions/runs/14255114113/job/39956261540 I thought that it's because somehow clicking "paste" does nothing. But if you look at the screenshot, you can see the good old hourglass being displayed on the new account |
self.accounts.read().await.get_all()
acquires a read lockand does not release it until the end of
for
loop.After that, a writer may get into the queue,
e.g. because of the concurrent
add_account
call.In this case
let context_option = self.accounts.read().await.get_account(id);
tries to acquire another read lock and deadlocks
because tokio RwLock is write-preferring and will not
give another read lock while there is a writer in the queue.
At the same time, writer never gets a write lock
because the first read lock is not released.
The fix is to get a single read lock
for the whole
get_all_accounts()
call.This is described in https://docs.rs/tokio/1.44.1/tokio/sync/struct.RwLock.html#method.read:
"Note that under the priority policy of RwLock, read locks are not
granted until prior write locks, to prevent starvation. Therefore
deadlock may occur if a read lock is held by the current task, a write
lock attempt is made, and then a subsequent read lock attempt is made by
the current task."