
fix(jsonrpc): fix deadlock in get_all_accounts() #6726

Merged
merged 1 commit, Apr 1, 2025

Conversation

@link2xt link2xt commented Apr 1, 2025

`self.accounts.read().await.get_all()` acquires a read lock
and does not release it until the end of the `for` loop.
After that, a writer may get into the queue,
e.g. because of a concurrent `add_account` call.
In this case `let context_option = self.accounts.read().await.get_account(id);`
tries to acquire another read lock and deadlocks,
because the tokio RwLock is write-preferring and will not
grant another read lock while there is a writer in the queue.
At the same time, the writer never gets the write lock,
because the first read lock is not released.

The fix is to take a single read lock
for the whole `get_all_accounts()` call.

This is described in <https://docs.rs/tokio/1.44.1/tokio/sync/struct.RwLock.html#method.read>:
"Note that under the priority policy of RwLock, read locks are not
granted until prior write locks, to prevent starvation. Therefore
deadlock may occur if a read lock is held by the current task, a write
lock attempt is made, and then a subsequent read lock attempt is made by
the current task."

@link2xt link2xt requested review from iequidoo and Simon-Laux April 1, 2025 11:02
link2xt commented Apr 1, 2025

This is me trying to fix deltachat/deltachat-desktop#4842 (comment)

@link2xt link2xt marked this pull request as draft April 1, 2025 11:25
@link2xt link2xt force-pushed the link2xt/get_all_accounts-lock branch from e2c68ab to d4bbe46 April 1, 2025 11:32
link2xt commented Apr 1, 2025

I still have no idea why it fixes anything, but now we have a regression test that reliably gets stuck without the fix.

@link2xt link2xt marked this pull request as ready for review April 1, 2025 11:33
@link2xt link2xt force-pushed the link2xt/get_all_accounts-lock branch 2 times, most recently from 893d851 to 9cac074 April 1, 2025 11:49
WofWca commented Apr 1, 2025

> I still have no idea why it fixes anything

As someone who doesn't really know the details of how JSON-RPC works, I have a hand-wavy hypothesis (it could be complete gibberish).

Maybe it has to do with sequential handling of JSON-RPC requests (I'm not sure that is actually a thing). `get_all_accounts` keeps acquiring the read lock in a loop, so in theory `select_account` could acquire the write lock while `get_all_accounts` is in the middle of execution.
The problem would then arise if JSON-RPC (hypothetically) won't send the response to `select_account` until `get_all_accounts` has finished, while `get_all_accounts` can't finish until `select_account` has.
This is also supported by the fact that, as I mentioned in deltachat/deltachat-desktop#4842 (comment), the JSON-RPC server stops responding to all requests (though it keeps sending events).

Thanks for fixing it!

link2xt commented Apr 1, 2025

> though it will keep sending events

Events are also requests; they are received by calling `get_next_event()`.

Also, JSON-RPC requests can run in parallel; there is even a test for this that runs `sleep` calls concurrently.

```diff
@@ -227,8 +227,9 @@ impl CommandApi {
     /// Get a list of all configured accounts.
     async fn get_all_accounts(&self) -> Result<Vec<Account>> {
         let mut accounts = Vec::new();
         for id in self.accounts.read().await.get_all() {
             let context_option = self.accounts.read().await.get_account(id);
```
@link2xt link2xt Apr 1, 2025

I added some debug logging, and it turns out `get_all_accounts` gets stuck trying to take a lock here.

I think the problem is that the first `read()` gets the lock, then `add_account()` gets into the queue for the write lock, and then this line tries to take another read lock. But the tokio RwLock is write-preferring and will not admit another reader while there is a writer in the queue: https://docs.rs/tokio/latest/tokio/sync/struct.RwLock.html
So the writer waits forever, and the first read lock is also never released by `get_all_accounts`, because it is waiting for the second read lock.

It's not obvious, but I think the first `read()` lock is not released until the end of the `for` loop.
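
For reference, a sketch of the single-lock version; the structure follows the commit message, but the loop body is paraphrased and `Account::from_context` is only assumed here:

```rust
/// Get a list of all configured accounts.
async fn get_all_accounts(&self) -> Result<Vec<Account>> {
    let mut accounts = Vec::new();
    // Take one read guard up front and hold it for the whole call,
    // so no writer can queue up between two read() acquisitions.
    let accounts_read = self.accounts.read().await;
    for id in accounts_read.get_all() {
        // Reuse the guard instead of calling self.accounts.read() again.
        if let Some(ctx) = accounts_read.get_account(id) {
            // Hypothetical conversion; the real loop body may differ.
            accounts.push(Account::from_context(&ctx, id).await?);
        }
    }
    Ok(accounts)
}
```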

Collaborator

I think you're right. The https://docs.rs/tokio/latest/tokio/sync/struct.RwLock.html#method.read docs also mention this:

> deadlock may occur if a read lock is held by the current task, a write lock attempt is made, and then a subsequent read lock attempt is made by the current task.

Collaborator Author

I added this reference to the commit message.

@link2xt link2xt force-pushed the link2xt/get_all_accounts-lock branch from 9cac074 to 7f6d444 April 1, 2025 12:27
@link2xt link2xt changed the title fix: simplify locking in JSON-RPC get_all_accounts() call fix(jsonrpc): fix deadlock in get_all_accounts() Apr 1, 2025
@link2xt link2xt force-pushed the link2xt/get_all_accounts-lock branch from 7f6d444 to 8098444 April 1, 2025 12:29
@WofWca WofWca left a comment

Really great job with the debugging!

@Hocuri Hocuri left a comment

Nice!

iequidoo commented Apr 2, 2025

Btw, we could use something like https://docs.rs/timed-locks/latest/timed_locks/ just for safety; see the Motivation paragraph there, it is exactly about deadlocks. Of course we need errors, not panics, which adds complexity, but if the alternative is hanging the whole program, it may make sense. Not a Rust invention, of course; some C libraries implementing their own locking have the same kind of locks, and it looks usable to me.
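
A hand-rolled sketch of that idea, assuming nothing beyond tokio and anyhow (this is not the timed-locks API; `read_or_timeout` is a hypothetical helper):

```rust
use std::time::Duration;
use tokio::sync::{RwLock, RwLockReadGuard};
use tokio::time::timeout;

/// Hypothetical helper: return an error instead of hanging forever
/// when the read lock cannot be acquired within `dur`.
async fn read_or_timeout<T>(
    lock: &RwLock<T>,
    dur: Duration,
) -> anyhow::Result<RwLockReadGuard<'_, T>> {
    timeout(dur, lock.read())
        .await
        .map_err(|_| anyhow::anyhow!("possible deadlock: read lock not acquired within {dur:?}"))
}
```

A caller could then surface the error through JSON-RPC instead of wedging the whole server.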

link2xt commented Apr 2, 2025

Timeouts are not good for the tests: if CI runs in a VM that may be paused, the timeouts will fire randomly whenever the runner is suspended at the wrong moment.

parking_lot has experimental deadlock detection, but I don't know how it works, and it is probably experimental for a reason. It is also not async.

There is an async lock, https://docs.rs/async-lock/latest/async_lock/struct.RwLock.html, but it seems to be the same as tokio's: "write-preferring".

iequidoo commented Apr 2, 2025

We could use different timeouts depending on configuration; there is RwLock::new_with_timeout() there, so CI would still detect deadlocks. I'd avoid read-preferring locks, or at least we would need to check the code to make sure we don't starve writers. Moreover, read-preferring locks don't solve the deadlock problem completely: one can still produce an ABBA deadlock with two mutexes. Maybe an RwLock with the following policy would help: a reader may acquire the lock if the oldest existing (holding or waiting) lock is a reader too; if the oldest lock is a writer, the reader must wait. But I'm not aware of existing implementations.
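
For context, the ABBA pattern mentioned above, as a minimal sketch with two tokio mutexes; the timeout exists only to make the hang observable instead of wedging the process:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;
use tokio::time::{sleep, timeout};

#[tokio::main]
async fn main() {
    let a = Arc::new(Mutex::new(()));
    let b = Arc::new(Mutex::new(()));

    // Task 1 locks A, then B.
    let (a1, b1) = (Arc::clone(&a), Arc::clone(&b));
    let t1 = tokio::spawn(async move {
        let _ga = a1.lock().await;
        sleep(Duration::from_millis(50)).await; // ensure task 2 holds B by now
        let _gb = b1.lock().await; // waits for task 2 forever
    });

    // Task 2 locks B, then A.
    let (a2, b2) = (Arc::clone(&a), Arc::clone(&b));
    let t2 = tokio::spawn(async move {
        let _gb = b2.lock().await;
        sleep(Duration::from_millis(50)).await; // ensure task 1 holds A by now
        let _ga = a2.lock().await; // waits for task 1 forever
    });

    // Each task holds one lock and waits for the other: classic ABBA deadlock.
    let hung = timeout(Duration::from_secs(1), async {
        let _ = t1.await;
        let _ = t2.await;
    })
    .await
    .is_err();
    assert!(hung, "expected the two tasks to deadlock");
}
```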

WofWca commented Apr 4, 2025

It appears that the bug this PR fixes is also the reason why our E2E CI has been failing from time to time:

https://github.com/deltachat/deltachat-desktop/actions/runs/14255114113/job/39956261540

I thought it was because clicking "paste" somehow does nothing. But if you look at the screenshot, you can see the good old hourglass being displayed on the new account.

[screenshot: the new account stuck showing the hourglass indicator]
