Stress test the network reliability #1436

locallycompact · 2024-05-16T11:31:53Z

Why

We have experienced many situations in which the head cannot progress. These problems are hard to reproduce, and we have spent a lot of time coordinating to resolve each case.

Records of these issues are here:

#1374
#1415

One possible solution was a manual snapshot recovery as outlined here:

#1416

This is unsatisfying as we would prefer to make the nodes self-healing and not require manual intervention.

What

We should challenge the assumptions of the reliability layer. This layer is currently a combination of ouroboros-network and an implementation of Logged Uniform Reliable Broadcast, as described in https://fileadmin.cs.lth.se/cs/Personal/Amr_Ergawy/dist-algos-slides/fourth-presentation.pdf
We also want to challenge the assumption that the on-disk persistence of the vector clock and outbound messages is actually needed. See: Don't persist the network messages and their acknowledgements #1417
We also want to provide a way for users with stuck heads to collect diagnostic information and submit it to the team for analysis.
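To make the persistence question concrete, here is a minimal sketch of what "persisting the vector clock and outbound messages" could look like. It is not the hydra-node implementation: the class `ReliabilityState`, its JSON file layout, and the helper `restart_demo` are hypothetical names invented for illustration, and the "vector clock" is reduced to one acknowledgement counter per peer.

```python
import json
import os
import tempfile


class ReliabilityState:
    """Hypothetical reliability state: outbound messages plus per-peer ack counters."""

    def __init__(self, peers, path=None):
        self.path = path                      # None means "do not persist"
        self.outbound = []                    # every message we have sent, in order
        self.acked = {p: 0 for p in peers}    # how many of our messages each peer acknowledged
        if path and os.path.exists(path):
            # Restore state written by a previous run.
            with open(path) as f:
                saved = json.load(f)
            self.outbound = saved["outbound"]
            self.acked = saved["acked"]

    def _save(self):
        if self.path:
            with open(self.path, "w") as f:
                json.dump({"outbound": self.outbound, "acked": self.acked}, f)

    def send(self, msg):
        self.outbound.append(msg)
        self._save()

    def ack(self, peer, count):
        # Ack counters only move forward.
        self.acked[peer] = max(self.acked[peer], count)
        self._save()

    def pending_for(self, peer):
        """Messages the given peer has not acknowledged yet."""
        return self.outbound[self.acked[peer]:]


def restart_demo():
    """Compare what a restarted node can resend with and without persistence."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "state.json")
        node = ReliabilityState(["bob"], path)
        node.send("m0")
        node.send("m1")
        node.ack("bob", 1)
        restarted = ReliabilityState(["bob"], path)   # reads state back from disk
        forgetful = ReliabilityState(["bob"])         # starts from scratch
        return restarted.pending_for("bob"), forgetful.pending_for("bob")
```

In this toy model, the persisted node can still resend `m1` after a restart, while the non-persisting node has nothing to resend. Whether that difference matters in practice is exactly the assumption this issue proposes to challenge.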

How

Create a test that stresses the network layer with three or more intermittently failing peers. A failing peer is one that fails to send, receive, or persist network messages.
(Optional) Extract the network layer into its own package to remove coupling.
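The kind of test described above could be sketched as a small simulation. The sketch below is an assumption-laden toy, not the actual Hydra test suite: `Peer`, `run_stress_test`, `down_rate`, and `drop_rate` are hypothetical names, acknowledgements are read directly instead of travelling over the lossy link, and persistence failures are not modelled. It only covers peers that intermittently fail to send or receive.

```python
import random


class Peer:
    """Toy peer that keeps an in-memory outbox and delivers messages in order."""

    def __init__(self, name, names):
        self.name = name
        self.outbox = []                            # messages this peer originated
        self.delivered = {n: [] for n in names}     # messages delivered, per sender

    def broadcast(self, payload):
        self.outbox.append(payload)
        self.delivered[self.name].append(payload)   # a peer delivers its own messages

    def acks(self):
        # Number of messages delivered from each sender.
        return {n: len(msgs) for n, msgs in self.delivered.items()}

    def resend_for(self, peer_acks):
        # (index, payload) pairs the other peer has not acknowledged yet.
        return list(enumerate(self.outbox))[peer_acks[self.name]:]

    def receive(self, sender, index, payload):
        # Accept only the next expected message; gaps are filled by later resends.
        if index == len(self.delivered[sender]):
            self.delivered[sender].append(payload)


def run_stress_test(n_peers=3, n_messages=20, down_rate=0.3, drop_rate=0.2,
                    seed=0, max_rounds=10_000):
    rng = random.Random(seed)                       # seeded so failures are reproducible
    names = [f"peer-{i}" for i in range(n_peers)]
    peers = {n: Peer(n, names) for n in names}
    for i in range(n_messages):
        for n in names:
            peers[n].broadcast(f"{n}:{i}")
    expected = {n: n_messages for n in names}
    for round_no in range(1, max_rounds + 1):
        # Each peer is independently offline for this round.
        down = {n for n in names if rng.random() < down_rate}
        for src in names:
            for dst in names:
                if src == dst or src in down or dst in down:
                    continue
                for index, payload in peers[src].resend_for(peers[dst].acks()):
                    if rng.random() < drop_rate:
                        continue                    # this message is lost in transit
                    peers[dst].receive(src, index, payload)
        if all(p.acks() == expected for p in peers.values()):
            return round_no, peers
    raise AssertionError(f"peers did not converge within {max_rounds} rounds")
```

Calling `run_stress_test(n_peers=5, down_rate=0.5)` exercises a harsher scenario. Because the random generator is seeded, a failing run can be replayed with the same `seed`, which addresses the "hard to reproduce" concern raised in the Why section.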

noonio · 2024-09-04T11:15:39Z

This is effectively resolved by #1552. We will continue to iterate on what we have, but we've taken a great first step!

This was referenced May 16, 2024

Diagnose currently stuck head / spike to fix our head #1415

Closed

Recoverable head state via manual snapshots #1416

Closed

Post mortem of a broken head #1374

Closed

ch1bo changed the title ~~Ensure the head can not get stuck when multiple peers go offline.~~ Stress test the network reliability May 27, 2024

ch1bo assigned ffakenz Jul 10, 2024

ch1bo added the 💭 idea label Jul 10, 2024

ch1bo mentioned this issue Jul 16, 2024

Prove Network Reliability and Fault Tolerance #1505

Closed

5 tasks

ffakenz mentioned this issue Jul 25, 2024

Packet loss fault injection test #1532

Closed

3 tasks

ch1bo mentioned this issue Sep 2, 2024

Spike: Use raft consensus for networking #1591

Closed

noonio closed this as completed Sep 4, 2024
