How
Create a stress test for the network layer that runs three or more intermittently failing peers. A failing peer is one that fails to send, receive, or persist network messages.
(Optional) Extract the network layer into its own package to decouple it from the rest of the node.
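As a starting point for discussion, the stress test described above could be sketched as a small seeded simulation. This is a minimal sketch under stated assumptions, not the project's test suite: `Peer`, `transfer`, `run_stress_test`, and the `drop_rate` parameter are hypothetical names, failures are modelled as independent random drops, and the sender peeks at the receiver's state instead of exchanging real acknowledgements.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Hypothetical peer: tracks what it broadcast and what it has delivered."""
    name: str
    drop_rate: float  # assumed probability that this peer fails during one transfer
    outbox: list = field(default_factory=list)
    delivered: set = field(default_factory=set)


def transfer(rng, sender, receiver, msg):
    """Attempt one transfer; it fails if either end is failing at that moment."""
    if rng.random() < sender.drop_rate or rng.random() < receiver.drop_rate:
        return False
    receiver.delivered.add(msg)
    return True


def run_stress_test(seed, peer_count=3, messages_per_peer=20,
                    drop_rate=0.5, max_rounds=1000):
    """Return the number of rounds until every peer has delivered every message."""
    rng = random.Random(seed)  # seeded so that a failing run can be replayed
    peers = [Peer(f"peer-{i}", drop_rate) for i in range(peer_count)]
    for p in peers:
        p.outbox = [(p.name, n) for n in range(messages_per_peer)]
        p.delivered.update(p.outbox)  # a peer trivially has its own messages
    expected = {m for p in peers for m in p.outbox}

    for round_no in range(1, max_rounds + 1):
        for sender in peers:
            for receiver in peers:
                if sender is receiver:
                    continue
                # Retransmit whatever the receiver is still missing.
                # Simplification: a real node would learn this from acks.
                for msg in sender.outbox:
                    if msg not in receiver.delivered:
                        transfer(rng, sender, receiver, msg)
        if all(p.delivered == expected for p in peers):
            return round_no
    raise AssertionError("no progress: peers did not converge")
```

Calling `run_stress_test(seed=42)` returns the number of rounds needed to converge, and raises if the simulated peers get stuck. Because the run is seeded, a seed that exposes a stuck state can be recorded and replayed.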
Why
We have experienced many situations in which the head cannot make progress. These problems are hard to reproduce, and in each case we have spent a lot of time coordinating attempts to resolve them.
Records of these issues are here:
#1374
#1415
One possible solution was a manual snapshot recovery as outlined here:
#1416
This is unsatisfying as we would prefer to make the nodes self-healing and not require manual intervention.
What
We should challenge the assumptions of the reliability layer. It is currently a combination of ouroboros-network and an implementation of Logged Uniform Reliable Broadcast, as described in https://fileadmin.cs.lth.se/cs/Personal/Amr_Ergawy/dist-algos-slides/fourth-presentation.pdf
We also want to challenge the assumption that on-disk persistence of the vector clock and outbound messages is actually needed; see "Don't persist the network messages and their acknowledgements" (#1417).
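To make the persistence question concrete, here is a minimal sketch of a generic acknowledgement scheme based on vector clocks. It is an illustration under assumptions, not the node's actual implementation: `messages_to_resend`, the peer names, and the shape of the clock (party name mapped to the count of messages received from that party) are all hypothetical.

```python
def messages_to_resend(sent, acked_by_peer, me):
    """Return the messages `me` broadcast that some peer has not acknowledged.

    sent:          messages this node has broadcast, in order
    acked_by_peer: peer name -> vector clock, where each clock maps a party
                   name to the number of that party's messages received
    """
    # The slowest peer decides how far back we have to resend.
    lowest_ack = min(
        (clock.get(me, 0) for clock in acked_by_peer.values()),
        default=0,  # no acks known yet: assume nothing was received
    )
    return sent[lowest_ack:]


sent = ["m0", "m1", "m2", "m3"]
acks = {
    "bob":   {"alice": 4, "bob": 2, "carol": 1},
    "carol": {"alice": 2, "bob": 2, "carol": 1},
}

# carol has only seen two of alice's messages, so m2 and m3 are pending.
pending = messages_to_resend(sent, acks, "alice")

# If alice restarts with an empty in-memory outbox, she has nothing to resend.
after_restart = messages_to_resend([], acks, "alice")
```

In this sketch, a node that restarts without its outbound messages can no longer fill the gap for a lagging peer. Whether that gap matters in practice, or can be closed some other way, is the assumption this issue proposes to test.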
We also want to provide a way for users with stuck heads to collect diagnostic information and submit it to the team for analysis.