Provide tests covering the resilience of a Hydra Head cluster #1106

Closed
ghost opened this issue Oct 10, 2023 · 1 comment
Labels
L2 Affect off-chain part of the Head protocol/network network 💭 idea An idea or feature request

Comments

@ghost

ghost commented Oct 10, 2023

Why

We have been working on improving the resilience of a Hydra Head cluster so that transient failures, whether from the network connection, a node crashing and recovering, or temporary partitions, do not leave the Head stuck and force an expensive closing and reopening of the Head.

We know there are still corner cases which are not covered, but they seem very unlikely to occur, and as this kind of issue does not breach the intrinsic safety of a Head, we think there's not much value in closing those gaps.

However:

  • We may be wrong in evaluating the frequency of those issues: they may occur more easily than we think.
  • We don't have any concrete evaluation of how resilient a Hydra cluster is today with all those changes.

What

Implement some "Chaos monkey" tests, possibly manual or semi-automated, that demonstrate whether a Hydra Head can survive transient random crashes, network partitions, connection drops, etc.

How

  • We could perhaps reuse Jepsen, although it's a bit complicated to set up.
  • We did some manual tests using iptables to drop connections and manually killing nodes, but we would like a more systematic exploration to improve coverage.
  • The hydra-cluster benchmarks would be a good basis, e.g. we don't care about L1 connectivity and we could even use a mock L1.
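
One way to make the manual iptables/kill experiments more systematic is to derive the faults from a seeded random schedule, so that any run that gets the Head stuck can be replayed exactly. The following is only a minimal sketch of that idea, not anything that exists in hydra-cluster: the `Fault` record, the fault kind names, `make_schedule` and `drop_rules` are all hypothetical, and the script only builds the schedule and the iptables command strings without executing anything.

```python
import random
from dataclasses import dataclass

# Hypothetical fault kinds, named after the failure modes listed in this issue.
FAULT_KINDS = ("kill_node", "drop_connection", "partition")


@dataclass(frozen=True)
class Fault:
    start: float     # seconds since the start of the run
    duration: float  # seconds until the fault is healed again
    kind: str        # one of FAULT_KINDS
    target: int      # index of the affected node in the cluster


def make_schedule(seed, n_nodes, n_faults, run_seconds, max_duration=10.0):
    """Build a reproducible, time-ordered list of transient faults."""
    if run_seconds <= max_duration:
        raise ValueError("run_seconds must exceed max_duration")
    rng = random.Random(seed)  # dedicated generator: same seed, same schedule
    faults = []
    for _ in range(n_faults):
        duration = rng.uniform(1.0, max_duration)
        # Every fault is healed before the run ends, so it stays transient.
        start = rng.uniform(0.0, run_seconds - duration)
        kind = rng.choice(FAULT_KINDS)
        target = rng.randrange(n_nodes)
        faults.append(Fault(start, duration, kind, target))
    return sorted(faults, key=lambda f: f.start)


def drop_rules(port):
    """Return the iptables commands that start and stop dropping inbound TCP on a port."""
    match = f"INPUT -p tcp --dport {port} -j DROP"
    return (f"iptables -A {match}", f"iptables -D {match}")


if __name__ == "__main__":
    for fault in make_schedule(seed=42, n_nodes=3, n_faults=5, run_seconds=120.0):
        print(fault)
```

Logging the seed with each run would let a stuck Head be reproduced by regenerating the same schedule. A test driver (not shown here) would still have to apply each fault to the cluster and check afterwards whether the Head keeps making progress.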
@ghost ghost added network green 💚 Low complexity or well understood feature task Subtask of a bigger feature. L2 Affect off-chain part of the Head protocol/network labels Oct 10, 2023
@ch1bo ch1bo added 💭 idea An idea or feature request and removed green 💚 Low complexity or well understood feature task Subtask of a bigger feature. labels Jun 18, 2024
@ch1bo ch1bo changed the title Provide some tests demonstrating the resilience of a Hydra Head cluster Provide tests covering the resilience of a Hydra Head cluster Jun 18, 2024
@noonio
Contributor

noonio commented Sep 4, 2024

This is effectively resolved by #1552. We will continue to iterate on what we have, but we've taken a great first step!

@noonio noonio closed this as completed Sep 4, 2024