Why
We have been working on improving the resilience of a Hydra Head cluster so that transient failures, whether from the network connection, a node crashing and recovering, or a temporary partition, do not leave the Head stuck and force an expensive closing and reopening of the Head.
We know there are still corner cases that are not covered, but they seem very unlikely to occur and, since these kinds of issues do not breach the intrinsic safety of a Head, we think there is not much value in closing those gaps.
However:
- We may be wrong in evaluating the frequency of those issues: they may occur more easily than we expect.
- We don't have any concrete evaluation of how resilient a Hydra cluster is today with all those changes.
What
Implement some "chaos monkey" tests, possibly manual or semi-automated, that demonstrate whether or not a Hydra Head can survive transient random crashes, network partitions, connection drops, etc.
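As a rough illustration of what such a chaos run could look like, here is a minimal sketch in Haskell using QuickCheck to draw a random schedule of disruptions. The `Fault` and `Step` types and `genSchedule` are hypothetical, not part of the hydra-cluster code base; they only show the shape of a randomized plan a test driver could replay against a running cluster.

```haskell
-- Minimal sketch of a randomized "chaos" schedule. The Fault and Step
-- types and genSchedule are hypothetical, not an existing hydra-cluster
-- API; they only illustrate drawing a random sequence of disruptions.
module ChaosPlan where

import Test.QuickCheck (Gen, chooseInt, elements, listOf1)

-- | One disruptive action against a cluster of n nodes (indexed 0..n-1).
data Fault
  = KillNode Int            -- ^ crash node i and restart it later
  | DropConnection Int Int  -- ^ drop TCP traffic between nodes i and j
  | Partition [Int] [Int]   -- ^ split the cluster into two groups
  deriving (Show)

-- | A fault, the delay before injecting it, and how long it lasts (seconds).
data Step = Step {delay :: Int, duration :: Int, fault :: Fault}
  deriving (Show)

-- | Generate a random chaos schedule for a cluster of n nodes.
genSchedule :: Int -> Gen [Step]
genSchedule n = listOf1 $ do
  d <- chooseInt (1, 30)
  len <- chooseInt (1, 20)
  i <- chooseInt (0, n - 1)
  j <- chooseInt (0, n - 1)
  f <- elements
    [ KillNode i
    , DropConnection i j
    , Partition [i] (filter (/= i) [0 .. n - 1])
    ]
  pure Step {delay = d, duration = len, fault = f}
```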
How
- We could perhaps reuse Jepsen, although it is a bit complicated to set up.
- We did some manual tests using iptables to drop connections and by manually killing nodes, but we would like a more systematic exploration to improve coverage (see the sketch after this list).
- The hydra-cluster benchmarks would be a good basis, e.g. we don't care about L1 connectivity and could even use a mock L1.
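For the iptables-and-kill style of fault injection mentioned above, a test driver could shell out along these lines. The port and pid parameters and any integration with the hydra-cluster benchmark harness are assumptions, and the iptables calls need root (or CAP_NET_ADMIN):

```haskell
-- Minimal sketch of applying faults from a test driver by shelling out,
-- mirroring the manual iptables / kill experiments. The port and pid
-- parameters and any integration with hydra-cluster are assumptions.
module ChaosApply where

import Control.Concurrent (threadDelay)
import Control.Exception (finally)
import System.Process (callProcess)

-- | Drop all inbound TCP traffic to a node's Hydra network port for a
-- number of seconds, then restore it. Needs root / CAP_NET_ADMIN.
dropTrafficFor :: Int -> Int -> IO ()
dropTrafficFor port seconds =
  (iptables "-A" >> threadDelay (seconds * 1000000))
    `finally` iptables "-D"
 where
  iptables flag =
    callProcess "iptables" [flag, "INPUT", "-p", "tcp", "--dport", show port, "-j", "DROP"]

-- | Crash a hydra-node process; the surrounding harness is expected to
-- restart it and check that the Head recovers.
killNode :: Int -> IO ()
killNode pid = callProcess "kill" ["-9", show pid]
```

Something along these lines could be folded into the existing hydra-cluster benchmarks, so that a run with random faults injected still ends with all submitted transactions confirmed.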