Why
We have been working on improving the resilience of a Hydra Head cluster so that transient failures, whether from the network connection, a node crashing and recovering, or a temporary partition, do not leave the Head stuck and force an expensive closing and reopening of the Head.
We know there are still corner cases that are not covered, but they seem very unlikely to occur and, since these kinds of issues do not breach the intrinsic safety of a Head, we think there is not much value in closing those gaps.
However:
- We may be wrong in evaluating the frequency of those issues: they may occur more easily than we expect.
- We don't have any concrete evaluation of how resilient a Hydra cluster is today with all those changes.
What
Implement some "chaos monkey" tests, possibly manual or semi-automated, that demonstrate whether or not a Hydra Head can survive transient random crashes, network partitions, connection drops, etc.
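As a rough illustration of what such a chaos run could look like, here is a minimal sketch in Haskell using QuickCheck to draw a random schedule of disruptions. The `Fault` and `Step` types and `genSchedule` are hypothetical, not part of the hydra-cluster code base; they only show the shape of a randomized plan a test driver could replay against a running cluster.

```haskell
-- Minimal sketch of a randomized "chaos" schedule. The Fault and Step
-- types and genSchedule are hypothetical, not an existing hydra-cluster
-- API; they only illustrate drawing a random sequence of disruptions.
module ChaosPlan where

import Test.QuickCheck (Gen, chooseInt, elements, listOf1)

-- | One disruptive action against a cluster of n nodes (indexed 0..n-1).
data Fault
  = KillNode Int            -- ^ crash node i and restart it later
  | DropConnection Int Int  -- ^ drop TCP traffic between nodes i and j
  | Partition [Int] [Int]   -- ^ split the cluster into two groups
  deriving (Show)

-- | A fault, the delay before injecting it, and how long it lasts (seconds).
data Step = Step {delay :: Int, duration :: Int, fault :: Fault}
  deriving (Show)

-- | Generate a random chaos schedule for a cluster of n nodes.
genSchedule :: Int -> Gen [Step]
genSchedule n = listOf1 $ do
  d <- chooseInt (1, 30)
  len <- chooseInt (1, 20)
  i <- chooseInt (0, n - 1)
  j <- chooseInt (0, n - 1)
  f <- elements
    [ KillNode i
    , DropConnection i j
    , Partition [i] (filter (/= i) [0 .. n - 1])
    ]
  pure Step {delay = d, duration = len, fault = f}
```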
How
- We could perhaps reuse Jepsen, although it is a bit complicated to set up.
- We did some manual tests using iptables to drop connections and by manually killing nodes, but we would like a more systematic exploration to improve coverage (see the sketch after this list).
- The hydra-cluster benchmarks would be a good basis, e.g. we don't care about L1 connectivity and could even use a mock L1.
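For the iptables-and-kill style of fault injection mentioned above, a test driver could shell out along these lines. The port and pid parameters and any integration with the hydra-cluster benchmark harness are assumptions, and the iptables calls need root (or CAP_NET_ADMIN):

```haskell
-- Minimal sketch of applying faults from a test driver by shelling out,
-- mirroring the manual iptables / kill experiments. The port and pid
-- parameters and any integration with hydra-cluster are assumptions.
module ChaosApply where

import Control.Concurrent (threadDelay)
import Control.Exception (finally)
import System.Process (callProcess)

-- | Drop all inbound TCP traffic to a node's Hydra network port for a
-- number of seconds, then restore it. Needs root / CAP_NET_ADMIN.
dropTrafficFor :: Int -> Int -> IO ()
dropTrafficFor port seconds =
  (iptables "-A" >> threadDelay (seconds * 1000000))
    `finally` iptables "-D"
 where
  iptables flag =
    callProcess "iptables" [flag, "INPUT", "-p", "tcp", "--dport", show port, "-j", "DROP"]

-- | Crash a hydra-node process; the surrounding harness is expected to
-- restart it and check that the Head recovers.
killNode :: Int -> IO ()
killNode pid = callProcess "kill" ["-9", show pid]
```

Something along these lines could be folded into the existing hydra-cluster benchmarks, so that a run with random faults injected still ends with all submitted transactions confirmed.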