Situation
Franco changed his machine instance type on Monday morning, 2024-03-25, which changed his IP address
The others updated their node configs with the new IP address over the course of 2024-03-25 and 2024-03-26
Today, 2024-03-26, hydraw did not work
Observations
Latest confirmed snapshot is 228
The latest SnapshotRequested stored in the state of Franco's node is 228, while the others have SnapshotRequested for 229
Judging from the sent network-messages, Arnaud was the snapshot leader for 229
Franco gets the ReqSn from Arnaud, but it results in WaitOnTxs with 523c82b607eb5e449de5cec68f0a40501cf6be19db5bc9f246bcd93502d41cd8 missing
Sasha is sending most ReqTx in this setup as he is hosting the hydraw instance
Ack counters:
Franco: [276,608,275,276,278]
Sasha: [276,615,275,276,278]
Dan: [276,615,275,276,278]
Sebastian: [276,615,275,276,278]
Sasha's network-messages only has 607 lines
Sasha's logs contain ReliabilityFailedToFindMsg
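The gaps in the counters above can be made explicit with a small script. This is only a sketch: the counters are copied from the observations, but the reading that each slot belongs to one party and that slot 1 (zero-based) is Sasha's is an assumption, and `lagging` and `unpersisted` are hypothetical helper names, not part of hydra-node.

```python
# Ack counters as listed in the observations above.
# ASSUMPTION: one slot per party, and slot 1 (zero-based) is Sasha's.
ACKS = {
    "Franco":    [276, 608, 275, 276, 278],
    "Sasha":     [276, 615, 275, 276, 278],
    "Dan":       [276, 615, 275, 276, 278],
    "Sebastian": [276, 615, 275, 276, 278],
}
PERSISTED_LINES = 607  # number of lines in Sasha's network-messages
SASHA_SLOT = 1         # assumption, see above


def lagging(acks):
    """Return {(node, slot): gap} for every counter below the highest one seen for that slot."""
    slots = len(next(iter(acks.values())))
    highest = [max(v[i] for v in acks.values()) for i in range(slots)]
    return {
        (node, i): highest[i] - v[i]
        for node, v in acks.items()
        for i in range(slots)
        if v[i] < highest[i]
    }


def unpersisted(acks, slot, persisted):
    """How many messages were counted by peers for a slot but are not in the sender's persisted log."""
    highest = max(v[slot] for v in acks.values())
    return max(0, highest - persisted)


print(lagging(ACKS))
print(unpersisted(ACKS, SASHA_SLOT, PERSISTED_LINES))
```

Under these assumptions, Franco is the only node behind, and only in one slot, while the highest counter for that slot exceeds the number of persisted lines by the same figure as the "last 8 messages" mentioned further down.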
What happened?
Identify root causes and address in this or follow-up items:
Sasha's ReqTx for tx id 523c82b607eb5e449de5cec68f0a40501cf6be19db5bc9f246bcd93502d41cd8 never reached Franco, and Sasha's "Reliability" persistence failed?
We could manually "fix" this part of the problem by duplicating the last 8 messages in Sasha's network-messages, which would have them resent to Franco's node.
The ReqSn sent by Arnaud did reach Franco's node, which started to WaitOnTxs, but restarting Franco's node made that ReqSn disappear, and it will never be resent (it was already acknowledged through the network acks). Consequently, the protocol is stuck, because it assumes that a node keeps a message once it has acknowledged it. In fact, the head logic re-enqueues inputs it can't act on yet, and this queue is ephemeral!
If the ReqSn had included the transactions it snapshots (which it once did), this would be less of a problem.
Is this a combined problem of optimizing snapshot requests vs. having a coordinated protocol?
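The interaction between acknowledgements and the ephemeral re-enqueue queue can be illustrated with a toy model. It is a sketch under stated assumptions, not the hydra-node implementation: `ToyNode`, its fields, and `run` are hypothetical names, messages are plain tuples, and the node is assumed to acknowledge a message on receipt whether or not the head logic can process it yet.

```python
from dataclasses import dataclass, field

TX_ID = "523c82b607eb5e449de5cec68f0a40501cf6be19db5bc9f246bcd93502d41cd8"


@dataclass
class ToyNode:
    seen_txs: set = field(default_factory=set)
    snapshot_requested: int = 228
    acked: int = 0                               # assumed persisted, survives restarts
    pending: list = field(default_factory=list)  # in-memory only, lost on restart

    def receive(self, msg):
        # Acknowledge first: the sender will never resend this message.
        self.acked += 1
        self._step(msg)

    def _step(self, msg):
        kind, payload = msg
        if kind == "ReqTx":
            self.seen_txs.add(payload)
            # Retry everything that was waiting on transactions.
            waiting, self.pending = self.pending, []
            for m in waiting:
                self._step(m)
        elif kind == "ReqSn":
            number, tx_ids = payload
            if set(tx_ids) <= self.seen_txs:
                self.snapshot_requested = number
            else:
                # WaitOnTxs: re-enqueue until the missing transactions arrive.
                self.pending.append(msg)

    def restart(self):
        # Only the ephemeral queue is dropped; ack counter and state remain.
        self.pending = []


def run(restart_in_between):
    node = ToyNode()
    node.receive(("ReqSn", (229, [TX_ID])))
    if restart_in_between:
        node.restart()
    node.receive(("ReqTx", TX_ID))
    return node.snapshot_requested


print(run(restart_in_between=False))
print(run(restart_in_between=True))
```

In this model, the node that restarts while waiting stays on snapshot 228 even if the missing ReqTx is delivered later, because the acknowledged ReqSn exists nowhere anymore. Persisting the queue, acknowledging only after processing, or carrying the transactions inside the ReqSn would each avoid the stall in this toy setting; whether any of these fits the real protocol is the open question above.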
ch1bo changed the title from "Post mortem of the broken head" to "Post mortem of a broken head" on Mar 27, 2024