Jetstream Messages Lost on stream with "No Quorum has Stalled" for each stream error. [v2.10.19] #6236
Comments
Sounds like the cluster was not properly formed. Use the NATS CLI to report on all the servers; make sure the cluster is properly formed and that all nodes have the same number of routes.
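A minimal sketch of that check with the NATS CLI, assuming a CLI context with system-account credentials (the context name `sys` is just a placeholder):

```bash
# List every server known to the cluster; all three nodes should appear,
# with the same cluster name and the same number of routes on each.
nats --context sys server list

# Per-server JetStream usage and meta-cluster health (leader, replicas).
nats --context sys server report jetstream
```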
How do you ensure the cluster is properly formed? Is this something NATS should do automatically?
A similar issue was reported again; here are some additional logs that contained several warnings.
NATS will do its best to form the cluster; from the output above that looks OK when you see a route count of 8. By default we mux 4 connections per server pair, so in a 3-node setup each server should have 8 routes (2 x 4). Could you update to the latest server, 2.10.23? If you still see issues we would need to get on a call and triage your system in real time to understand what is adversely affecting it.
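As a hedged illustration of that 2 x 4 arithmetic: each node peers with 2 other servers, and with 4 muxed connections per pair that is 8 route connections per server. If the HTTP monitoring port is enabled (8222 is assumed here), the live count can be checked directly:

```bash
# Active route connections on this node; in a healthy 3-node cluster
# with 4 connections per peer this should report 8.
curl -s http://localhost:8222/routez | grep num_routes
```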
We are still facing the issue multiple times a day. Is there any possibility that the issue could be with our .NET client? Our current framework is .NET Framework 4.6.2 and the NATS client version is 1.0.8.
Lost quorum is usually a network issue since it means that the leader is not seeing or getting timely responses from enough followers to maintain a quorum.
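As a concrete illustration (standard Raft majority math, not specific to this setup): with 3 replicas a quorum is ⌊3/2⌋ + 1 = 2, so the leader must hear timely responses from at least one follower; if both followers are slow or unreachable for long enough, the stream's Raft group stalls with the "no quorum" warning.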
We also noticed one more anomaly. Node 1:
Node 2:
Node 3:
Do you have something filtering connections between your pods, or were any pods down/scaling/migrating/unreachable? It looks a lot like there's some kind of network condition taking place here.
Routine hardening is done. These are all VMs, and none of them were down/scaling/migrating/unreachable.
All NATS nodes, publishers, and subscribers are in the same VLAN; we are using SDN (Cisco ACI) as the control plane. All communication is Layer 2. Traffic leaves the physical hypervisors towards a switch; all links are 10 Gig and report no errors. I am also attaching ping stats from the leader <> follower nodes. ping_stats-leader-node-to-leaf2.txt
@derekcollison, what if we trade off fault tolerance by reducing replicas to 1 as an interim solution only, considering the suspected but as yet undiscovered network issues? Once resolved, we will go back to 3 replicas. Please note that file-based persistence will remain enabled.
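If that interim step is taken, a hedged sketch with the NATS CLI would look like this (stream name taken from this thread; exact flag support may vary with CLI version, and with R1 the stream depends on a single node's file store):

```bash
# Scale the stream down to a single replica (interim measure only).
nats stream edit PRIORITY_TRANS --replicas=1

# ...and back to 3 replicas once the network issue is resolved.
nats stream edit PRIORITY_TRANS --replicas=3
```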
Synadia can help triage your network and NATS system setup. I do not believe you are a customer but you might want to consider that. We are happy to try to help out, and love our ecosystem of users, but for triaging and diagnosing complex infrastructure setups and NATS systems we prioritize customers. And it feels like we need to get on a video call to properly make progress.
We are in contact with the team and are considering commercial support. In the meantime, we're working to identify the underlying cause in order to find a solution. I really appreciate the support and help with this. Also, today we observed and reproduced the same message-loss behavior in the test environment by introducing an artificial delay of a few seconds between nodes. We discovered that only 812 of the 890 published messages were received on NATS. I've attached logs for your reference: nats-server3-MsgLost.log
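For reference, a sketch of how such a delay can be injected on Linux with tc/netem (the interface name eth0 and the 2-second delay are placeholders for this environment), and how the stream's message count can then be compared against the number published:

```bash
# On one cluster node: add ~2s of outbound delay on the cluster-facing interface.
sudo tc qdisc add dev eth0 root netem delay 2000ms

# ...run the publish test from the client, then remove the delay.
sudo tc qdisc del dev eth0 root netem

# Compare the stream's message count against the number published (e.g. 890).
nats stream info PRIORITY_TRANS
```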
Once you are a customer we can jump on a video call and triage your whole system.
Observed behavior
This happened on the same environment setup as #6090.
We are experiencing message loss in our JetStream cluster whenever quorum is lost for certain streams and consumers. The issue primarily affects the USERS > PRIORITY_TRANS stream, but it has also been observed on USERS > WILDCARD_TRANS. The loss of quorum results in around 30 to 40 lost messages for USERS > PRIORITY_TRANS at that moment.
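For context, the affected stream and its consumers can be inspected with the NATS CLI (a sketch, assuming a context with access to the USERS account where the stream lives):

```bash
# Overview of all streams, including replica placement and message counts.
nats stream report

# Per-consumer state for the affected stream, including unprocessed messages.
nats consumer report PRIORITY_TRANS
```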
A brief overview of the configuration is attached.
The timestamps for these events were 13:06:43 and 14:56:44.
As seen in the cluster logs, a leader election did indeed happen, but even after the streams caught up, several messages had been lost.
Other nodes also reported this "no quorum, stalled" error.
Expected behavior
Upon successful leader election and catching up of the streams, there should be no messages lost.
Server and client version
nats-server --version: 2.10.19
NATS .NET client: 1.0.8 (.NET Framework 4.6.2)
Host environment
3-node NATS cluster.
Ubuntu 22.04 LTS.
Steps to reproduce
The behavior was observed at random.
No network hiccups were observed.