
Jetstream Messages Lost on stream with "No Quorum has Stalled" for each stream error. [v2.10.19] #6236

Open
NoumanNawaz51 opened this issue Dec 10, 2024 · 15 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@NoumanNawaz51

NoumanNawaz51 commented Dec 10, 2024

Observed behavior

This happened on the same environment setup as #6090.

We are experiencing message loss in our JetStream cluster whenever quorum is lost for certain streams and consumers. The issue primarily affects the USERS > PRIORITY_TRANS stream, but it has also been observed on USERS > WILDCARD_TRANS. Each loss of quorum results in around 30 to 40 lost messages on USERS > PRIORITY_TRANS at that moment.

  • During the day we also saw messages lost at random, hours apart.

A brief overview of the configuration (an illustrative sketch follows the list below):

  • Number of nodes: 3
  • Replicas: 3
  • Storage: file-based
  • Retention policy: work queue
  • No rollups, automatic deletion, or limits allowed
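
For illustration, a minimal sketch of a stream with this shape, written in Go against the nats.go JetStream API; the stream name, subject, and server URL below are placeholders rather than the reporter's actual configuration:

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder URL; connect to any node of the 3-node cluster.
	nc, err := nats.Connect("nats://10.1.18.182:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Replicas 3, file-based storage, work-queue retention, no rollups,
	// manual deletes, or purges (DenyDelete/DenyPurge are assumptions).
	_, err = js.AddStream(&nats.StreamConfig{
		Name:       "PRIORITY_TRANS",             // placeholder stream name
		Subjects:   []string{"priority.trans.>"}, // placeholder subject
		Replicas:   3,
		Storage:    nats.FileStorage,
		Retention:  nats.WorkQueuePolicy,
		DenyDelete: true,
		DenyPurge:  true,
	})
	if err != nil {
		log.Fatal(err)
	}
}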

The timestamps for these events were 13:06:43 and 14:56:44.
As seen in the cluster logs, a leader election did take place, but even after the streams caught up we still lost several messages.

[750565] 2024/12/10 07:06:48.053713 [WRN] JetStream cluster consumer 'USERS > COMMON > COMMONR_EASY_6D' has NO quorum, stalled.
[750565] 2024/12/10 07:06:49.172011 [WRN] JetStream cluster stream 'USERS > COMMON' has NO quorum, stalled
[750565] 2024/12/10 07:06:50.146342 [WRN] JetStream cluster consumer 'USERS > COMMON > COMMONR_EASYD' has NO quorum, stalled.
[750565] 2024/12/10 07:06:51.792598 [WRN] JetStream cluster stream 'USERS > SEPERATE' has NO quorum, stalled
[750565] 2024/12/10 07:06:53.675972 [WRN] JetStream cluster consumer 'USERS > COMMON > COMMONSAFENET_SUB_PAYD' has NO quorum, stalled.
[750565] 2024/12/10 07:06:55.724170 [INF] JetStream cluster new consumer leader for 'USERS > COMMON > COMMONI_REQD'
[750565] 2024/12/10 07:06:55.724455 [INF] JetStream cluster new consumer leader for 'USERS > SEPERATE > SEPERATER_DC_EASYD'
[750565] 2024/12/10 07:06:55.724843 [INF] JetStream cluster new consumer leader for 'USERS > CHANNEL > CHANNELD_2_2'
[750565] 2024/12/10 07:06:55.725173 [INF] JetStream cluster new stream leader for 'USERS > SEPERATE'
[750565] 2024/12/10 07:07:00.211853 [INF] JetStream cluster new consumer leader for 'USERS > COMMON > COMMONR_EASY_6D'
[750565] 2024/12/10 07:07:19.195635 [INF] JetStream cluster new consumer leader for 'USERS > COMMON > COMMOND_EASY_8D'
[750565] 2024/12/10 07:07:19.361885 [INF] JetStream cluster new consumer leader for 'USERS > COMMON > COMMONR_DC_DBD'
[750565] 2024/12/10 07:07:23.167177 [INF] JetStream cluster new consumer leader for 'USERS > CHANNEL > CHANNELD_1_3'
[750565] 2024/12/10 07:07:23.343719 [INF] JetStream cluster new consumer leader for 'USERS > CHANNEL > CHANNELD_3_1'
[750565] 2024/12/10 07:07:23.345870 [INF] JetStream cluster new consumer leader for 'USERS > CHANNEL > CHANNELD_1_1'
[750565] 2024/12/10 07:07:23.7538 [WRN] RAFT [D72AFGGG - C-RT5-OILK456] Resetting WAL state
[750565] 2024/12/10 07:07:24.022377 [INF] JetStream cluster new stream leader for 'USERS > SEPERATE'
[750565] 2024/12/10 07:07:26.925922 [INF] JetStream cluster new consumer leader for 'USERS > CHANNEL > CHANNELD_3_2'
[750565] 2024/12/10 07:07:26.928570 [INF] Self is new JetStream cluster metadata leader

Other nodes also reported this "no quorum, stalled" error.

Nats Cluster Lookup

Expected behavior

Upon successful leader election and stream catch-up, no messages should be lost.

Server and client version

nats-server --version: 2.10.19

Host environment

3-node NATS cluster.
Ubuntu 22.04 LTS.

Steps to reproduce

The behavior was observed at random.
No network hiccups were observed.

@derekcollison
Member

Sounds like the cluster was not properly formed. Use the NATS CLI to report all the servers, make sure the cluster is properly formed, and check that all nodes have the same number of routes.

nats server ls from the system account.

@wallyqs wallyqs changed the title Jetstream Messages Lost on stream with "No Quorum has Stalled" for each stream error. Jetstream Messages Lost on stream with "No Quorum has Stalled" for each stream error. [v2.10.19] Dec 11, 2024
@eskibla

eskibla commented Dec 11, 2024

clustering is properly formed

How do you ensure the cluster is properly formed? Is this something NATS should do automatically?

@NoumanNawaz51
Author

Node 1
node1 report

Node2
node2 report

Node3
node3 report

@NoumanNawaz51
Author

NoumanNawaz51 commented Dec 11, 2024

A similar issue was reported again; here are some additional logs with several warnings.

Line 4830: [2197907] 2024/12/09 03:16:53.706675 [WRN] Catchup for stream 'USERS > COMMON' resetting first sequence: 32041938 on catchup request
Line 10441: [793545] 2024/12/10 16:54:39.583088 [WRN] Consumer 'USERS > SEPERATE > SEPERATEN_DC_SDB' error on store update from snapshot entry: bad pending entry, sequence [5957709] out of range
Line 10443: [793545] 2024/12/10 16:54:39.617248 [WRN] Waiting for routing to be established...
Line 10445: [793545] 2024/12/10 16:54:39.644617 [WRN] JetStream has not established contact with a meta leader

@derekcollison
Member

NATS will do its best to form the cluster; from the above, that looks OK when you see 8 routes. By default we mux 4 connections per server pair, so in a 3-node setup each server should have 8 routes (2 x 4).

Could you update to the latest server? 2.10.23?

If you still see issues we would need to get on a call and triage your system in real time to understand what is adversely affecting it.
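
To double-check the route count described above on each node, here is a small sketch that queries the /routez monitoring endpoint and prints the reported number of routes. It assumes the HTTP monitoring port (commonly 8222) is enabled on each server; the addresses are placeholders:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Placeholder node addresses.
	servers := []string{"10.1.18.182", "10.1.18.124", "10.1.18.112"}

	for _, s := range servers {
		resp, err := http.Get(fmt.Sprintf("http://%s:8222/routez", s))
		if err != nil {
			fmt.Printf("%s: %v\n", s, err)
			continue
		}
		var rz struct {
			NumRoutes int `json:"num_routes"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&rz); err != nil {
			fmt.Printf("%s: decode error: %v\n", s, err)
		} else {
			// Expect 8 routes per server in a 3-node cluster (2 peers x 4 connections).
			fmt.Printf("%s: %d routes\n", s, rz.NumRoutes)
		}
		resp.Body.Close()
	}
}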

@NoumanNawaz51
Author

We are still facing the issue multiple times a day. Is there any possibility that the issue could be with our .NET client? Our current framework is 4.6.2 and the NATS client version is 1.0.8.

@derekcollison
Member

Lost quorum is usually a network issue since it means that the leader is not seeing or getting timely responses from enough followers to maintain a quorum.
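
On the publishing side, one way to make such no-quorum windows visible is to publish with JetStream acknowledgements and treat a missing ack as a retryable failure, so that rejected or timed-out publishes surface as errors instead of disappearing silently. A minimal sketch in Go with nats.go (not the reporter's .NET client code; the subject and retry policy are assumptions):

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Wait up to 5s for a publish acknowledgement from the stream.
	js, err := nc.JetStream(nats.MaxWait(5 * time.Second))
	if err != nil {
		log.Fatal(err)
	}

	msg := []byte("payload")
	for attempt := 1; attempt <= 5; attempt++ {
		// Synchronous publish: returns only once the stream has acknowledged
		// the message, or with an error/timeout (e.g. during a no-quorum window).
		ack, err := js.Publish("priority.trans.new", msg) // placeholder subject
		if err == nil {
			log.Printf("stored in stream %s at sequence %d", ack.Stream, ack.Sequence)
			return
		}
		log.Printf("publish attempt %d not acknowledged: %v", attempt, err)
		time.Sleep(time.Second)
	}
	log.Fatal("giving up: the message was never acknowledged by the stream")
}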

@NoumanNawaz51
Author

We also noticed one more anomaly.

Node 1:

[977] 2024/12/13 02:34:38.710608 [INF] 10.1.18.182:6222 - rid:114 - Route connection created
[977] 2024/12/13 02:34:38.710782 [INF] 10.1.18.182:6222 - rid:113 - Router connection closed: Duplicate Route
[977] 2024/12/13 02:34:38.710890 [INF] 10.1.18.182:6222 - rid:114 - Router connection closed: Duplicate Route
[977] 2024/12/13 02:34:38.794241 [INF] 10.1.18.182:6222 - rid:116 - Route connection created
[977] 2024/12/13 02:34:38.794260 [INF] 10.1.18.182:6222 - rid:115 - Route connection created
[977] 2024/12/13 02:34:38.794458 [INF] 10.1.18.182:6222 - rid:116 - Router connection closed: Duplicate Route
[977] 2024/12/13 02:34:38.794506 [INF] 10.1.18.182:6222 - rid:115 - Router connection closed: Duplicate Route
[977] 2024/12/13 07:36:51.258131 [INF] JetStream cluster new stream leader for 'USERS > XXX'
[977] 2024/12/13 08:22:08.382213 [WRN] JetStream cluster stream 'USERS > XXX' has NO quorum, stalled

Node 2:

[985] 2024/12/13 02:34:37.777969 [ERR] Error trying to connect to route (attempt 32): dial tcp 10.1.18.182:6222: connect: connection refused
[985] 2024/12/13 02:34:37.975718 [INF] 10.1.18.182:58650 - rid:PRIORITY - Route connection created
[985] 2024/12/13 02:34:37.975779 [INF] 10.1.18.182:58644 - rid:104 - Route connection created
[985] 2024/12/13 02:34:37.992346 [INF] 10.1.18.182:58652 - rid:105 - Route connection created
[985] 2024/12/13 02:34:38.038365 [INF] 10.1.18.182:58656 - rid:106 - Route connection created
[985] 2024/12/13 02:34:38.713022 [INF] 10.1.18.182:6222 - rid:107 - Route connection created
[985] 2024/12/13 02:34:38.713191 [INF] 10.1.18.182:6222 - rid:107 - Router connection closed: Duplicate Route
[985] 2024/12/13 02:34:38.727564 [INF] 10.1.18.182:6222 - rid:108 - Route connection created
[985] 2024/12/13 02:34:38.727732 [INF] 10.1.18.182:6222 - rid:108 - Router connection closed: Duplicate Route
[985] 2024/12/13 02:34:38.749028 [INF] 10.1.18.182:6222 - rid:109 - Route connection created
[985] 2024/12/13 02:34:38.749232 [INF] 10.1.18.182:6222 - rid:109 - Router connection closed: Duplicate Route
[985] 2024/12/13 02:34:38.778529 [INF] 10.1.18.182:6222 - rid:110 - Route connection created
[985] 2024/12/13 02:34:38.778747 [INF] 10.1.18.182:6222 - rid:110 - Router connection closed: Duplicate Route
[985] 2024/12/13 08:21:52.515148 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXY_XDurable'
[985] 2024/12/13 08:21:53.353890 [INF] JetStream cluster new stream leader for 'USERS > XXX'
[985] 2024/12/13 08:22:08.948743 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXY_X_OOODurable'
[985] 2024/12/13 12:36:54.623229 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXY_X_196Durable'
[985] 2024/12/13 12:37:09.405746 [INF] JetStream cluster new stream leader for 'USERS > PRIORITY'

Node 3:

[989] 2024/12/13 02:34:38.751933 [INF] 10.1.18.124:57928 - rid:95 - Route connection created
[989] 2024/12/13 02:34:38.752117 [INF] 10.1.18.124:57928 - rid:95 - Router connection closed: Client Closed
[989] 2024/12/13 02:34:38.781442 [INF] 10.1.18.124:57938 - rid:96 - Route connection created
[989] 2024/12/13 02:34:38.781676 [INF] 10.1.18.124:57938 - rid:96 - Router connection closed: Duplicate Route
[989] 2024/12/13 02:34:38.799458 [INF] 10.1.18.112:60348 - rid:97 - Route connection created
[989] 2024/12/13 02:34:38.799489 [INF] 10.1.18.112:60336 - rid:98 - Route connection created
[989] 2024/12/13 02:34:38.799674 [INF] 10.1.18.112:60348 - rid:97 - Router connection closed: Client Closed
[989] 2024/12/13 02:34:38.799714 [INF] 10.1.18.112:60336 - rid:98 - Router connection closed: Client Closed
[989] 2024/12/13 07:37:09.271095 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXDNS_X_198Durable'
[989] 2024/12/13 07:37:09.321685 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXDNS_XDurable'
[989] 2024/12/13 07:37:09.408001 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXY_X_OOODurable'
[989] 2024/12/13 08:21:52.846826 [INF] JetStream cluster new stream leader for 'USERS > PRIORITY'
[989] 2024/12/13 08:22:08.774605 [INF] JetStream cluster new consumer leader for 'USERS > PRIORITY > PRIORITYY_XXX_XDurable'
[989] 2024/12/13 12:36:54.593148 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXY_XXX_DATABASEDurable'
[989] 2024/12/13 12:37:09.481499 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXY_X_OOODurable'

@neilalexander
Member

Error trying to connect to route (attempt 32): dial tcp 10.1.18.182:6222: connect: connection refused

Do you have something filtering connections between your pods, or were any pods down/scaling/migrating/unreachable? It looks a lot like there's some kind of network condition taking place here.

@NoumanNawaz51
Author

Error trying to connect to route (attempt 32): dial tcp 10.1.18.182:6222: connect: connection refused

Do you have something filtering connections between your pods, or were any pods down/scaling/migrating/unreachable? It looks a lot like there's some kind of network condition taking place here.

Routine hardening is done. These are all VMs and none of them were down/scaling/migrating/unreachable.
Ping is working on all of them.

@NoumanNawaz51
Author

Error trying to connect to route (attempt 32): dial tcp 10.1.18.182:6222: connect: connection refused

Do you have something filtering connections between your pods, or were any pods down/scaling/migrating/unreachable? It looks a lot like there's some kind of network condition taking place here.

Routine hardening is done. These are all VMs and none of them were down/scaling/migrating/unreachable. Ping is working on all of them.

All NATS nodes, publishers, and subscribers are in the same VLAN; we are using SDN (Cisco ACI) as the control plane. All communication is Layer 2. Traffic leaves the physical hypervisors towards a switch; all links are 10 Gig and report no errors.

I am also attaching ping stats from leader <> follower nodes.

ping_stats-leader-node-to-leaf2.txt
ping_stats-leaf1-node-to-leader.txt
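
Since ICMP ping succeeding does not prove that the cluster (route) port accepts TCP connections, a small connectivity sketch that dials port 6222 on each peer may help; the addresses below are taken from the logs and may need adjusting:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Cluster (route) port of each peer, per the logs above.
	peers := []string{"10.1.18.182:6222", "10.1.18.124:6222", "10.1.18.112:6222"}

	for _, addr := range peers {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
		if err != nil {
			fmt.Printf("%s: dial failed: %v\n", addr, err)
			continue
		}
		fmt.Printf("%s: reachable in %v\n", addr, time.Since(start))
		conn.Close()
	}
}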

@NoumanNawaz51
Author

@derekcollison, what if we trade off fault tolerance by reducing replicas to 1 as an interim solution only, considering the suspected/potential undiscovered network issues? Once resolved, we will go back to replicas 3. Please note that file-based persistence will remain enabled.
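
For reference, a sketch of how the proposed interim change (replicas 3 -> 1) could be applied as a stream configuration update with nats.go; the stream name is a placeholder and this is only an illustration, not a recommendation:

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the current stream configuration (placeholder stream name).
	info, err := js.StreamInfo("PRIORITY_TRANS")
	if err != nil {
		log.Fatal(err)
	}

	// Interim step only: drop to a single replica; file-based storage stays as-is.
	cfg := info.Config
	cfg.Replicas = 1
	if _, err := js.UpdateStream(&cfg); err != nil {
		log.Fatal(err)
	}
	log.Println("stream scaled down to 1 replica")
}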

@derekcollison
Member

Synadia can help triage your network and NATS system setup. I do not believe you are a customer but you might want to consider that. We are happy to try to help out, and love our ecosystem of users, but for triaging and diagnosing complex infrastructure setups and NATS systems we prioritize customers. And it feels like we need to get on a video call to properly make progress.

@NoumanNawaz51
Author

Synadia can help triage your network and NATS system setup. I do not believe you are a customer but you might want to consider that. We are happy to try to help out, and love our ecosystem of users, but for triaging and diagnosing complex infrastructure setups and NATS systems we prioritize customers. And it feels like we need to get on a video call to properly make progress.

We are in contact with the team and are considering commercial support. In the meantime, we're working to identify the underlying cause in order to find a solution. I really appreciate the support and help with this.

Also, today we reproduced the same message-loss behavior in the test environment by introducing an artificial delay of a few seconds between nodes. We found that only 812 of the 890 published messages were received on NATS. I've attached logs for your reference.

nats-server3-MsgLost.log
nats-server1-MsgLost.log
nats-server2-MsgLost.log
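
A sketch of one way to narrow down where the missing messages went in such a test: publish the batch synchronously and count positive acknowledgements. Publishes that error or time out were never accepted by the stream, while acked-but-missing messages point at loss after storage. Written in Go with nats.go; the subject is a placeholder and the count simply matches the reported batch size:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream(nats.MaxWait(10 * time.Second))
	if err != nil {
		log.Fatal(err)
	}

	const total = 890 // matches the size of the reported test batch
	acked := 0
	for i := 0; i < total; i++ {
		// Placeholder subject; each publish waits for a stream acknowledgement.
		if _, err := js.Publish("priority.trans.test", []byte(fmt.Sprintf("msg-%d", i))); err != nil {
			log.Printf("publish %d not acknowledged: %v", i, err)
			continue
		}
		acked++
	}
	// If all publishes are acked but the consumer sees fewer messages, the loss
	// happened after storage; un-acked publishes were never accepted by the stream.
	fmt.Printf("published %d, acknowledged %d\n", total, acked)
}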

@derekcollison
Member

Once you are a customer we can jump on a video call and triage your whole system.
