
Jetstream Messages Lost on stream with "No Quorum has Stalled" for each stream error. [v2.10.19] #6236

Open
NoumanNawaz51 opened this issue Dec 10, 2024 · 15 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@NoumanNawaz51

NoumanNawaz51 commented Dec 10, 2024

Observed behavior

This happened on the same environment setup as #6090.

We are experiencing message loss in our JetStream cluster whenever quorum is lost for certain streams and consumers. The issue primarily affects the USERS > PRIORITY_TRANS stream, but it has also been observed on USERS > WILDCARD_TRANS. Each loss of quorum results in around 30 to 40 lost messages on USERS > PRIORITY_TRANS at that moment.

  • During the day we also saw messages lost at random, hours apart.

A brief overview of the configuration (an illustrative sketch follows the list below):

  • Number of nodes: 3
  • Replicas: 3
  • Storage: file-based
  • Retention policy: work queue
  • No rollups, automatic deletion, or limits allowed
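
For illustration, a minimal sketch of a stream with this shape, written in Go against the nats.go JetStream API; the stream name, subject, and server URL below are placeholders rather than the reporter's actual configuration:

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder URL; connect to any node of the 3-node cluster.
	nc, err := nats.Connect("nats://10.1.18.182:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Replicas 3, file-based storage, work-queue retention, no rollups,
	// manual deletes, or purges (DenyDelete/DenyPurge are assumptions).
	_, err = js.AddStream(&nats.StreamConfig{
		Name:       "PRIORITY_TRANS",             // placeholder stream name
		Subjects:   []string{"priority.trans.>"}, // placeholder subject
		Replicas:   3,
		Storage:    nats.FileStorage,
		Retention:  nats.WorkQueuePolicy,
		DenyDelete: true,
		DenyPurge:  true,
	})
	if err != nil {
		log.Fatal(err)
	}
}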

The timestamps for these events were 13:06:43 and 14:56:44.
As seen in the cluster logs, a leader election did take place, but even after the streams caught up we still lost several messages.

[750565] 2024/12/10 07:06:48.053713 [WRN] JetStream cluster consumer 'USERS > COMMON > COMMONR_EASY_6D' has NO quorum, stalled.
[750565] 2024/12/10 07:06:49.172011 [WRN] JetStream cluster stream 'USERS > COMMON' has NO quorum, stalled
[750565] 2024/12/10 07:06:50.146342 [WRN] JetStream cluster consumer 'USERS > COMMON > COMMONR_EASYD' has NO quorum, stalled.
[750565] 2024/12/10 07:06:51.792598 [WRN] JetStream cluster stream 'USERS > SEPERATE' has NO quorum, stalled
[750565] 2024/12/10 07:06:53.675972 [WRN] JetStream cluster consumer 'USERS > COMMON > COMMONSAFENET_SUB_PAYD' has NO quorum, stalled.
[750565] 2024/12/10 07:06:55.724170 [INF] JetStream cluster new consumer leader for 'USERS > COMMON > COMMONI_REQD'
[750565] 2024/12/10 07:06:55.724455 [INF] JetStream cluster new consumer leader for 'USERS > SEPERATE > SEPERATER_DC_EASYD'
[750565] 2024/12/10 07:06:55.724843 [INF] JetStream cluster new consumer leader for 'USERS > CHANNEL > CHANNELD_2_2'
[750565] 2024/12/10 07:06:55.725173 [INF] JetStream cluster new stream leader for 'USERS > SEPERATE'
[750565] 2024/12/10 07:07:00.211853 [INF] JetStream cluster new consumer leader for 'USERS > COMMON > COMMONR_EASY_6D'
[750565] 2024/12/10 07:07:19.195635 [INF] JetStream cluster new consumer leader for 'USERS > COMMON > COMMOND_EASY_8D'
[750565] 2024/12/10 07:07:19.361885 [INF] JetStream cluster new consumer leader for 'USERS > COMMON > COMMONR_DC_DBD'
[750565] 2024/12/10 07:07:23.167177 [INF] JetStream cluster new consumer leader for 'USERS > CHANNEL > CHANNELD_1_3'
[750565] 2024/12/10 07:07:23.343719 [INF] JetStream cluster new consumer leader for 'USERS > CHANNEL > CHANNELD_3_1'
[750565] 2024/12/10 07:07:23.345870 [INF] JetStream cluster new consumer leader for 'USERS > CHANNEL > CHANNELD_1_1'
[750565] 2024/12/10 07:07:23.7538 [WRN] RAFT [D72AFGGG - C-RT5-OILK456] Resetting WAL state
[750565] 2024/12/10 07:07:24.022377 [INF] JetStream cluster new stream leader for 'USERS > SEPERATE'
[750565] 2024/12/10 07:07:26.925922 [INF] JetStream cluster new consumer leader for 'USERS > CHANNEL > CHANNELD_3_2'
[750565] 2024/12/10 07:07:26.928570 [INF] Self is new JetStream cluster metadata leader

Other nodes also reported this "no quorum, stalled" error.

Nats Cluster Lookup

Expected behavior

Upon successful leader election and stream catch-up, no messages should be lost.

Server and client version

nats-server --version: 2.10.19

Host environment

3-node NATS cluster.
Ubuntu 22.04 LTS.

Steps to reproduce

The behavior was observed at random.
No network hiccups were observed.

@derekcollison
Member

Sounds like the cluster was not properly formed. Use the NATS CLI to report all the servers, make sure the cluster is properly formed, and check that all nodes have the same number of routes.

nats server ls from the system account.

@wallyqs wallyqs changed the title Jetstream Messages Lost on stream with "No Quorum has Stalled" for each stream error. Jetstream Messages Lost on stream with "No Quorum has Stalled" for each stream error. [v2.10.19] Dec 11, 2024
@eskibla

eskibla commented Dec 11, 2024

clustering is properly formed

How do you ensure the cluster is properly formed? Is this something NATS should do automatically?

@NoumanNawaz51
Author

Node 1
node1 report

Node2
node2 report

Node3
node3 report

@NoumanNawaz51
Author

NoumanNawaz51 commented Dec 11, 2024

A similar issue was reported again; here are some additional logs with several warnings.

Line 4830: [2197907] 2024/12/09 03:16:53.706675 [WRN] Catchup for stream 'USERS > COMMON' resetting first sequence: 32041938 on catchup request
Line 10441: [793545] 2024/12/10 16:54:39.583088 [WRN] Consumer 'USERS > SEPERATE > SEPERATEN_DC_SDB' error on store update from snapshot entry: bad pending entry, sequence [5957709] out of range
Line 10443: [793545] 2024/12/10 16:54:39.617248 [WRN] Waiting for routing to be established...
Line 10445: [793545] 2024/12/10 16:54:39.644617 [WRN] JetStream has not established contact with a meta leader

@derekcollison
Member

NATS will do its best to form the cluster; from the above, that looks OK when you see 8 routes. By default we mux 4 connections per server pair, so in a 3-node setup each server should have 8 routes (2 x 4).

Could you update to the latest server? 2.10.23?

If you still see issues we would need to get on a call and triage your system in real time to understand what is adversely affecting it.
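
To double-check the route count described above on each node, here is a small sketch that queries the /routez monitoring endpoint and prints the reported number of routes. It assumes the HTTP monitoring port (commonly 8222) is enabled on each server; the addresses are placeholders:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Placeholder node addresses.
	servers := []string{"10.1.18.182", "10.1.18.124", "10.1.18.112"}

	for _, s := range servers {
		resp, err := http.Get(fmt.Sprintf("http://%s:8222/routez", s))
		if err != nil {
			fmt.Printf("%s: %v\n", s, err)
			continue
		}
		var rz struct {
			NumRoutes int `json:"num_routes"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&rz); err != nil {
			fmt.Printf("%s: decode error: %v\n", s, err)
		} else {
			// Expect 8 routes per server in a 3-node cluster (2 peers x 4 connections).
			fmt.Printf("%s: %d routes\n", s, rz.NumRoutes)
		}
		resp.Body.Close()
	}
}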

@NoumanNawaz51
Author

We are still facing the issue multiple times a day. Is there any possibility that the issue could be with our .NET client? Our current framework is 4.6.2 and the NATS client version is 1.0.8.

@derekcollison
Member

Lost quorum is usually a network issue since it means that the leader is not seeing or getting timely responses from enough followers to maintain a quorum.
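
On the publishing side, one way to make such no-quorum windows visible is to publish with JetStream acknowledgements and treat a missing ack as a retryable failure, so that rejected or timed-out publishes surface as errors instead of disappearing silently. A minimal sketch in Go with nats.go (not the reporter's .NET client code; the subject and retry policy are assumptions):

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Wait up to 5s for a publish acknowledgement from the stream.
	js, err := nc.JetStream(nats.MaxWait(5 * time.Second))
	if err != nil {
		log.Fatal(err)
	}

	msg := []byte("payload")
	for attempt := 1; attempt <= 5; attempt++ {
		// Synchronous publish: returns only once the stream has acknowledged
		// the message, or with an error/timeout (e.g. during a no-quorum window).
		ack, err := js.Publish("priority.trans.new", msg) // placeholder subject
		if err == nil {
			log.Printf("stored in stream %s at sequence %d", ack.Stream, ack.Sequence)
			return
		}
		log.Printf("publish attempt %d not acknowledged: %v", attempt, err)
		time.Sleep(time.Second)
	}
	log.Fatal("giving up: the message was never acknowledged by the stream")
}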

@NoumanNawaz51
Author

We also noticed one more anomaly.

Node 1:

[977] 2024/12/13 02:34:38.710608 [INF] 10.1.18.182:6222 - rid:114 - Route connection created
[977] 2024/12/13 02:34:38.710782 [INF] 10.1.18.182:6222 - rid:113 - Router connection closed: Duplicate Route
[977] 2024/12/13 02:34:38.710890 [INF] 10.1.18.182:6222 - rid:114 - Router connection closed: Duplicate Route
[977] 2024/12/13 02:34:38.794241 [INF] 10.1.18.182:6222 - rid:116 - Route connection created
[977] 2024/12/13 02:34:38.794260 [INF] 10.1.18.182:6222 - rid:115 - Route connection created
[977] 2024/12/13 02:34:38.794458 [INF] 10.1.18.182:6222 - rid:116 - Router connection closed: Duplicate Route
[977] 2024/12/13 02:34:38.794506 [INF] 10.1.18.182:6222 - rid:115 - Router connection closed: Duplicate Route
[977] 2024/12/13 07:36:51.258131 [INF] JetStream cluster new stream leader for 'USERS > XXX'
[977] 2024/12/13 08:22:08.382213 [WRN] JetStream cluster stream 'USERS > XXX' has NO quorum, stalled

Node 2:

[985] 2024/12/13 02:34:37.777969 [ERR] Error trying to connect to route (attempt 32): dial tcp 10.1.18.182:6222: connect: connection refused
[985] 2024/12/13 02:34:37.975718 [INF] 10.1.18.182:58650 - rid:PRIORITY - Route connection created
[985] 2024/12/13 02:34:37.975779 [INF] 10.1.18.182:58644 - rid:104 - Route connection created
[985] 2024/12/13 02:34:37.992346 [INF] 10.1.18.182:58652 - rid:105 - Route connection created
[985] 2024/12/13 02:34:38.038365 [INF] 10.1.18.182:58656 - rid:106 - Route connection created
[985] 2024/12/13 02:34:38.713022 [INF] 10.1.18.182:6222 - rid:107 - Route connection created
[985] 2024/12/13 02:34:38.713191 [INF] 10.1.18.182:6222 - rid:107 - Router connection closed: Duplicate Route
[985] 2024/12/13 02:34:38.727564 [INF] 10.1.18.182:6222 - rid:108 - Route connection created
[985] 2024/12/13 02:34:38.727732 [INF] 10.1.18.182:6222 - rid:108 - Router connection closed: Duplicate Route
[985] 2024/12/13 02:34:38.749028 [INF] 10.1.18.182:6222 - rid:109 - Route connection created
[985] 2024/12/13 02:34:38.749232 [INF] 10.1.18.182:6222 - rid:109 - Router connection closed: Duplicate Route
[985] 2024/12/13 02:34:38.778529 [INF] 10.1.18.182:6222 - rid:110 - Route connection created
[985] 2024/12/13 02:34:38.778747 [INF] 10.1.18.182:6222 - rid:110 - Router connection closed: Duplicate Route
[985] 2024/12/13 08:21:52.515148 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXY_XDurable'
[985] 2024/12/13 08:21:53.353890 [INF] JetStream cluster new stream leader for 'USERS > XXX'
[985] 2024/12/13 08:22:08.948743 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXY_X_OOODurable'
[985] 2024/12/13 12:36:54.623229 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXY_X_196Durable'
[985] 2024/12/13 12:37:09.405746 [INF] JetStream cluster new stream leader for 'USERS > PRIORITY'

Node 3:

[989] 2024/12/13 02:34:38.751933 [INF] 10.1.18.124:57928 - rid:95 - Route connection created
[989] 2024/12/13 02:34:38.752117 [INF] 10.1.18.124:57928 - rid:95 - Router connection closed: Client Closed
[989] 2024/12/13 02:34:38.781442 [INF] 10.1.18.124:57938 - rid:96 - Route connection created
[989] 2024/12/13 02:34:38.781676 [INF] 10.1.18.124:57938 - rid:96 - Router connection closed: Duplicate Route
[989] 2024/12/13 02:34:38.799458 [INF] 10.1.18.112:60348 - rid:97 - Route connection created
[989] 2024/12/13 02:34:38.799489 [INF] 10.1.18.112:60336 - rid:98 - Route connection created
[989] 2024/12/13 02:34:38.799674 [INF] 10.1.18.112:60348 - rid:97 - Router connection closed: Client Closed
[989] 2024/12/13 02:34:38.799714 [INF] 10.1.18.112:60336 - rid:98 - Router connection closed: Client Closed
[989] 2024/12/13 07:37:09.271095 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXDNS_X_198Durable'
[989] 2024/12/13 07:37:09.321685 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXDNS_XDurable'
[989] 2024/12/13 07:37:09.408001 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXY_X_OOODurable'
[989] 2024/12/13 08:21:52.846826 [INF] JetStream cluster new stream leader for 'USERS > PRIORITY'
[989] 2024/12/13 08:22:08.774605 [INF] JetStream cluster new consumer leader for 'USERS > PRIORITY > PRIORITYY_XXX_XDurable'
[989] 2024/12/13 12:36:54.593148 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXY_XXX_DATABASEDurable'
[989] 2024/12/13 12:37:09.481499 [INF] JetStream cluster new consumer leader for 'USERS > XXX > XXXY_X_OOODurable'

@neilalexander
Member

Error trying to connect to route (attempt 32): dial tcp 10.1.18.182:6222: connect: connection refused

Do you have something filtering connections between your pods, or were any pods down/scaling/migrating/unreachable? It looks a lot like there's some kind of network condition taking place here.

@NoumanNawaz51
Author

Error trying to connect to route (attempt 32): dial tcp 10.1.18.182:6222: connect: connection refused

Do you have something filtering connections between your pods, or were any pods down/scaling/migrating/unreachable? It looks a lot like there's some kind of network condition taking place here.

Routine hardening is done. These are all VMs and none of them were down/scaling/migrating/unreachable.
Ping is working on all of them.

@NoumanNawaz51
Author

Error trying to connect to route (attempt 32): dial tcp 10.1.18.182:6222: connect: connection refused

Do you have something filtering connections between your pods, or were any pods down/scaling/migrating/unreachable? It looks a lot like there's some kind of network condition taking place here.

Routine hardening is done. These are all VMs and none of them were down/scaling/migrating/unreachable. Ping is working on all of them.

All NATS nodes, publishers, and subscribers are in the same VLAN; we are using SDN (Cisco ACI) as the control plane. All communication is Layer 2. Traffic leaves the physical hypervisors towards a switch; all links are 10 Gig and report no errors.

I am also attaching ping stats from leader <> follower nodes.

ping_stats-leader-node-to-leaf2.txt
ping_stats-leaf1-node-to-leader.txt
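
Since ICMP ping succeeding does not prove that the cluster (route) port accepts TCP connections, a small connectivity sketch that dials port 6222 on each peer may help; the addresses below are taken from the logs and may need adjusting:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Cluster (route) port of each peer, per the logs above.
	peers := []string{"10.1.18.182:6222", "10.1.18.124:6222", "10.1.18.112:6222"}

	for _, addr := range peers {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
		if err != nil {
			fmt.Printf("%s: dial failed: %v\n", addr, err)
			continue
		}
		fmt.Printf("%s: reachable in %v\n", addr, time.Since(start))
		conn.Close()
	}
}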

@NoumanNawaz51
Author

@derekcollison, what if we trade off fault tolerance by reducing replicas to 1 as an interim solution only, considering the suspected/potential undiscovered network issues? Once resolved, we will go back to replicas 3. Please note that file-based persistence will remain enabled.
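
For reference, a sketch of how the proposed interim change (replicas 3 -> 1) could be applied as a stream configuration update with nats.go; the stream name is a placeholder and this is only an illustration, not a recommendation:

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the current stream configuration (placeholder stream name).
	info, err := js.StreamInfo("PRIORITY_TRANS")
	if err != nil {
		log.Fatal(err)
	}

	// Interim step only: drop to a single replica; file-based storage stays as-is.
	cfg := info.Config
	cfg.Replicas = 1
	if _, err := js.UpdateStream(&cfg); err != nil {
		log.Fatal(err)
	}
	log.Println("stream scaled down to 1 replica")
}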

@derekcollison
Member

Synadia can help triage your network and NATS system setup. I do not believe you are a customer but you might want to consider that. We are happy to try to help out, and love our ecosystem of users, but for triaging and diagnosing complex infrastructure setups and NATS systems we prioritize customers. And it feels like we need to get on a video call to properly make progress.

@NoumanNawaz51
Author

Synadia can help triage your network and NATS system setup. I do not believe you are a customer but you might want to consider that. We are happy to try to help out, and love our ecosystem of users, but for triaging and diagnosing complex infrastructure setups and NATS systems we prioritize customers. And it feels like we need to get on a video call to properly make progress.

We are in contact with the team and are considering commercial support. In the meantime, we're working to identify the underlying cause in order to find a solution. I really appreciate the support and help with this.

Also, today we reproduced the same message-loss behavior in the test environment by introducing an artificial delay of a few seconds between nodes. We found that only 812 of the 890 published messages were received on NATS. I've attached logs for your reference.

nats-server3-MsgLost.log
nats-server1-MsgLost.log
nats-server2-MsgLost.log
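
A sketch of one way to narrow down where the missing messages went in such a test: publish the batch synchronously and count positive acknowledgements. Publishes that error or time out were never accepted by the stream, while acked-but-missing messages point at loss after storage. Written in Go with nats.go; the subject is a placeholder and the count simply matches the reported batch size:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream(nats.MaxWait(10 * time.Second))
	if err != nil {
		log.Fatal(err)
	}

	const total = 890 // matches the size of the reported test batch
	acked := 0
	for i := 0; i < total; i++ {
		// Placeholder subject; each publish waits for a stream acknowledgement.
		if _, err := js.Publish("priority.trans.test", []byte(fmt.Sprintf("msg-%d", i))); err != nil {
			log.Printf("publish %d not acknowledged: %v", i, err)
			continue
		}
		acked++
	}
	// If all publishes are acked but the consumer sees fewer messages, the loss
	// happened after storage; un-acked publishes were never accepted by the stream.
	fmt.Printf("published %d, acknowledged %d\n", total, acked)
}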

@derekcollison
Member

Once you are a customer we can jump on a video call and triage your whole system.
