Quorum queues can enter a state they cannot recover from due to a timeout #13827
Replies: 3 comments
-
Declaring quorum queues is quite an expensive operation, which is why we explicitly recommend against using quorum queues for high-churn scenarios like this one: https://www.rabbitmq.com/docs/quorum-queues#when-not-to-use-quorum-queues. With that in mind, your server specs look too low for this: 2 CPUs and only 10 GiB of EBS will likely need a bump for reliable operation anyway. I think you'll only get around 100 IOPS at that volume size, so you are probably ending up blocked on storage. I have never seen partition_parallel time out before (it uses a 60s timeout), so your system must be underprovisioned. I suggest you bump your server specs substantially and see how that goes. That said, we can keep this open as an issue, since we could handle the partition_parallel timeout better and avoid leaving a queue record and stuck queue servers behind.
-
@matthew-s-walker quorum queues were not designed for churn, which is exactly what your workload involves. Use non-replicated classic queues for these temporary queues and try 4 cores.
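For reference, switching the short-lived queues back to the classic type is usually a one-argument change in the client. A minimal sketch with the Python pika client (connection details and queue name are placeholders, not from this report):

```python
import pika

# Hypothetical connection details for illustration only.
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitmq.example.com")
)
channel = connection.channel()

# Declare the short-lived queue as a (non-replicated) classic queue.
# "classic" is the default type, but stating it explicitly avoids silently
# picking up a vhost-level default queue type of "quorum".
channel.queue_declare(
    queue="tmp.job.1234",          # hypothetical temporary queue name
    durable=False,
    auto_delete=True,              # removed once the last consumer goes away
    arguments={"x-queue-type": "classic"},
)
```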
-
@kjnilsson has identified something to address, so this has been moved back to an issue: #13828.
-
Describe the bug
Hi,
Firstly, I want to thank you for your work on RabbitMQ. It has been a rock-solid core component of our system for many years.
We recently migrated all of our queues to the quorum queue type, but have unfortunately encountered stability problems in our production environment.
Our system creates temporary queues, often up to 50 within a second or so, totalling roughly 20,000 per day.
After migrating, we found that within a few hours some queues (typically several created at similar times) will go into a state where:
The issue occurs either during/immediately after creation or within 2-3 minutes of creation.
We can reproduce the behaviour on the following versions of RabbitMQ, but the errors logged by the servers are different in at least 4.1.0:
On 4.0.1 and below, we receive various "badmatch"/"timeout" errors, which I can provide if wanted.
Our cluster setup is:
Typical cluster load is < 1000 total queues, < 500 total messages per second. The vast majority of messages are < 4KiB.
The issue reproduces with:
Here is an example of a queue going into a bad state with 4.1.0 (I am happy to provide logs from earlier versions as well):
server 0:
server 1:
server 2:
I have attempted to create a reproducer program, but unfortunately I'm currently struggling to trigger the issue with non-proprietary code.
The issue also does not reproduce by just creating huge numbers of queues; it seems very timing-dependent.
Reproduction steps
The script, which I'm unfortunately unable to release at the moment, attempts to simulate our system's behaviour:
Please note that the above is a significantly higher load than our production system is subjected to.
With this script I am usually able to get queues into this state within a few hours.
I was also unable to reproduce it under a local Kind cluster, so it may be necessary to simulate network and disk latency.
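For anyone who wants to experiment in the meantime, here is a rough, hypothetical sketch (not the proprietary script) of the general churn pattern using the Python pika client. Connection details, queue names, message sizes, and rates are placeholders approximating the workload described above:

```python
import time
import uuid

import pika

# Placeholder connection details; point this at a 3-node cluster.
params = pika.ConnectionParameters(host="rabbitmq.example.com")
connection = pika.BlockingConnection(params)
channel = connection.channel()

BODY = b"x" * 1024  # most of our messages are under 4 KiB

while True:
    names = []
    # Burst: declare a batch of temporary quorum queues in quick succession.
    for _ in range(50):
        name = f"tmp.repro.{uuid.uuid4()}"
        channel.queue_declare(
            queue=name,
            durable=True,  # quorum queues must be durable
            arguments={"x-queue-type": "quorum"},
        )
        names.append(name)

    # Publish a handful of small messages to each queue.
    for name in names:
        for _ in range(5):
            channel.basic_publish(exchange="", routing_key=name, body=BODY)

    # Tear the batch down again to create churn.
    for name in names:
        channel.queue_delete(queue=name)

    time.sleep(1)
```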
Expected behavior
Queues eventually recover from this state or the client receives an error/disconnect and can try again later.
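For context, if the broker reliably surfaced an error or disconnect when a declare cannot complete, the client-side handling we would expect to write looks roughly like this hedged pika sketch (hypothetical names, not code from our system):

```python
import time

import pika
import pika.exceptions


def declare_with_retry(connection, name, attempts=5):
    """Declare a quorum queue, retrying on a fresh channel if the broker reports a failure."""
    for attempt in range(attempts):
        channel = connection.channel()
        try:
            channel.queue_declare(
                queue=name,
                durable=True,
                arguments={"x-queue-type": "quorum"},
            )
            return channel
        except pika.exceptions.ChannelClosedByBroker:
            # Declare failed server-side; back off and retry later.
            time.sleep(2 ** attempt)
    raise RuntimeError(f"could not declare {name} after {attempts} attempts")
```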
Additional context
No response