Quorum queues can enter a state they cannot recover from due to a timeout #13827
Replies: 3 comments
-
Declaring quorum queues is quite an expensive operation, which is why we explicitly recommend against using quorum queues for high-churn scenarios like this one: https://www.rabbitmq.com/docs/quorum-queues#when-not-to-use-quorum-queues. With that in mind, your server specs look too low for this: 2 CPUs and only 10 GiB of EBS will likely need a bump for reliable operation anyway. I think you'll only get around 100 IOPS at that volume size, so you are probably ending up blocked on storage. I have never seen partition_parallel time out before (it uses a 60s timeout), so your system must be underprovisioned. I suggest you bump your server specs substantially and see how that goes. That said, we can keep this open as an issue, since we could handle the partition_parallel timeout better and avoid leaving a queue record and stuck queue servers behind.
-
@matthew-s-walker quorum queues were not designed for churn, which is exactly what your workload involves. Use non-replicated classic queues for these temporary queues and try 4 cores.
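For reference, switching the short-lived queues back to the classic type is usually a one-argument change in the client. A minimal sketch with the Python pika client (connection details and queue name are placeholders, not from this report):

```python
import pika

# Hypothetical connection details for illustration only.
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitmq.example.com")
)
channel = connection.channel()

# Declare the short-lived queue as a (non-replicated) classic queue.
# "classic" is the default type, but stating it explicitly avoids silently
# picking up a vhost-level default queue type of "quorum".
channel.queue_declare(
    queue="tmp.job.1234",          # hypothetical temporary queue name
    durable=False,
    auto_delete=True,              # removed once the last consumer goes away
    arguments={"x-queue-type": "classic"},
)
```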
-
@kjnilsson has identified something to address, so this has been moved back to an issue: #13828.
-
Describe the bug
Hi,
Firstly, I want to thank you for your work on RabbitMQ. It has been a rock-solid core component of our system for many years.
We recently migrated all of our queues to the quorum queue type, but have unfortunately encountered stability problems in our production environment.
Our system creates temporary queues, often up to 50 within a second or so, totalling roughly 20,000 per day.
After migrating, we found that within a few hours some queues (typically several created at similar times) will go into a state where:
The issue occurs either during/immediately after creation or within 2-3 minutes of creation.
We can reproduce the behaviour on the following versions of RabbitMQ, but the errors logged by the servers are different in at least 4.1.0:
On 4.0.1 and below, we receive various "badmatch"/"timeout" errors, which I can provide if wanted.
Our cluster setup is:
Typical cluster load is < 1000 total queues, < 500 total messages per second. The vast majority of messages are < 4KiB.
The issue reproduces with:
Here is an example of a queue going into a bad state with 4.1.0 (I am happy to provide logs from earlier versions as well):
server 0:
server 1:
server 2:
I have attempted to create a reproducer program, but unfortunately I'm currently struggling to trigger the issue with non-proprietary code.
The issue also does not reproduce by just creating huge numbers of queues; it seems very timing-dependent.
Reproduction steps
The script, which I'm unfortunately unable to release at the moment, attempts to simulate our system's behaviour:
Please note that the above is a significantly higher load than our production system is subjected to.
With this script I am usually able to get queues into this state within a few hours.
I was also unable to reproduce it under a local Kind cluster, so it may be necessary to simulate network and disk latency.
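For anyone who wants to experiment in the meantime, here is a rough, hypothetical sketch (not the proprietary script) of the general churn pattern using the Python pika client. Connection details, queue names, message sizes, and rates are placeholders approximating the workload described above:

```python
import time
import uuid

import pika

# Placeholder connection details; point this at a 3-node cluster.
params = pika.ConnectionParameters(host="rabbitmq.example.com")
connection = pika.BlockingConnection(params)
channel = connection.channel()

BODY = b"x" * 1024  # most of our messages are under 4 KiB

while True:
    names = []
    # Burst: declare a batch of temporary quorum queues in quick succession.
    for _ in range(50):
        name = f"tmp.repro.{uuid.uuid4()}"
        channel.queue_declare(
            queue=name,
            durable=True,  # quorum queues must be durable
            arguments={"x-queue-type": "quorum"},
        )
        names.append(name)

    # Publish a handful of small messages to each queue.
    for name in names:
        for _ in range(5):
            channel.basic_publish(exchange="", routing_key=name, body=BODY)

    # Tear the batch down again to create churn.
    for name in names:
        channel.queue_delete(queue=name)

    time.sleep(1)
```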
Expected behavior
Queues eventually recover from this state or the client receives an error/disconnect and can try again later.
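For context, if the broker reliably surfaced an error or disconnect when a declare cannot complete, the client-side handling we would expect to write looks roughly like this hedged pika sketch (hypothetical names, not code from our system):

```python
import time

import pika
import pika.exceptions


def declare_with_retry(connection, name, attempts=5):
    """Declare a quorum queue, retrying on a fresh channel if the broker reports a failure."""
    for attempt in range(attempts):
        channel = connection.channel()
        try:
            channel.queue_declare(
                queue=name,
                durable=True,
                arguments={"x-queue-type": "quorum"},
            )
            return channel
        except pika.exceptions.ChannelClosedByBroker:
            # Declare failed server-side; back off and retry later.
            time.sleep(2 ** attempt)
    raise RuntimeError(f"could not declare {name} after {attempts} attempts")
```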
Additional context
No response