`eacces` exception when a CLI operation is performed on a node in a booting container #11954

jeremyestein · 2024-08-08T12:02:07Z

Describe the bug

RabbitMQ version: 3.13.6 (Docker image e42cf20fe44e); the crash is also present in 4.0.0-beta.3-management (45deda312190).

Reproducible with Docker on Linux (Docker version 23.0.0, build e92dd87) and macOS (Docker version 25.0.3, build 4debf41).

I discovered this when using rabbitmq-diagnostics -q check_running as a container healthcheck, although any rabbitmq-diagnostics invocation triggers it, even --help.
If it runs shortly after container startup, the server crashes. Run it too early or too late and the crash does not reproduce; about 2-5 seconds after startup is the sweet spot.

Reproduction steps

1. Get the latest version (3.13.6): `docker pull rabbitmq:management`
2. Start the container: `docker run --name jes_test_rabbit rabbitmq:management`
3. Quickly, in another terminal, run: `docker exec jes_test_rabbit rabbitmq-diagnostics --help`

The rabbitmq server crashes basically every time if you get the timing right.
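
The reproduction can be scripted so the timing is repeatable. This is a minimal sketch, assuming bash and Docker on the PATH. `reproduce_cookie_race` is a hypothetical name, and the 3-second default is simply picked from the 2-5 second window mentioned above. It starts the container detached, so the crash shows up in `docker logs` rather than in a foreground terminal.

```shell
#!/usr/bin/env bash
# Hypothetical helper: start a fresh container, wait a few seconds, then run a
# CLI command while the node is still booting, to hit the race described above.
reproduce_cookie_race() {
  local name="${1:-jes_test_rabbit}"  # container name from the report
  local delay="${2:-3}"               # seconds; 2-5 s reportedly works best

  # Remove any leftover container with the same name; ignore "not found".
  docker rm -f "$name" >/dev/null 2>&1 || true
  # Start detached so the crash is visible via `docker logs`.
  docker run -d --name "$name" rabbitmq:management >/dev/null
  sleep "$delay"
  # In this image, docker exec runs as root by default, which triggers the problem.
  docker exec "$name" rabbitmq-diagnostics --help >/dev/null
}
```

For example: `reproduce_cookie_race jes_test_rabbit 3; docker logs jes_test_rabbit`.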

Docker container logs are attached.

rabbitmq_crash_20240808_STDERR.log
rabbitmq_crash_20240808_STDOUT.log

The stderr output contains the line `Crash dump is being written to: erl_crash.dump`, but I can't find this file anywhere in the stopped container.

Expected behavior

The server should continue to boot normally.
In particular, rabbitmq-diagnostics -q check_running should return an error status to say the server is not yet running, rather than crashing the server.

lukebakken · 2024-08-08T14:15:48Z · Maintainer

My guess is that the Erlang VM that runs the rabbitmq-diagnostics command creates the /var/lib/rabbitmq/.erlang.cookie file in such a way that the Erlang VM that starts RabbitMQ can't read it (eacces), although the rabbitmq-diagnostics VM should be running as the rabbitmq user as well.

We might spend time addressing this issue, but the quick solution is, of course, not to run CLI commands against a node that is still booting. Here is an example of how I wait for RabbitMQ to start in a Docker container:

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/main/.ci/ubuntu/gha-setup.sh#L84-L109
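
The linked script is not reproduced here. As a rough sketch of the same idea, assuming bash and Docker, the hypothetical function below polls `rabbitmq-diagnostics -q check_running` until it succeeds or a retry budget runs out. It passes `--user 999:999` to `docker exec`; per this thread, that is the rabbitmq user's uid:gid in the official image, and running as that user keeps the CLI from creating the Erlang cookie as root. The attempt count and interval defaults are arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a container until RabbitMQ reports it is running.
# Runs the check as 999:999 (the rabbitmq user) rather than docker exec's
# default of root, so /var/lib/rabbitmq/.erlang.cookie is never created as root.
wait_for_rabbitmq() {
  local name="$1"
  local attempts="${2:-30}"  # number of polls before giving up
  local interval="${3:-2}"   # seconds between polls
  local i
  for ((i = 1; i <= attempts; i++)); do
    if docker exec --user 999:999 "$name" rabbitmq-diagnostics -q check_running >/dev/null 2>&1; then
      return 0
    fi
    sleep "$interval"
  done
  echo "RabbitMQ in '$name' did not start after $attempts attempts" >&2
  return 1
}
```

For example: `docker run -d --name jes_test_rabbit rabbitmq:management && wait_for_rabbitmq jes_test_rabbit`.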

michaelklishin Aug 8, 2024
Maintainer

rabbitmqctl await_startup, of course, is the right thing to use and exists exactly for this kind of scenario.

But, I'm afraid, we have seen on the community Docker image side that file mounts take time, and in the case of the cookie, this can affect CLI tools even if you use await_startup and otherwise do everything properly.

A small delay might still be necessary in practice.
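
A minimal sketch combining these two suggestions, assuming bash and Docker: a short grace period, then `rabbitmqctl await_startup` run as the rabbitmq user (999:999, as identified elsewhere in this thread). The function name and the 5-second default are hypothetical; the thread does not recommend a specific delay.

```shell
#!/usr/bin/env bash
# Hypothetical helper: give file mounts a moment to settle, then wait for the
# node with rabbitmqctl await_startup. Runs as 999:999 so the CLI does not
# create the Erlang cookie as root.
await_rabbitmq_startup() {
  local name="$1"
  local grace="${2:-5}"  # seconds; arbitrary, tune for your environment
  sleep "$grace"
  docker exec --user 999:999 "$name" rabbitmqctl await_startup
}
```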

michaelklishin · 2024-08-08T14:15:59Z · Maintainer

@jeremyestein I don't see how the error in the logs has anything to do with rabbitmq-diagnostics.

Anyhow,

Error when reading /var/lib/rabbitmq/.erlang.cookie: eacces

is very specific. The shared secret (Erlang cookie) is usually mounted as a volume and on container boot, that does not happen instantly. There is nothing RabbitMQ-specific about it.

I don't have a better solution than "wait before you assume that the container is ready for any operations".

lukebakken · 2024-08-08T15:37:54Z · Maintainer

@jeremyestein please see this script which demonstrates a fix for the issue you report:

https://github.com/lukebakken/rabbitmq-server-11954/blob/main/repro.sh#L7

If the --user 999:999 argument to docker exec is removed, it is easy to reproduce this issue. As I suspected, the /var/lib/rabbitmq/.erlang.cookie file is owned by the root user.

I must admit I am a bit baffled by this behavior, because rabbitmq-diagnostics should always run as the rabbitmq (999:999) user, so I will investigate.
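
To confirm you are hitting this case, you can check who owns the cookie inside the container. This is a minimal sketch, assuming bash, Docker, and a `stat` in the image that supports `-c`; `check_cookie_owner` is a hypothetical name. Note that the container exits once the node crashes; `docker exec` then fails and the function returns 2, so run it while the container is still up.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic: print the owner of the Erlang cookie and warn if it
# is not rabbitmq:rabbitmq (e.g. root:root, the symptom described above).
check_cookie_owner() {
  local name="$1"
  local cookie=/var/lib/rabbitmq/.erlang.cookie
  local owner
  owner="$(docker exec "$name" stat -c '%U:%G' "$cookie")" || return 2
  echo "$owner"
  if [ "$owner" != "rabbitmq:rabbitmq" ]; then
    echo "warning: $cookie is owned by $owner; the rabbitmq user may be unable to read it" >&2
    return 1
  fi
}
```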

lukebakken Aug 8, 2024
Maintainer

Nope, I'm mistaken: when started via docker exec, the various rabbitmq* commands run as root within the container:

```
$ docker exec rabbitmq-server-11954 /bin/sh -c 'echo $HOME; ls -la $HOME; rabbitmq-diagnostics -q check_running & ps -ef'
/var/lib/rabbitmq
total 16
drwxrwxrwt 3 rabbitmq rabbitmq 4096 Aug  8 15:55 .
drwxr-xr-x 1 root     root     4096 Apr 25 22:51 ..
-r-------- 1 rabbitmq rabbitmq   20 Aug  8 00:00 .erlang.cookie
drwxr-xr-x 4 rabbitmq rabbitmq 4096 Aug  8 15:55 mnesia
UID          PID    PPID  C STIME TTY          TIME CMD
rabbitmq       1       0  0 15:55 ?        00:00:00 /bin/sh /opt/rabbitmq/sbin/rabbitmq-server
rabbitmq      20       1  4 15:55 ?        00:00:05 /opt/erlang/lib/erlang/erts-14.2.4/bin/beam.smp -W w -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -pc unicode -P 1048576 -t 5000000 -stbt db -zdbbl 128000 -sbwt none -sbwtdcpu none -sbwtdio none -B i -- -root /opt/erlang/lib/erlang -bindir /opt/erlang/lib/erlang/erts-14.2.4/bin -progname erl -- -home /var/lib/rabbitmq -- -pa  -noshell -noinput -s rabbit boot -boot start_sasl -syslog logger [] -syslog syslog_error_logger false -kernel prevent_overlapping_partitions false
rabbitmq      26      20  0 15:55 ?        00:00:00 erl_child_setup 1048576
rabbitmq      79      26  0 15:55 ?        00:00:00 /opt/erlang/lib/erlang/erts-14.2.4/bin/inet_gethost 4
rabbitmq      80      79  0 15:55 ?        00:00:00 /opt/erlang/lib/erlang/erts-14.2.4/bin/inet_gethost 4
rabbitmq      90       1  0 15:55 ?        00:00:00 /opt/erlang/lib/erlang/erts-14.2.4/bin/epmd -daemon
rabbitmq     155      26  0 15:55 ?        00:00:00 /bin/sh -s rabbit_disk_monitor
root        1165       0  0 15:58 ?        00:00:00 /bin/sh -c echo $HOME; ls -la $HOME; rabbitmq-diagnostics -q check_running & ps -ef
root        1172    1165  0 15:58 ?        00:00:00 /bin/sh /opt/rabbitmq/sbin/rabbitmq-diagnostics -q check_running
root        1175    1172  0 15:58 ?        00:00:00 /opt/rabbitmq/escript/rabbitmq-diagnostics -B -- -root /opt/erlang/lib/erlang -bindir /opt/erlang/lib/erlang/erts-14.2.4/bin -progname erl -- -home /var/lib/rabbitmq -- -noshell -boot no_dot_erlang -escript main rabbitmqctl_escript -hidden -run escript start -- -noshell -noinput -boot start_clean -kernel inet_dist_listen_min 35672 -kernel inet_dist_listen_max 35682 -- -extra /opt/rabbitmq/escript/rabbitmq-diagnostics -q check_running
root        1181    1175  0 15:58 ?        00:00:00 erl_child_setup 1048576
root        1182    1165  0 15:58 ?        00:00:00 ps -ef
RabbitMQ on node rabbit@8b5296ed07b0 is fully booted and running
```

So, in short: if one of these commands runs BEFORE RabbitMQ creates the Erlang cookie, the CLI creates the cookie as root, and the node then fails with eacces when it tries to read it. It seems that this code does not do what is intended, probably because it tries to "fix" the permissions in /var/lib/rabbitmq BEFORE rabbitmq-diagnostics creates the problematic cookie file:

https://github.com/docker-library/rabbitmq/blob/master/docker-entrypoint.sh#L4-L11

michaelklishin Aug 8, 2024
Maintainer

@lukebakken we should document this in the community Docker image README (since they don't have doc guides per se).

I have asked the team where we can document this in general. Perhaps a new section in the CLI tools guide?

michaelklishin Aug 8, 2024
Maintainer

I ended up adding a new doc section to the CLI tools guide rabbitmq/rabbitmq-website@5141c94
