Linux docker agent gets Unhealthy on adding linux integration. #2377

Open
amolnater-qasource opened this issue Mar 20, 2023 · 39 comments · Fixed by elastic/beats#35618
Labels
bug Something isn't working impact:medium Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@amolnater-qasource

Kibana version: 8.7 BC6 Kibana cloud environment

Host OS:
Ubuntu 22 ARM64

Build details:
VERSION: 8.7 BC6
BUILD: 61051
COMMIT: 04ef24287f26854ad99a46ae983854c6184717cb

Preconditions:

  1. 8.7 BC6 Kibana cloud environment should be available.
  2. Docker setup should be done.

Steps to reproduce:

  1. Install a docker agent using below command:
sudo docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=<Fleet Server host URL> \
--env FLEET_ENROLLMENT_TOKEN=<enrollment token> \
--rm docker.elastic.co/staging/elastic-agent:8.7.0-a7fb3750
  2. Add the linux integration to this policy and observe that the agent goes Unhealthy.

Note:

  • All linux integration datasets should be enabled.

Expected Result:
Docker agent should remain healthy on adding linux integration.

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-03-16.23-07-34.mp4

Logs:
elastic-agent-diagnostics-2023-03-16T17-37-58Z-00.zip
elastic-agent-diagnostics-2023-03-16T17-43-24Z-00.zip

@amolnater-qasource amolnater-qasource added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team impact:medium labels Mar 20, 2023
@amolnater-qasource
Author

@manishgupta-qasource Please review.

@manishgupta-qasource

Secondary review for this ticket is Done

@jlind23
Contributor

jlind23 commented Mar 20, 2023

@amolnater-qasource Looks like we faced some permission issues:
{"log.level":"error","@timestamp":"2023-03-16T17:33:32.349Z","message":"Error fetching data for metricset linux.pageinfo: error opening file: open /proc/pagetypeinfo: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"linux/metrics-default","type":"linux/metrics"},"log":{"source":"linux/metrics-default"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

@fearful-symmetry does it ring a bell or should I ask the obs-service team to look at this specific integration first?

@fearful-symmetry
Contributor

@jlind23 There could be a few issues here. The original issue mentions docker, so it's possible that we need to set hostfs correctly and ensure that /proc/pagetypeinfo is mounted into the container as /hostfs/proc/pagetypeinfo. It's also possible that /proc/pagetypeinfo does not exist on this particular OS at all.
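
To tell those cases apart, something like the following could be run once on the host and once in a shell inside the container. This is only a minimal sketch: the function name `proc_file_status` is made up for illustration and is not part of any Elastic tooling, and the default path is simply the one from the error above.

```shell
# Report whether a procfs file is missing, unreadable, or readable.
# proc_file_status is a hypothetical helper name; the default path is
# the file named in the metricbeat error message.
proc_file_status() {
    path="${1:-/proc/pagetypeinfo}"
    if [ ! -e "$path" ]; then
        echo "missing"
    elif ! head -n 1 "$path" >/dev/null 2>&1; then
        # The file exists but an actual read fails, e.g. permission denied.
        echo "unreadable"
    else
        echo "readable"
    fi
}
```

Calling `proc_file_status` with no argument checks /proc/pagetypeinfo; passing /hostfs/proc/pagetypeinfo checks the mounted copy instead. The helper attempts a real read rather than relying on the file mode, since the mode alone may not reflect what the container is allowed to do.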

@jlind23
Contributor

jlind23 commented Mar 20, 2023

@amolnater-qasource could you please check what @fearful-symmetry said? On a side note, was this particular docker distribution working before now?

@amolnater-qasource
Author

Hi @fearful-symmetry @jlind23

Thank you for looking into this issue.

We observed that /proc/pagetypeinfo is present on the VM we are using. Could you please confirm how we can check if it is mounted into the container?

Further, this issue was observed earlier during 8.5.0 SNAPSHOT testing and reported under #1454.
However, it was later working fine on 8.6 BC10.

Please let us know if we are missing anything here.
Thanks!

@jlind23
Contributor

jlind23 commented Mar 21, 2023

@amolnater-qasource can't you ssh into this container and see if it is mounted? Are you relying on a different base docker image?
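
One possible way to answer the "is it mounted" question from a shell inside the container is to look the target up in the mount table. The sketch below assumes the usual Linux layout where the second field of /proc/self/mounts is the mount point; `is_mount_point` is a hypothetical helper name.

```shell
# Succeed if TARGET is listed as a mount point in a mounts table.
# The table defaults to /proc/self/mounts; field 2 of each row is
# the mount point. is_mount_point is an illustrative name only.
is_mount_point() {
    target="$1"
    table="${2:-/proc/self/mounts}"
    awk -v t="$target" '$2 == t { found = 1 } END { exit !found }' "$table"
}
```

For example, `is_mount_point /hostfs/proc && echo mounted` would print `mounted` only when the bind mount is actually present in the container.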

@fearful-symmetry
Contributor

Ah, brain skipped a beat, just noticed that it's actually a permissions error: /proc/pagetypeinfo: permission denied

I'm fairly certain that pagetypeinfo is one of those procfs files that's going to be the same as the host from within the container, which means it's not strictly necessary to mount it, and you can read from /proc/pagetypeinfo from within the container to monitor the host, but the permission error is a bit odd. Since the original issue mentions docker, my assumption is that there's a docker setup issue, and the metricbeat instance running in docker somehow doesn't have the proper permissions, or isn't running as root.

@amolnater-qasource
Author

Hi @jlind23

For testing the docker agent we followed below steps:

  1. Setup Ubuntu 22.04 ARM architecture VM.
  2. Installed Docker on the VM.
  3. Directly ran below command:
sudo docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=<Fleet Server host URL> \
--env FLEET_ENROLLMENT_TOKEN=<enrollment token> \
--rm docker.elastic.co/staging/elastic-agent:8.7.0-a7fb3750

So, as per our understanding we aren't creating any new container for this and we are using this docker image for installing an agent.

Please let us know if we are missing anything here.
Thanks

@jlind23
Contributor

jlind23 commented Mar 23, 2023

@fearful-symmetry would be great to have your eyes on this as soon as you have time to make sure this is not a regression we introduced in metricbeat.

@amolnater-qasource
Author

Hi Team,

We have revalidated this issue on latest 8.8 BC6 Kibana cloud environment and found it still reproducible.

Observations:

  • Linux docker agent gets Unhealthy on adding linux integration.

Screenshot:
image

Logs:
elastic-agent-diagnostics-2023-05-19T08-18-10Z-00.zip

Build details:

VERSION: 8.8.0 BC6 Kibana cloud environment
BUILD: 63115
COMMIT: a4c256b39f7d1ee34abe61109a817ec7f5329009
Docker artifact: --rm docker.elastic.co/staging/elastic-agent:8.8.0-375abdf7 

Please let us know if anything else is required from our end.

Thanks!

@cmacknz
Member

cmacknz commented May 23, 2023

This is a new error in the system metrics input:

- id: system/metrics-default
  state:
    state: 2
    message: 'Healthy: communicating with pid ''32'''
    units:
      ? unittype: 0
        unitid: system/metrics-default-system/metrics-system-aa6c87f0-f61c-11ed-b6d2-0b368c0c212a
      : state: 4
        message: '[failed to reloading inputs: 2 errors: Error creating runner from
          config: 1 error: error connecting to dbus: dial unix /var/run/dbus/system_bus_socket:
          connect: no such file or directory; Error creating runner from config: 1
          error: error connecting to dbus: error getting connection to system bus:
          dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory]'

@cmacknz
Member

cmacknz commented May 23, 2023

@amolnater-qasource can you try to reproduce? I want to see if this error happens every time or is intermittent to assess the severity of the problem.

@fearful-symmetry
Contributor

@cmacknz normally that error would be thrown by the linux/users or linux/services metricsets on systems that don't support dbus. Do we know if this is running on a supported OS?

@amolnater-qasource
Author

Hi @cmacknz

Thank you for looking into this.

The issue is reproducible every time the linux integration with all datasets enabled is added to the agent policy.

Agents:
Docker Agent

Host OS's:

  • Ubuntu 22 ARM64
  • Container Optimized OS
    image

Build details:

VERSION: 8.8 BC8 Kibana cloud environment
BUILD: 63142
COMMIT: 2973fcc10d985e4ab94e5eeef976aad0046c6cce

Logs:
elastic-agent-diagnostics-2023-05-24T06-05-09Z-00.zip

Please let us know if anything else is required from our end.
cc: @fearful-symmetry

Thanks!

@cmacknz
Member

cmacknz commented May 24, 2023

@fearful-symmetry yes this is supported, we support both Ubuntu 22 and Google container optimized OS on ARM64 per https://www.elastic.co/support/matrix

As of 7.16+ releases, we support aarch64 on Linux with the same set of distributions as x86_64

Raising priority, adding to the next sprint since this happens every time.

@fearful-symmetry
Contributor

Going to look into this more tomorrow, but what I think is happening is that because we're running in a container, the dbus socket for the host isn't reachable inside the container. Pretty sure there's an environment variable we can set that's used by the coreos libraries. I don't think this is documented anywhere, which is a bit of a problem.
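
A rough way to confirm that theory from inside the container would be to test whether the path from the error message is a Unix socket at all. This is a sketch only; `dbus_socket_status` is a hypothetical name, and the default path is the one reported in the error.

```shell
# Report whether a path is a Unix socket, some other kind of file,
# or absent. dbus_socket_status is an illustrative helper name.
dbus_socket_status() {
    path="${1:-/var/run/dbus/system_bus_socket}"
    if [ -S "$path" ]; then
        echo "socket"
    elif [ -e "$path" ]; then
        echo "not-a-socket"
    else
        echo "missing"
    fi
}
```

If this prints `missing` inside the container but `socket` on the host, the socket simply was not mounted in.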

@jlind23
Contributor

jlind23 commented May 25, 2023

Thanks @fearful-symmetry for looking into this. If your assumption is right, a doc PR would definitely be enough for this.

@fearful-symmetry
Contributor

fearful-symmetry commented May 26, 2023

@amolnater-qasource Can you try:

  • Mounting the /var/run/dbus/system_bus_socket socket into the agent container under test, just as /hostfs/var/run/dbus/system_bus_socket
  • Set the DBUS_SYSTEM_BUS_ADDRESS environment variable to the above mount point
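
The two steps above can be sketched as a small helper that prints the matching docker run flags. The helper name `dbus_docker_flags` is made up for illustration, and the `unix:path=` prefix is the standard D-Bus address form for a Unix socket; whether the agent accepts that exact form here is an assumption to verify.

```shell
# Print the docker run flags that bind-mount the host dbus socket under
# /hostfs and point DBUS_SYSTEM_BUS_ADDRESS at the mounted copy, one
# token per line. dbus_docker_flags is a hypothetical helper name.
dbus_docker_flags() {
    host_socket="${1:-/var/run/dbus/system_bus_socket}"
    container_socket="/hostfs${host_socket}"
    printf '%s\n' \
        "--volume" "${host_socket}:${container_socket}" \
        "--env" "DBUS_SYSTEM_BUS_ADDRESS=unix:path=${container_socket}"
}
```

Since none of the tokens contain spaces, the output could be spliced into a command as `docker run $(dbus_docker_flags) ...` alongside the usual enrollment flags.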

@amolnater-qasource
Author

Hi @fearful-symmetry

Thank you for sharing the details over slack and helping us revalidate this.

Please find below details for the attempted test:
On running below command:

sudo docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=************************** \
--env FLEET_ENROLLMENT_TOKEN=************************** \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--rm docker.elastic.co/beats/elastic-agent:8.9.0-3cc641a9-SNAPSHOT

We observed that the installed agent is Unhealthy and had below errors:
image
image
image
image

Agent Logs:
elastic-agent-diagnostics-2023-05-30T17-36-46Z-00.zip

Please let us know if anything else is required from our end.
Thanks!

@fearful-symmetry
Contributor

Update while I look into this: I think there's some kind of formatting issue with the env var happening between the --env command in docker, or I'm just confused by how the dbus library works. Will investigate further.

@fearful-symmetry
Contributor

Alright, found the issue, extremely dumb bug. There's two different versions of the godbus/dbus library at work, one we're using directly and another that was imported by another library we're using. They use two different formats for the DBUS_SYSTEM_BUS_ADDRESS, so either format would just break at different points.

Fix is here: elastic/beats#35618
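
To illustrate the kind of mismatch described, here is a shell sketch that accepts either spelling and returns a bare socket path. It is not the Go change in the linked PR. It also assumes the two formats are a bare path and a `unix:path=` prefixed address, which this comment does not actually spell out. `dbus_socket_path` is a hypothetical name.

```shell
# Strip an optional "unix:path=" prefix so both spellings of the
# address yield the same bare socket path.
# ASSUMPTION: the two formats are "/some/path" and "unix:path=/some/path".
dbus_socket_path() {
    addr="$1"
    case "$addr" in
        unix:path=*) printf '%s\n' "${addr#unix:path=}" ;;
        *)           printf '%s\n' "$addr" ;;
    esac
}
```

Normalising once at the point where the variable is read means later code only ever sees one format, which is roughly the property that was missing when two library versions parsed the variable independently.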

@amolnater-qasource
Author

Hi Team,

We have revalidated this issue on latest 8.9.0 BC3 Kibana cloud environment and found it still reproducible.

Observations:

  • Linux docker agent gets Unhealthy on adding linux integration.

Build details:
VERSION: 8.9.0 BC3
BUILD: 64584
COMMIT: fc463b96275c55dc44524f79f617b0026b7f8667

docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=***********************3 \
--env FLEET_ENROLLMENT_TOKEN=************************== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0

Screen Recording:

94504372f98a.-.Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-07-11.10-56-46.mp4
Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-07-11.11-06-29.mp4

Logs:
elastic-agent-diagnostics-2023-07-11T05-37-28Z-00.zip

Hence, we are reopening this issue.
Thanks!

@amolnater-qasource amolnater-qasource removed the QA:Ready For Testing Code is merged and ready for QA to validate label Jul 11, 2023
@pierrehilbert
Contributor

@fearful-symmetry could you please have a look?

@cmacknz
Member

cmacknz commented Jul 11, 2023

Seems like this is dbus again:

- id: system/metrics-default
  state:
    state: 2
    message: 'Healthy: communicating with pid ''31'''
    units:
      ? unittype: 0
        unitid: system/metrics-default-system/metrics-system-331804e9-c84e-40e0-beae-805672378572
      : state: 4
        message: '[failed to reload inputs: 2 errors: Error creating runner from config:
          1 error: error connecting to dbus: dial unix /var/run/dbus/system_bus_socket:
          connect: no such file or directory; Error creating runner from config: 1
          error: error connecting to dbus: error getting connection to system bus:
          dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory]'
      ? unittype: 0

elastic/beats#35618 was supposed to fix this I believe.

@fearful-symmetry
Contributor

fearful-symmetry commented Jul 11, 2023

@amolnater-qasource is that the exact docker command? If you're using the dbus-related metricsets you need to add --volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \ as well as set the DBUS_SYSTEM_BUS_ADDRESS env var to /hostfs/var/run/dbus/system_bus_socket.

I suspect this isn't well documented; going to hunt around the system docs and see if I can find where we should put this.

@fearful-symmetry
Contributor

Alright, tested with

docker run --volume=$(pwd)/metricbeat.reference.yml:/usr/share/metricbeat/metricbeat.yml \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly \
--mount type=bind,source=/,target=/hostfs,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--net=host docker.elastic.co/beats/metricbeat:8.9.0-SNAPSHOT -e --system.hostfs=/hostfs

Seems to work fine.

@jlind23
Contributor

jlind23 commented Jul 12, 2023

Closing this as fixed then, and I approved your doc PR.
@amolnater-qasource can we make sure the test case is updated with this command?

@jlind23 jlind23 closed this as completed Jul 12, 2023
@amolnater-qasource
Author

amolnater-qasource commented Jul 12, 2023

Hi @fearful-symmetry @jlind23

Thank you for the confirmation and adding the docs.

We have re-attempted to install agent on docker with below updated commands:
First:

docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://49a4c592f08bxxxxxxxxxxxxxxxxxp.cloud.es.io:443 \
--env FLEET_ENROLLMENT_TOKEN=RUlOcFE0a0JjexxxxxxxxxxxxxxxxxxxxU4Nk82NVZWZw== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0

Second:

docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://49axxxxxxxxxxxxxxxxxxxoud.es.io:443 \
--env FLEET_ENROLLMENT_TOKEN=RUlOcFE0a0JjeVBfekw4dEFxxxxxxxxxxxx2NVZWZw== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly  \
--mount type=bind,source=/,target=/hostfs,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0
  • However, the agent remained Unhealthy at our end with linux metrics enabled.

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-07-12.10-09-22.mp4

For troubleshooting we also tried adding below config to linux integration.
image

However, the agent still remained Unhealthy.

Logs:
elastic-agent-diagnostics-2023-07-12T04-46-06Z-00.zip

Please let us know if we are missing anything here.

Thank you

@fearful-symmetry
Contributor

A little baffled by this, since I'm seeing tons of errors that seem to suggest that the hostfs flag is set, but the actual directory isn't mounted in:
network io counters: open /hostfs/proc/net/dev: no such file or directory
disk io counters: open /hostfs/proc/diskstats
disk io counters: open /hostfs/proc/diskstats: no such file or directory
error getting entropy: error reading from random: open /hostfs/proc/sys/kernel/random/entropy_avail: no such file or directory

We might want to take care to create the policy with hostfs set first, then run the agent in docker with the proper mounts, and see what happens, or at least collect another diagnostic bundle if it continues to not work.
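
For the next diagnostic bundle, a filter along these lines might make the missing mounts easier to spot. It is a sketch that assumes the log lines keep the `open /hostfs...: no such file or directory` shape shown above; `missing_hostfs_paths` is a hypothetical name.

```shell
# Read log lines on stdin and print each distinct /hostfs path that
# was reported as "no such file or directory".
# missing_hostfs_paths is an illustrative helper name.
missing_hostfs_paths() {
    grep 'no such file or directory' \
        | grep -o 'open /hostfs[^: ]*' \
        | sed 's/^open //' \
        | sort -u
}
```

Feeding it the agent logs from the bundle gives a deduplicated list of paths to compare against the `--mount` flags that were actually passed to docker run.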

@amolnater-qasource
Author

Hi @fearful-symmetry

Thank you for looking into this again.
Yes, we have added hostfs to the policy first and then run the agent in docker.

  • Agents remained Unhealthy throughout with Linux integration.

To collect the logs, we reattempted with two different sets of commands for running the agent:
First Command:

docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://49axxxxxxxxxxxxxxxxxxxoud.es.io:443 \
--env FLEET_ENROLLMENT_TOKEN=RUlOcFE0a0JjeVBfekw4dEFxxxxxxxxxxxx2NVZWZw== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly  \
--mount type=bind,source=/,target=/hostfs,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0

Debug Logs for this agent are:
elastic-agent-diagnostics-2023-07-13T04-28-36Z-00.zip

Second Command:

docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://49a4c592f08bxxxxxxxxxxxxxxxxxp.cloud.es.io:443 \
--env FLEET_ENROLLMENT_TOKEN=RUlOcFE0a0JjexxxxxxxxxxxxxxxxxxxxU4Nk82NVZWZw== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0

Agent logs for this agent are:
elastic-agent-diagnostics-2023-07-13T06-24-57Z-00.zip

Screenshot:
image

Please let us know if we are missing anything here.

Thanks!

@fearful-symmetry
Contributor

Ah, there we go:

{"log.level":"error","@timestamp":"2023-07-13T04:19:27.011Z","message":"Error creating runner from config: 1 error: error connecting to dbus: error in Hello: An AppArmor policy prevents this sender from sending this message to this recipient; type=\"method_call\", sender=\"(null)\" (inactive) interface=\"org.freedesktop.DBus\" member=\"Hello\" error name=\"(unset)\" requested_reply=\"0\" destination=\"org.freedesktop.DBus\" (bus)","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.origin":{"file.line":138,"file.name":"cfgfile/list.go"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"centralmgmt","ecs.version":"1.6.0"}

It looks like AppArmor is stopping the dbus Hello message, which isn't something I think I've ever seen before. @amolnater-qasource can you tell me precisely what ubuntu release this is so I can try and document some kind of workaround? The output of uname -a should be enough.
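
Since this thread has now seen two different dbus failures, a small classifier like the one below could help when triaging future bundles. It is a sketch that matches only the two message fragments quoted in this issue; `classify_dbus_error` and its output labels are made up for illustration.

```shell
# Classify a dbus-related error message from the agent logs.
# Only the two fragments seen in this issue are recognised.
# classify_dbus_error and its labels are hypothetical.
classify_dbus_error() {
    case "$1" in
        *"AppArmor policy prevents"*)
            echo "apparmor-denied" ;;
        *"system_bus_socket: connect: no such file or directory"*)
            echo "socket-missing" ;;
        *)
            echo "other" ;;
    esac
}
```

A `socket-missing` result points at the docker run mounts, while `apparmor-denied` means the socket is reachable but the host policy rejected the connection.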

@amolnater-qasource
Author

Hi @fearful-symmetry

Please find below exact host details:
image

Further, it is deployed on AWS as Ubuntu 22.04 with ARM64 architecture.
image

Please let us know if anything else is required from our end.

Thanks!

@fearful-symmetry
Contributor

Huzzah, was able to reproduce this. Interestingly, this only seems to happen with docker, which is probably why we haven't seen this before.

@fearful-symmetry
Contributor

So, we can temporarily work around this by adding --security-opt apparmor=unconfined to the beginning of the docker run:

docker run --security-opt apparmor=unconfined \
--volume=$(pwd)/metricbeat.yml:/usr/share/metricbeat/metricbeat.yml \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly \
--mount type=bind,source=/,target=/hostfs,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--net=host docker.elastic.co/beats/metricbeat:8.9.0-SNAPSHOT -e --system.hostfs=/hostfs

This doesn't seem like the best solution, and I'd like to come up with a more targeted apparmor profile.
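
Before applying the unconfined workaround, it may be worth checking on the host whether AppArmor is active at all, since hosts without it should not need the flag. The sketch below reads the kernel module parameter file, which contains `Y` when AppArmor is enabled; `apparmor_enabled` is a hypothetical helper name.

```shell
# Succeed if the AppArmor kernel module reports itself as enabled.
# The flag file contains "Y" when enabled; it is absent on kernels
# built without AppArmor. apparmor_enabled is an illustrative name.
apparmor_enabled() {
    flag_file="${1:-/sys/module/apparmor/parameters/enabled}"
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "Y" ]
}
```

A wrapper script could then add `--security-opt apparmor=unconfined` only when `apparmor_enabled` succeeds, keeping the broader permission off hosts where it has no effect.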

@jlind23
Contributor

jlind23 commented May 27, 2024

@amolnater-qasource Is this still an issue you face?

@amolnater-qasource
Author

Hi @jlind23

We have revalidated this issue on latest 8.14.0 BC5 kibana cloud environment and found it still reproducible with the actual command:

docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://<url>cloud.com:443 \
--env FLEET_ENROLLMENT_TOKEN=Q<token>9DUQ== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--rm docker.elastic.co/staging/elastic-agent:8.14.0-eeda34a5

Observations:

  • Linux docker agent gets Unhealthy on adding linux integration.

Agent Logs:
elastic-agent-diagnostics-2024-05-28T08-53-03Z-00.zip

Screenshot:
image

We were expecting this to be fixed as per #2377 (comment)

Please let us know if anything else is required from our end.

Thanks!

@jlind23 jlind23 added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label May 28, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@cmacknz
Member

cmacknz commented May 28, 2024

Yes this is the same error originally detected in #2377 (comment).

7 participants