Switch to using agentbeat versus all the other beats from the beats repository #4516

Merged
merged 19 commits into from
Apr 18, 2024


blakerouse
Contributor

@blakerouse blakerouse commented Apr 4, 2024

What does this PR do?

Switches to using the agentbeat that comes from this PR: elastic/beats#38183

Why is it important?

Reduces the size of the shipped Elastic Agent by almost 50%

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

How to test this PR locally

$ EXTERNAL="true" SNAPSHOT="true" mage package

Related issues

Logs

Component logs remain correct: they still show metricbeat or filebeat even though the binary actually being executed is agentbeat metricbeat or agentbeat filebeat, respectively.
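
For illustration, the `agentbeat <beat>` invocation style described above can be sketched as a small dispatcher that reads the beat name from the first argument. This is only a sketch under stated assumptions: `resolveBeat` and `knownBeats` are hypothetical names, and the real command wiring lives in elastic/beats and is not shown in this thread.

```go
package main

import "fmt"

// knownBeats is an illustrative list only; the real set of beats bundled
// into agentbeat is defined in the beats repository.
var knownBeats = map[string]bool{"filebeat": true, "metricbeat": true}

// resolveBeat splits an argument vector such as
// ["agentbeat", "metricbeat", "-E", "logging.level=info"] into the beat
// name and the remaining arguments that would be passed on to that beat.
func resolveBeat(argv []string) (string, []string, error) {
	if len(argv) < 2 {
		return "", nil, fmt.Errorf("usage: agentbeat <beat> [flags]")
	}
	name := argv[1]
	if !knownBeats[name] {
		return "", nil, fmt.Errorf("unknown beat %q", name)
	}
	return name, argv[2:], nil
}

func main() {
	// A fixed example argument vector, modelled on the command lines in this PR.
	argv := []string{"agentbeat", "metricbeat", "-E", "logging.level=info"}
	beat, rest, err := resolveBeat(argv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("would run %s with %d arguments\n", beat, len(rest))
}
```

Keeping the beat name as an explicit argument is what lets the logs keep reporting metricbeat or filebeat while a single binary is shipped.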

@blakerouse blakerouse added Team:Elastic-Agent Label for the Agent team backport-skip labels Apr 4, 2024
@blakerouse blakerouse self-assigned this Apr 4, 2024
@blakerouse blakerouse marked this pull request as ready for review April 16, 2024 13:01
@blakerouse blakerouse requested a review from a team as a code owner April 16, 2024 13:02
@blakerouse blakerouse requested review from AndersonQ and pchila April 16, 2024 13:02
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@pierrehilbert pierrehilbert requested review from michalpristas and cmacknz and removed request for AndersonQ April 16, 2024 13:26
@blakerouse
Contributor Author

SonarQube is wrong: all the changes in this PR are fully covered by unit tests. I have confirmed this by reviewing my own coverage report. I assume it's counting the magefile contents in its percentage.

@blakerouse
Contributor Author

CI is broken, so I am unable to get a green CI run on this PR, but I do see in the CI jobs that the elastic-agent was built with agentbeat. That is a good sign: the build picked up agentbeat, and CI was attempting to test elastic-agent with it.

@cmacknz
Member

cmacknz commented Apr 16, 2024

LGTM pending the tests passing

@blakerouse
Contributor Author

@cmacknz I had to make some changes to allow the elastic-agent-cloud image to be built. I have tested that it works and that it allows filebeat and metricbeat to be executed from their original paths, all transparently running agentbeat.

https://github.com/elastic/elastic-agent/pull/4516/files#diff-9e8bcd1359f1e06bbdf2c89db711c6c400ed4142cf9764ff03e6e963013122e5R284

It would probably be good to get someone from cloud to review this change, but I don't know who to ask.

@cmacknz
Member

cmacknz commented Apr 16, 2024

Best thing to do for cloud is to test it by creating a deployment from your branch. https://github.com/elastic/elastic-agent?tab=readme-ov-file#testing-on-elastic-cloud

Make sure this branch has eca5bc7 or it will fail for unrelated reasons. You may also see an unrelated fleet-server issue, which won't be fixed until the agentbeat CI fix is merged.

@cmacknz
Member

cmacknz commented Apr 17, 2024

buildkite test it

@cmacknz
Member

cmacknz commented Apr 17, 2024

The runtime leak tests are failing and might need some adjustment: they try to get the beat stats directly from each component, and now all the components are the same process.

https://github.com/elastic/elastic-agent/blob/main/testing/integration/agent_long_running_leak_test.go

@blakerouse
Contributor Author

Very strange. I would have expected that to work the same, as I tried to ensure it didn't adjust any naming or paths.

Trying to figure out what the issue is. Working on it.

@cmacknz
Member

cmacknz commented Apr 17, 2024

I see this in the logs, not sure if it's the cause but it's going to break the monitoring data:

"component":{"binary":"agentbeat","dataset":"elastic_agent.agentbeat","id":"log-default","type":"log"}

The dataset elastic_agent.agentbeat doesn't exist and Fleet won't create index permissions.

I can see several "cannot index event" errors I believe are related:

{"log.level":"warn","@timestamp":"2024-04-17T17:46:53.872Z","message":"Cannot index event (status=403): dropping event! Enable debug logs to view the event and cause.","component":{"binary":"agentbeat","dataset":"elastic_agent.agentbeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"ecs.version":"1.6.0","log.logger":"elasticsearch","log.origin":{"file.line":429,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).bulkCollectPublishFails"},"service.name":"filebeat","ecs.version":"1.6.0"}

@blakerouse
Contributor Author

blakerouse commented Apr 17, 2024

I am seeing something a little different:

{"log.level":"info","@timestamp":"2024-04-17T18:46:52.217-0400","message":"Metrics endpoint listening on: /tmp/elastic-agent/Hk6rvk9TDibMPcDvpl0jkLE-qDsHWVYL.sock (configured: unix:///tmp/elastic-agent/Hk6rvk9TDibMPcDvpl0jkLE-qDsHWVYL.sock)","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"log.logger":"api","log.origin":{"file.line":71,"file.name":"api/server.go","function":"github.com/elastic/beats/v7/libbeat/api.(*Server).Start.func1"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

@cmacknz
Member

cmacknz commented Apr 18, 2024

Something is definitely up with the monitoring sockets. There are 4 unique PIDs but only 3 unix sockets:

❯ sudo elastic-agent status --output=full
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: 294ec647-1044-47ca-8c64-714f9b48dcbb
   │  ├─ version: 8.14.0
   │  └─ commit: d44fe9fac2a56d529ddf30833fd2b70535588249
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '75187'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '75186'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '75188'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '75185'
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ system/metrics-default-unique-system-metrics-input
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT
~/Downloads/builds/elastic-agent-8.14.0-SNAPSHOT-darwin-aarch64 ··························· 08:38:38 PM
❯ sudo ls /Library/Elastic/Agent/data/tmp/
Hk6rvk9TDibMPcDvpl0jkLE-qDsHWVYL.sock   iThI_df0cBKC6YUNGGlKscMkOfz3FBH3.sock
akSPbdqgaHaTY0_J01-dsfYK6JpMz2zn.sock

@cmacknz
Member

cmacknz commented Apr 18, 2024

❯ ps aux | rg beat
cmackenzie       79016   0.0  0.0 410066224     48 s052  S+    8:40PM   0:00.00 rg beat
root             75188   0.0  0.4 412278336 145824   ??  S     8:36PM   0:00.41 /Library/Elastic/Agent/data/elastic-agent-8.14.0-SNAPSHOT-d44fe9/components/agentbeat metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///Library/Elastic/Agent/data/tmp/akSPbdqgaHaTY0_J01-dsfYK6JpMz2zn.sock -E path.data=/Library/Elastic/Agent/data/elastic-agent-8.14.0-SNAPSHOT-d44fe9/run/http/metrics-monitoring
root             75187   0.0  0.4 412295936 147472   ??  S     8:36PM   0:00.44 /Library/Elastic/Agent/data/elastic-agent-8.14.0-SNAPSHOT-d44fe9/components/agentbeat metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///Library/Elastic/Agent/data/tmp/Hk6rvk9TDibMPcDvpl0jkLE-qDsHWVYL.sock -E path.data=/Library/Elastic/Agent/data/elastic-agent-8.14.0-SNAPSHOT-d44fe9/run/beat/metrics-monitoring
root             75186   0.0  0.4 412243552 141776   ??  S     8:36PM   0:00.54 /Library/Elastic/Agent/data/elastic-agent-8.14.0-SNAPSHOT-d44fe9/components/agentbeat filebeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${FILEBEAT_GOGC:100} -E filebeat.config.modules.enabled=false -E path.data=/Library/Elastic/Agent/data/elastic-agent-8.14.0-SNAPSHOT-d44fe9/run/filestream-monitoring
root             75185   0.0  0.4 412237216 149696   ??  S     8:36PM   0:00.98 /Library/Elastic/Agent/data/elastic-agent-8.14.0-SNAPSHOT-d44fe9/components/agentbeat metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///Library/Elastic/Agent/data/tmp/iThI_df0cBKC6YUNGGlKscMkOfz3FBH3.sock -E path.data=/Library/Elastic/Agent/data/elastic-agent-8.14.0-SNAPSHOT-d44fe9/run/system/metrics-default

The filestream monitoring isn't getting a monitoring unix socket, and it appears as a component in the output of elastic-agent status, which is causing the leak detection test to fail:

root             75186   0.0  0.4 412243552 141776   ??  S     8:36PM   0:00.54 /Library/Elastic/Agent/data/elastic-agent-8.14.0-SNAPSHOT-d44fe9/components/agentbeat filebeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${FILEBEAT_GOGC:100} -E filebeat.config.modules.enabled=false -E path.data=/Library/Elastic/Agent/data/elastic-agent-8.14.0-SNAPSHOT-d44fe9/run/filestream-monitoring
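
The mismatch between PIDs and sockets can be checked mechanically by looking for the `-E http.host=unix://...` setting in each process's arguments. The following is a minimal sketch, not part of the PR: `monitoringSocket` is a hypothetical helper, and the argument lists are shortened versions of the `ps` output above.

```go
package main

import (
	"fmt"
	"strings"
)

// monitoringSocket scans a beat's argument list for "-E http.host=unix://<path>"
// and returns the socket path. The boolean is false when no such setting exists.
func monitoringSocket(args []string) (string, bool) {
	const prefix = "http.host=unix://"
	for i := 0; i+1 < len(args); i++ {
		if args[i] == "-E" && strings.HasPrefix(args[i+1], prefix) {
			return strings.TrimPrefix(args[i+1], prefix), true
		}
	}
	return "", false
}

func main() {
	// Shortened command lines based on the process listing in this thread.
	metricbeat := strings.Fields("agentbeat metricbeat -E http.enabled=true -E http.host=unix:///Library/Elastic/Agent/data/tmp/akSPbdqgaHaTY0_J01-dsfYK6JpMz2zn.sock")
	filebeat := strings.Fields("agentbeat filebeat -E logging.level=info -E filebeat.config.modules.enabled=false")

	for _, argv := range [][]string{metricbeat, filebeat} {
		path, ok := monitoringSocket(argv)
		fmt.Printf("%s: has socket=%t path=%q\n", argv[1], ok, path)
	}
}
```

Run against the four command lines above, a check like this would flag only the filebeat process as lacking a monitoring socket.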

@cmacknz
Member

cmacknz commented Apr 18, 2024

I still see attempts to index to the non-existent elastic_agent.agentbeat datastream:

{"log.level":"warn","@timestamp":"2024-04-18T00:44:26.369Z","message":"Cannot index event (status=403): dropping event! Enable debug logs to view the event and cause.","component":{"binary":"agentbeat","dataset":"elastic_agent.agentbeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"elasticsearch","log.origin":{"file.line":429,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).bulkCollectPublishFails"},"ecs.version":"1.6.0"}

The list of data streams for monitoring is hard-coded into Fleet, so we would need a Kibana change for this to be accepted. Regardless of the permission issue, we need to use the correct beat alias (filebeat, metricbeat, etc.) to avoid breaking all of the elastic_agent dashboards.
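
As a rough illustration of the naming rule being asked for here, the dataset should follow the beat alias rather than the name of the binary on disk. This sketch is an assumption for discussion only: `monitoringDataset` is a hypothetical function, not the Elastic Agent implementation, and the actual fix is made elsewhere.

```go
package main

import "fmt"

// monitoringDataset returns the dataset name used for a component's logs.
// binary is the executable name; alias is the beat it runs as (empty if none).
// Hypothetical helper: it only illustrates preferring the alias over the binary.
func monitoringDataset(binary, alias string) string {
	name := binary
	if alias != "" {
		name = alias
	}
	return "elastic_agent." + name
}

func main() {
	// With an alias, the existing dataset names are preserved.
	fmt.Println(monitoringDataset("agentbeat", "filebeat"))
	// Without one, the name falls back to the binary, which is the
	// elastic_agent.agentbeat value seen in the logs above.
	fmt.Println(monitoringDataset("agentbeat", ""))
}
```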

@cmacknz
Member

cmacknz commented Apr 18, 2024

The logs ingestion test should be failing as well, which would give us a stronger hint about the problem, but it only looks for metricbeat logs:

// Stage 1: Make sure metricbeat logs are populated
t.Log("Making sure metricbeat logs are populated")
docs := findESDocs(t, func() (estools.Documents, error) {
	return estools.GetLogsForDataset(ctx, info.ESClient, "elastic_agent.metricbeat")
})
t.Logf("metricbeat: Got %d documents", len(docs.Hits.Hits))
require.NotZero(t, len(docs.Hits.Hits),
	"Looking for logs in dataset 'elastic_agent.metricbeat'")
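
One way the check could cover more than metricbeat is to loop over every expected dataset. The sketch below is self-contained and hypothetical: `missingDatasets` and the `count` callback are stand-ins for a query such as `estools.GetLogsForDataset`, and the document counts are made up for the example.

```go
package main

import "fmt"

// missingDatasets returns the datasets for which count reports zero documents.
// count is a stand-in for a real Elasticsearch query.
func missingDatasets(datasets []string, count func(string) (int, error)) ([]string, error) {
	var missing []string
	for _, ds := range datasets {
		n, err := count(ds)
		if err != nil {
			return nil, fmt.Errorf("querying %s: %w", ds, err)
		}
		if n == 0 {
			missing = append(missing, ds)
		}
	}
	return missing, nil
}

func main() {
	// Made-up counts: metricbeat logs arrive, filebeat logs do not.
	fake := map[string]int{"elastic_agent.metricbeat": 12, "elastic_agent.filebeat": 0}

	missing, err := missingDatasets(
		[]string{"elastic_agent.metricbeat", "elastic_agent.filebeat"},
		func(ds string) (int, error) { return fake[ds], nil },
	)
	fmt.Println(missing, err)
}
```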

The unix socket connection errors are also in the agent logs:

{"log.level":"error","@timestamp":"2024-04-18T00:50:06.000Z","message":"Error fetching data for metricset beat.stats: error making http request: Get \"http://unix/stats\": dial unix /Library/Elastic/Agent/data/tmp/xTEtpJ7117ppc6OYvJCaYHbDW8mLjXGe.sock: connect: no such file or directory","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"ecs.version":"1.6.0","log.origin":{"file.line":256,"file.name":"module/wrapper.go","function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).fetch"},"service.name":"metricbeat","ecs.version":"1.6.0"}
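
The request in that error, a GET for `http://unix/stats` dialled over a unix socket, can be reproduced with the Go standard library. This is a minimal sketch rather than the metricbeat implementation: `newUnixClient` and `fetchStats` are hypothetical names, and the socket path in `main` is a placeholder.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// newUnixClient returns an HTTP client whose connections always go to the
// given unix socket, regardless of the host named in the request URL.
func newUnixClient(socketPath string) *http.Client {
	return &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}

// fetchStats performs GET http://unix/stats over the socket and returns the body.
func fetchStats(socketPath string) ([]byte, error) {
	resp, err := newUnixClient(socketPath).Get("http://unix/stats")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Placeholder path: with no socket file present, the dial fails in the
	// same way as the "no such file or directory" error quoted above.
	body, err := fetchStats("/tmp/example-missing.sock")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("received %d bytes\n", len(body))
}
```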

The error detection logic should also be failing here but I don't think it waits long enough to see all possible errors: https://github.com/blakerouse/elastic-agent/blob/d44fe9fac2a56d529ddf30833fd2b70535588249/testing/integration/logs_ingestion_test.go#L266-L269

@blakerouse
Contributor Author

@cmacknz The issue is actually in the agentbeat.spec.yml that is defined in the beats repository.

elastic/beats#39025

@blakerouse
Contributor Author

PR to fix the specification - elastic/beats#39026

@pchila
Member

pchila commented Apr 18, 2024

buildkite test this

@pierrehilbert
Contributor

After elastic/beats#39026 has been merged, @pchila ran tests locally and the failure is fixed.
As discussed with @jlind23, to avoid blocking the first BC, I'm force merging this one.

@pierrehilbert pierrehilbert merged commit 44319e2 into elastic:main Apr 18, 2024
4 of 9 checks passed

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

Labels
backport-skip Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Offer a lightweight Elastic Agent
6 participants