Agents temporarily go to Unhealthy state when changing the logging level under the Agent Logs tab. #1912
@manishgupta-qasource Please review. |
Secondary review for this ticket is Done |
I'm not entirely sure what is causing this, but I see:
{"log.level":"error","@timestamp":"2022-12-08T08:28:43.761Z","message":"Error while stopping harvester group: task failures\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file\n\terror while adding new reader to the bookkeeper harvester is already running for file","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"prospector":"file_prospector","log.logger":"input.filestream","log.origin":{"file.line":294,"file.name":"filestream/prospector.go"},"id":"filestream-monitoring-agent","service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
I can also see this in the Windows logs right after the log level change:
{"log.level":"warn","@timestamp":"2022-12-08T08:27:13.855Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":190},"message":"Possible transient error during checkin with fleet-server, retrying","error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://37ac1814a0eb4fc2882b10eafd9e145b.fleet.us-central1.gcp.foundit.no:443/ errored: Post \"https://37ac1814a0eb4fc2882b10eafd9e145b.fleet.us-central1.gcp.foundit.no:443/api/fleet/agents/3f82c333-9c03-40c6-a099-a25fd8aec301/checkin?\": context canceled\n\n"},"request_duration_ns":0,"failed_checkins":1,"retry_after_ns":68201468007,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-12-08T08:27:13.855Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":207},"message":"checkin retry loop was stopped","ecs.version":"1.6.0"} |
Likely we should retest this after #1896 |
This should be resolved in the next 8.6 snapshot build or BC. |
Hi @cmacknz, we have revalidated this issue on the 8.6 BC7 Kibana staging and Prod environments and found that it is still reproducible. Build details: Please let us know if more details are required. Thanks. |
@dikshachauhan-qasource the fix was merged to the 8.6 branch after the latest BC was built. Thus we need to wait for another BC. |
Hi @jlind23, thanks for the update. We will retest this on the next BC. |
@dikshachauhan-qasource @amolnater-qasource Please retest this with the latest snapshot being built today, along with #1959 |
Hi @cmacknz Observations:
Build details: Logs: Please let us know if anything else is required from our end. |
The Unhealthy state is due to Beat restarts. |
Hi @cmacknz, we have re-validated this issue on the latest 8.6.0 BC9 Kibana Cloud environment. Observations:
Screenshots: Build details:
Agents Logs:
Please let us know if we are missing anything. Thanks! |
The mentioned PR is not part of the BC; please revalidate with a SNAPSHOT build. |
@amolnater-qasource @dikshachauhan-qasource any update to provide here? |
Hi @jlind23
OS: Build details: Screen Recording: Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-01-03.17-06-33.mp4
Logs before agent restart: Logs after agent restart: Please let us know if anything else is required from our end. |
Yes, that is a symptom of elastic/beats#34137 which will be fixed in the next BC |
Closing this as elastic/beats#34137 was merged |
I think we are unnecessarily restarting the Beats when only the log level has changed for the output unit. |
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.766Z","log.origin":{"file.name":"handlers/handler_action_settings.go","file.line":68},"message":"Settings action done, setting agent log level to debug","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.774Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":729},"message":"Updating running component model","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed log-default-logfile-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e (HEALTHY->CONFIGURING): Configuring","component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed log-default (HEALTHY->CONFIGURING): Configuring","component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (HEALTHY->CONFIGURING): Configuring","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed beat/metrics-monitoring (HEALTHY->CONFIGURING): Configuring","component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed system/metrics-default-system/metrics-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e (HEALTHY->CONFIGURING): Configuring","component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed system/metrics-default (HEALTHY->CONFIGURING): Configuring","component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed filestream-monitoring-filestream-monitoring-agent (HEALTHY->CONFIGURING): Configuring","component":{"id":"filestream-monitoring","state":"HEALTHY"},"unit":{"id":"filestream-monitoring-filestream-monitoring-agent","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed filestream-monitoring (HEALTHY->CONFIGURING): Configuring","component":{"id":"filestream-monitoring","state":"HEALTHY"},"unit":{"id":"filestream-monitoring","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed winlog-default-winlog-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e (HEALTHY->CONFIGURING): Configuring","component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default-winlog-system-d5d984ea-8f6f-4a96-97ce-03c2e936327e","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.783Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed winlog-default (HEALTHY->CONFIGURING): Configuring","component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.785Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (HEALTHY->CONFIGURING): Configuring","component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.785Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":827},"message":"Unit state changed http/metrics-monitoring (HEALTHY->CONFIGURING): Configuring","component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring","type":"output","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-03T11:22:01.890Z","message":"beat is restarting because output changed","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log.logger":"centralmgmt.V2-manager","log.origin":{"file.line":503,"file.name":"management/managerV2.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"} |
elastic/beats#34178 (not a blocker, just an optimization). |
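The optimization discussed above is to avoid restarting a Beat when the only difference in the output unit's configuration is the log level. A minimal sketch of that comparison, assuming the output configuration can be treated as a flat map and that the log level lives under a key named `log_level`. The type `outputConfig` and the functions `withoutLogLevel`, `needsRestart`, and `logLevelChanged` are hypothetical illustrations; this is not the actual change made in elastic/beats#34178.

```go
package main

import (
	"fmt"
	"reflect"
)

// outputConfig is a hypothetical, simplified view of an output unit's
// configuration. It is NOT the real elastic-agent or Beats type.
type outputConfig map[string]interface{}

// withoutLogLevel returns a shallow copy of cfg with the "log_level" key
// removed. The key name is an assumption made for illustration.
func withoutLogLevel(cfg outputConfig) outputConfig {
	out := make(outputConfig, len(cfg))
	for k, v := range cfg {
		if k == "log_level" {
			continue
		}
		out[k] = v
	}
	return out
}

// needsRestart reports whether the Beat would have to restart: only when
// something other than the log level differs between the two configs.
func needsRestart(oldCfg, newCfg outputConfig) bool {
	return !reflect.DeepEqual(withoutLogLevel(oldCfg), withoutLogLevel(newCfg))
}

// logLevelChanged reports whether the (hypothetical) log_level key differs,
// in which case the level could be applied in place without a restart.
func logLevelChanged(oldCfg, newCfg outputConfig) bool {
	return !reflect.DeepEqual(oldCfg["log_level"], newCfg["log_level"])
}

func main() {
	oldCfg := outputConfig{"type": "elasticsearch", "hosts": []string{"https://localhost:9200"}, "log_level": "info"}
	newCfg := outputConfig{"type": "elasticsearch", "hosts": []string{"https://localhost:9200"}, "log_level": "debug"}

	switch {
	case needsRestart(oldCfg, newCfg):
		fmt.Println("restart")
	case logLevelChanged(oldCfg, newCfg):
		fmt.Println("apply log level in place")
	default:
		fmt.Println("no change")
	}
}
```

With the sample configs in `main`, only the log level differs, so the sketch prints `apply log level in place` instead of triggering the "beat is restarting because output changed" path seen in the logs above.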
Hi @cmacknz Observations:
Build details: Screen Recording: Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-01-05.15-14-05.mp4
Hence, we are marking this issue as QA:Validated. |
Kibana version: 8.6 BC6 Kibana Cloud environment
Host OS and Browser version: All, All
Build details:
Preconditions:
Steps to reproduce:
(approximately 10 minutes)
Logs:
[Windows]elastic-agent-diagnostics-2022-12-08T08-38-54Z-00.zip
[Linux]elastic-agent-diagnostics-2022-12-08T08-32-12Z-00.zip
[MAC]elastic-agent-diagnostics-2022-12-08T08-32-38Z-00.zip
Screenshot:


Expected Result:
Agents should remain healthy when the logging level is changed under the Agent Logs tab.