{agent} health status

The {agent} monitoring documentation describes the features available through the {fleet} UI for you to view {agent} status and activity, access metrics and diagnostics, enable alerts, and more.

For details about how the {agent} status is monitored by {fleet}, including connectivity, check-in frequency, and related behavior, see the following:

How does {agent} connect to {fleet} to report its availability and health, and to receive policy updates?

After enrollment, {agent} regularly initiates a check-in to {fleet-server} using HTTP long-polling ({fleet-server} is deployed either on-premises or as part of {es} in {ecloud}).

The HTTP long-polling request is kept open until there’s a configuration change that {agent} needs to consume, an action is sent to the agent, or a 5-minute timeout elapses. After 5 minutes, the agent sends another check-in to start the process over again.

The frequency of check-ins can be configured to a new value, with the caveat that changing it may affect the maximum number of agents that can connect to {fleet}. Our regular scale testing of the solution doesn’t modify this parameter.
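
The following Go sketch illustrates the long-polling check-in pattern described above. It is a conceptual illustration, not {agent} source code: the {fleet-server} URL, the check-in endpoint path, the `checkInOnce` helper, and the 5-minute poll window constant are assumptions made for the example.

[source,go]
----
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// pollTimeout mirrors the default 5-minute long-poll window described above.
// The real check-in frequency is configurable; this constant is illustrative only.
const pollTimeout = 5 * time.Minute

// checkInOnce opens a single long-poll request to a check-in endpoint and blocks
// until the server responds (a policy change or a pending action) or the poll
// window elapses. The endpoint path is an assumption for this sketch.
func checkInOnce(ctx context.Context, client *http.Client, fleetURL string) error {
	ctx, cancel := context.WithTimeout(ctx, pollTimeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, fleetURL+"/api/fleet/agents/checkin", nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err // includes the timeout case: the agent simply checks in again
	}
	defer resp.Body.Close()
	fmt.Println("check-in completed with status:", resp.Status)
	return nil
}

func main() {
	client := &http.Client{} // no client-side timeout; the context bounds each poll
	for {
		if err := checkInOnce(context.Background(), client, "https://fleet-server.example.com:8220"); err != nil {
			fmt.Println("check-in ended:", err)
			time.Sleep(time.Second) // brief pause before re-opening the poll
		}
		// Otherwise the next long-poll is opened immediately, as described above.
	}
}
----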

Diagram of connectivity between agents

We use stack monitoring to monitor the status of our cluster. Does the monitoring of {agent} and the status shown in {fleet} use stack monitoring as well?

No. The health monitoring of {agent} and its inputs, as reported in {fleet}, is done completely outside of what stack monitoring provides.

There are many components that make up {agent}. How does {agent} ensure that these components/processes are up and running, and healthy?

{agent} is essentially a supervisor that, at a minimum, deploys a {filebeat} instance for log collection and a {metricbeat} instance for metrics collection from the system and the applications running on it. As a supervisor, it also ensures that these spawned processes are running and healthy. Using gRPC, {agent} communicates with the underlying processes once every 30 seconds to verify their health. If there’s no response, the agent transitions to an Unhealthy state, and the result and details are reported to {fleet}.
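
As an illustration of this supervision loop, the Go sketch below polls a component once every 30 seconds. It uses the standard gRPC health-checking protocol as a stand-in for the agent’s own control protocol, not the actual implementation, and the local component address is an assumption for the example.

[source,go]
----
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// superviseComponent checks a spawned component once every 30 seconds, matching
// the interval described above. The gRPC health-checking protocol stands in for
// the agent's own control protocol; the address is a hypothetical local endpoint.
func superviseComponent(addr string, unhealthy chan<- string) {
	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial %s: %v", addr, err)
	}
	defer conn.Close()
	client := healthpb.NewHealthClient(conn)

	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		resp, err := client.Check(ctx, &healthpb.HealthCheckRequest{})
		cancel()
		if err != nil || resp.GetStatus() != healthpb.HealthCheckResponse_SERVING {
			// No response (or a bad one): report the component so the supervisor
			// can mark itself Unhealthy and surface the details upstream.
			unhealthy <- addr
		}
	}
}

func main() {
	unhealthy := make(chan string)
	go superviseComponent("127.0.0.1:6789", unhealthy) // hypothetical component endpoint
	for addr := range unhealthy {
		log.Printf("component %s failed its health check; the agent would report Unhealthy", addr)
	}
}
----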

If {agent} goes down, is an alert generated by {fleet}?

No. Alerts would have to be created in {kib} on the indices that show the total count of agents in each specific state. Refer to [fleet-alerting] in the {agent} monitoring documentation for the steps to configure alerting. Generating alerts on status changes for individual agents is planned for a future release.
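
As a starting point, the sketch below runs the kind of terms aggregation such an alert rule could be built on, counting agents per reported status. The `.fleet-agents` index name, the `last_checkin_status` field, the cluster URL, and the credentials are assumptions for the example; verify them against your own cluster before relying on them.

[source,go]
----
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Terms aggregation that buckets agents by their reported status.
	// Field and index names are assumptions for this sketch.
	query := []byte(`{
	  "size": 0,
	  "aggs": {
	    "agents_per_status": {
	      "terms": { "field": "last_checkin_status" }
	    }
	  }
	}`)

	req, err := http.NewRequest(http.MethodPost,
		"https://localhost:9200/.fleet-agents/_search", bytes.NewReader(query))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.SetBasicAuth("elastic", "changeme") // placeholder credentials

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // per-status bucket counts to alert on
}
----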

How long does it take for {agent} to report a status change?

Some {agent} states are reported immediately, such as when the agent becomes Unhealthy. Other states are derived after certain criteria are met. Refer to [view-agent-status] in the {agent} monitoring documentation for details about monitoring agent status.

The transition from an Offline state to an Inactive state is configurable by the user and can be fine-tuned by setting the inactivity timeout parameter.
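
A minimal sketch of tuning that parameter through the {kib} {fleet} API is shown below. The endpoint path, the `inactivity_timeout` field (in seconds), the example value, and the policy ID are assumptions for the example; the same setting can also be changed in the {fleet} UI under the agent policy settings.

[source,go]
----
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Example policy update body; "inactivity_timeout" is assumed to be in
	// seconds, and the value shown (two weeks) is only illustrative.
	body := []byte(`{
	  "name": "Default policy",
	  "namespace": "default",
	  "inactivity_timeout": 1209600
	}`)

	req, err := http.NewRequest(http.MethodPut,
		"https://localhost:5601/api/fleet/agent_policies/your-policy-id", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("kbn-xsrf", "true")      // Kibana APIs require this header
	req.SetBasicAuth("elastic", "changeme") // placeholder credentials

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("policy update status:", resp.Status)
}
----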