Troubleshoot common problems

We have collected the most common known problems and listed them here. If your problem is not described here, please review the open issues in the following GitHub repositories:

Repository                   To review or report issues about
elastic/kibana               {fleet} and {integrations} UI
elastic/elastic-agent        {agent}
elastic/beats                {beats} shippers
elastic/fleet-server         {fleet-server}
elastic/package-registry     {package-registry}
elastic/observability-docs   Documentation issues

Have a question? Read our FAQ, or contact us in the {forum}[discuss forum]. Your feedback is valuable to us.

Running {agent} standalone? Also refer to [debug-standalone-agents].

Troubleshooting contents

Find troubleshooting information for {fleet}, {fleet-server}, and {agent} in the following documentation:

{agent} unenroll fails

In {fleet}, if you delete an {agent} policy that is associated with one or more inactive enrolled agents, those agents cannot be unenrolled when they later return to a Healthy or Offline state. Attempting to unenroll such an agent results in an Error unenrolling agent message, and the unenrollment fails.

To resolve this problem, you can use the {kib} {fleet} APIs to force unenroll the agent.

To unenroll a single {agent}:

POST kbn:/api/fleet/agents/<agent_id>/unenroll
{
  "force": true,
  "revoke": true
}

To bulk unenroll a set of {agents}:

POST kbn:/api/fleet/agents/bulk_unenroll
{
  "agents": ["<agent_id1>", "<agent_id2>"],
  "force": true,
  "revoke": true
}

We are also updating the {fleet} UI to prevent removal of an {agent} policy that is currently associated with any inactive agents.

illegal_argument_exception when TSDB is enabled

When you use an {agent} integration in which TSDB (Time Series Database) is enabled, you may encounter an illegal_argument_exception error in the {fleet} UI.

This can occur if you have a component template defined that includes a _source attribute, which conflicts with the _source: synthetic setting used when TSDB is enabled.

For details about the error and how to resolve it, refer to the section Runtime fields cannot be used in TSDB indices in the Innovation Hub article TSDB enabled integrations for {agent}.

{agent}s hosted on {ecloud} are stuck in Updating or Offline

In {ecloud}, after upgrading {fleet-server} and its integration policies, agents enrolled in the {ecloud} agent policy may experience issues updating. To resolve this problem:

  1. In a terminal window, run the following cURL request, providing your {kib} superuser credentials to reset the {ecloud} agent policy.

    • On {kib} versions 8.11 and later, run:

      curl -u <username>:<password> --request POST \
        --url <kibana_url>/internal/fleet/reset_preconfigured_agent_policies/policy-elastic-agent-on-cloud \
        --header 'content-type: application/json' \
        --header 'kbn-xsrf: xyz' \
        --header 'elastic-api-version: 1'
    • On {kib} versions earlier than 8.11, run:

      curl -u <username>:<password> --request POST \
        --url <kibana_url>/internal/fleet/reset_preconfigured_agent_policies/policy-elastic-agent-on-cloud \
        --header 'content-type: application/json' \
        --header 'kbn-xsrf: xyz'
  2. Force unenroll the agent stuck in Updating:

    1. To find the agent's ID, go to {fleet} > Agents and click the agent to see its details. Copy the Agent ID.

    2. In a terminal window, run:

      curl -u <username>:<password> --request POST \
        --url <kibana_url>/api/fleet/agents/<agentID>/unenroll \
        --header 'content-type: application/json' \
        --header 'kbn-xsrf: xx' \
        --data-raw '{"force":true,"revoke":true}' \
        --compressed

      Where <agentID> is the ID you copied in the previous step.

  3. Restart the {integrations-server}:

    In the {ecloud} console under {integrations-server}, click Force Restart.

When using {ecloud}, {fleet-server} is not listed in {kib}

If you are unable to see {fleet-server} in {kib}, make sure it’s set up.

To set up {fleet-server} on {ecloud}:

  1. Go to your deployment on {ecloud}.

  2. Follow the {ecloud} prompts to set up {integrations-server}. Once complete, the {fleet-server} {agent} will show up in {fleet}.

To enable {fleet} and set up {fleet-server} on a self-managed cluster:

  1. In the {es} configuration file, config/elasticsearch.yml, set the following security settings to enable security and API keys:

    xpack.security.enabled: true
    xpack.security.authc.api_key.enabled: true
  2. In the {kib} configuration file, config/kibana.yml, enable {fleet} and specify your user credentials:

    xpack.encryptedSavedObjects.encryptionKey: "something_at_least_32_characters"
    elasticsearch.username: "my_username" (1)
    elasticsearch.password: "my_password"
    1. Specify a user who is authorized to use {fleet}.

      To set up passwords, you can use the documented {es} APIs or the elasticsearch-setup-passwords command. For example, ./bin/elasticsearch-setup-passwords auto

      After running the command:

      1. Copy the Elastic user name to the {kib} configuration file.

      2. Restart {kib}.

      3. Follow the documented steps for setting up a self-managed {fleet-server}. For more information, refer to [fleet-server].

The /api/fleet/setup endpoint can’t reach the package registry

To install {integrations}, the {fleet} app requires a connection to an external service called the {package-registry}.

For this to work, the {kib} server must connect to https://epr.elastic.co on port 443.
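
To quickly confirm that the {kib} host can reach the registry, you can request the registry root over HTTPS. This is a hedged connectivity check only; any HTTP response indicates that the connection and TLS handshake succeeded:

curl -sI https://epr.elastic.co

If the command times out or reports a TLS error, check the proxy, firewall, and CA configuration between {kib} and the registry.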

{kib} cannot connect to {package-registry} in air-gapped environments

In air-gapped environments, you may encounter the following error if you’re using a custom Certificate Authority (CA) that is not available to {kib}:

{"type":"log","@timestamp":"2022-03-02T09:58:36-05:00","tags":["error","plugins","fleet"],"pid":58716,"message":"Error connecting to package registry: request to https://customer.server.name:8443/categories?experimental=true&include_policy_templates=true&kibana.version=7.17.0 failed, reason: self signed certificate in certificate chain"}

To fix this problem, add your CA certificate file path to the {kib} startup file by defining the NODE_EXTRA_CA_CERTS environment variable. For more information, refer to the [air-gapped-tls] section.
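
For example, when starting {kib} manually from a shell, you might set the variable before launching it. This is a sketch; the certificate path is an assumption and depends on where your CA file is stored:

export NODE_EXTRA_CA_CERTS=/etc/kibana/certs/ca.crt
./bin/kibana

If {kib} runs as a service, set the variable in the service environment (for example, the systemd unit or container environment) instead.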

{fleet} in {kib} crashes

  1. To investigate the error, open your browser’s development console.

  2. Select the Network tab, and refresh the page.

    One of the requests to the {fleet} API will most likely have returned an error. If the error message doesn’t give you enough information to fix the problem, please contact us in the {forum}[discuss forum].

{agent} enrollment fails on the host with x509: certificate signed by unknown authority message

To ensure that communication with {fleet-server} is encrypted, {fleet-server} requires {agent}s to present a signed certificate. In a self-managed cluster, if you don’t specify certificates when you set up {fleet-server}, self-signed certificates are generated automatically.

If you attempt to enroll an {agent} in a {fleet-server} with a self-signed certificate, you will encounter the following error:

Error: fail to enroll: fail to execute request to fleet-server: x509: certificate signed by unknown authority
Error: enroll command failed with exit code: 1

To fix this problem, pass the --insecure flag along with the enroll or install command. For example:

sudo ./elastic-agent install --url=https://<fleet-server-ip>:8220 --enrollment-token=<token> --insecure

Traffic between {agent}s and {fleet-server} over HTTPS will be encrypted; you’re simply acknowledging that you understand that the certificate chain cannot be verified.

Allowing {fleet-server} to generate self-signed certificates is useful to get things running for development, but not recommended in a production environment.

For more information, refer to [secure-connections].

{agent} enrollment fails on the host with x509: cannot validate certificate for x.x.x.x because it doesn’t contain any IP SANs message

To ensure that communication with {es} is encrypted, {fleet-server} requires {es} to present a signed certificate.

This error occurs when you use self-signed certificates with {es} that use an IP address as the Common Name (CN). With an IP address as the CN, {fleet-server} looks at the subject alternative names (SANs), which are empty. To work around this situation, use the --fleet-server-es-insecure flag to disable certificate verification.

You will also need to set ssl.verification_mode: none in the Output settings in {fleet} and {integrations} UI.
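
As a hedged example, the flag is passed when bootstrapping {fleet-server} on the {agent} host; the URL, port, token, and policy ID below are placeholders:

sudo ./elastic-agent install --fleet-server-es=https://<es-ip>:9200 \
  --fleet-server-service-token=<service-token> \
  --fleet-server-policy=<policy-id> \
  --fleet-server-es-insecure

Remember to also set ssl.verification_mode: none in the output settings, as described above.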

{agent} enrollment fails on the host with Client.Timeout exceeded message

To enroll in {fleet}, {agent} must connect to the {fleet-server} instance. If the agent is unable to connect, you see the following failure:

fail to enroll: fail to execute request to {fleet-server}:Post http://fleet-server:8220/api/fleet/agents/enroll?: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Here are several steps to help you troubleshoot the problem.

  1. Check for networking problems. From the host, run the ping command to confirm that it can reach the {fleet-server} instance.

  2. Additionally, curl the /status API of {fleet-server}:

    curl -f http://<fleet-server-url>:8220/api/status
  3. Verify that you have specified the correct {kib} {fleet} settings URL and port for your environment.

    By default, {fleet-server} expects the HTTPS protocol and port 8220 unless you have explicitly set it otherwise.

  4. Check that you specified a valid enrollment key during enrollment. To do this:

    1. In {fleet}, select Enrollment tokens.

    2. To view the secret, click the eyeball icon. The secret should match the string that you used to enroll {agent} on your host.

    3. If the secret doesn’t match, create a new enrollment token and use this token when you run the elastic-agent enroll command.
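
For example, to re-run enrollment with a newly created token (a sketch reusing the placeholders from the enrollment examples elsewhere on this page):

sudo ./elastic-agent enroll --url=https://<fleet-server-ip>:8220 --enrollment-token=<new-token>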

Many {fleet-server} problems can be triaged and fixed with the following tips

Important
When creating an issue or sending a support forum communication, this section can help you identify what is required.

{fleet-server} allows {agent}s to connect to {es}, replacing the connection to {kib} used in prior releases. However, because {fleet-server} runs on an edge host, it may require additional networking setup and troubleshooting.

Retrieve the {agent} version

  1. If you installed the {agent}, run the following command (the example is for POSIX based systems):

    elastic-agent version
  2. If you have not installed the {agent} and you are running it as a temporary process, you can run:

    ./elastic-agent version
    Note
    Both of the above commands are available on Windows and macOS, with slight OS-specific variations in how you call them. If needed, refer to [elastic-agent-installation] for examples of how to adjust them.

Check the {agent} status

Run the following command to view the current status of the {agent}.

elastic-agent status

Based on the information returned, you can take further action.

If {agent} is running, but you do not see what you expect, here are some items to review:

  1. In {fleet}, click Agents. Check which policy is associated with the running {agent}. If it is not the policy you expected, you can change it.

  2. In {fleet}, click Agents, and then select the {agent} policy. Check for the integrations that should be included.

    For example, if you want to include system data, make sure the System integration is included in the policy.

  3. Confirm if the Collect agent logs and Collect agent metrics options are selected.

    1. In {fleet}, click Agents, and then select the {agent} policy.

    2. Select the Settings tab. If you want to collect agent logs or metrics, select these options.

      Important
      The {ecloud} agent policy is created only in {ecloud} deployments and, by default, does not include the collection of logs or metrics.

Collect {agent} diagnostics bundle

The {agent} diagnostics bundle collects the following information:

  1. {agent} version numbers

  2. {beats} (and other process) version numbers and process metadata

  3. Local configuration, elastic-agent policy, and the configuration that is rendered and passed to {beats} and other processes

  4. {agent}'s local log files

  5. {agent} and {beats} pprof profiles

Note that the diagnostics bundle is intended for debugging purposes only; its structure may change between releases.

Important
{agent} attempts to automatically redact credentials and API keys when creating diagnostics. Please review the contents of the archive before sharing to ensure that there are no credentials in plain text.
Important
The ZIP archive containing diagnostics information includes the raw events of documents sent to the {agent} output. By default, only failing events are logged at the warn level; when the debug logging level is enabled, all events are logged. Please review the contents of the archive before sharing it to ensure that no sensitive information is included.

Get the diagnostics bundle using the CLI

Run the following command to generate a zip archive containing diagnostics information that the Elastic team can use for debugging cases.

elastic-agent diagnostics

If you want to omit the raw events from the diagnostic, add the flag --exclude-events.
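
For example, to generate the archive without the raw events:

elastic-agent diagnostics --exclude-events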

Get the diagnostics bundle through {fleet}

{fleet} provides the ability to remotely generate and gather an {agent}'s diagnostics bundle. An agent can gather and upload diagnostics if it is online in a Healthy or Unhealthy state. The diagnostics are sent to {fleet-server}, which in turn adds them to {es}. Therefore, this works even with {agents} that are not using the {es} output. To download the diagnostics bundle for local viewing:

  1. In {fleet}, open the Agents tab.

  2. In the Host column, click the agent’s name.

  3. Select the Diagnostics tab and click the Request diagnostics .zip button.

  4. In the Request Diagnostics pop-up, select Collect additional CPU metrics if you’d like detailed CPU data.

  5. Click the Request diagnostics button.

When available, the new diagnostic bundle will be listed on this page, as well as any in-progress or previously collected bundles for the {agent}.

Note that the bundles are stored in {es} and are removed automatically after 7 days. You can also delete any previously created bundle by clicking the trash can icon.

Some problems occur so early that insufficient logging is available

If some problems occur early and insufficient logging is available, run the following command:

./elastic-agent install -f

The standalone install command installs {agent} and sets up all of the service configuration. You can then run the enroll command. For example:

elastic-agent enroll --fleet-server-es=https://<es-url>:443 --fleet-server-service-token=<token> --fleet-server-policy=<policy-id>

Note: Port 443 is commonly used in {ecloud}. However, with self-managed deployments, your {es} may run on port 9200 or something entirely different.

For information on where to find agent logs, refer to our FAQ.

{agent} is reported as Healthy but still has setup problems sending data to {es}

  1. To confirm that the {agent} is running and its status is Healthy, select the Agents tab.

    If you previously selected the Collect agent logs option, you can now look at the agent logs.

  2. Click the agent name and then select the Logs tab.

    If no logs are displayed, this suggests a communication problem between your host and {es}. One possible reason is that the port is already in use.

  3. You can check the port usage using tools like Wireshark or netstat. On a POSIX system, you can run the following command:

    netstat -nat | grep :8220

    Any response data indicates that the port is in use. This may be expected, or it may not be, for example if you had intended to uninstall {fleet-server}. In that case, re-check and continue.

{agent} is stuck in status Updating

Beginning in {stack} version 8.11, a stuck {agent} upgrade should be detected automatically, and you can restart the upgrade from {fleet}.

{fleet-server} is running and healthy with data, but other Agents cannot use it to connect to {es}

Some settings are only used when you have multiple {agent}s. If this is the case, it may help to check that the hosts can communicate with the {fleet-server}.

From the non-{fleet-server} host, run the following command:

curl -f http://<fleet-server-ip>:8220/api/status

The response may yield errors that you can debug further, or it may show that communication ports and networking are not the problem.

One common problem is that the default {fleet-server} port of 8220 isn’t open on the {fleet-server} host to communicate. You can review and correct this using common tools in alignment with any networking and security concerns you may have.

{es} authentication service fails with Authentication using apikey failed message

To save API keys and encrypt them in {es}, {fleet} requires an encryption key.

To provide an encryption key, set the xpack.encryptedSavedObjects.encryptionKey property in the kibana.yml configuration file:

xpack.encryptedSavedObjects.encryptionKey: "something_at_least_32_characters"
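
If you do not yet have a value for this setting, {kib} ships with a helper command that can generate suitable keys. As a hedged example, run it from the {kib} installation directory and copy the generated xpack.encryptedSavedObjects.encryptionKey value into kibana.yml:

./bin/kibana-encryption-keys generate

Restart {kib} after updating the configuration.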

{agent} fails with Agent process is not root/admin or validation failed message

Ensure the user running {agent} has root privileges as some integrations require root privileges to collect sensitive data.

If you’re running {agent} in the foreground (and not as a service) on Linux or macOS, run the agent under the root user: sudo or su.

If you’re using the {elastic-defend} integration, make sure you’re running {agent} under the SYSTEM account.

Tip
If you install {agent} as a service as described in [elastic-agent-installation], {agent} runs under the SYSTEM account by default.

To run {agent} under the SYSTEM account, you can do the following:

  1. Download PsExec and extract the contents to a folder. For example, d:\tools.

  2. Open a command prompt as an Administrator (right-click the command prompt icon and select Run As Administrator).

  3. From the command prompt, run {agent} under the SYSTEM account:

    d:\tools\psexec.exe -sid "C:\Program Files\Elastic-Agent\elastic-agent.exe" run

Integration policy upgrade has too many conflicts

If you try to upgrade an integration policy that is several versions old, there may be substantial conflicts or configuration issues. Rather than trying to fix these problems, it might be faster to create a new policy, test it, and roll out the integration upgrade to additional hosts.

  1. Create a new policy.

  2. Add the integration to the policy. The newer version is automatically used.

  3. Apply the policy to an {agent}.

    Tip
    In larger deployments, you should test integration upgrades on a sample {agent} before rolling out a larger upgrade initiative. Only after a small trial is deemed successful should the updated policy be rolled out to all hosts.
  4. Roll out the integration update to additional hosts:

    1. In {fleet}, click Agent policies. Click on the name of the policy you want to edit.

    2. Search or scroll to a specific integration. Open the Actions menu and select Delete integration.

    3. Click Add integration and re-add the freshly deleted integration. The updated version will be used and applied to all {agent}s.

    4. Repeat this process for each policy with the out-of-date integration.

      Note
      In some instances, for example, when there are hundreds or thousands of different {agent}s and policies that need to be updated, this upgrade path is not feasible. In this case, update one policy and use the [copy-policy] action to apply the updated policy versions to additional policies. The downside of this method is that you lose the granularity of assessing integration version changes individually across policies.

{agent} hangs while unenrolling

When unenrolling {agent}, {fleet} waits for acknowledgment from the agent before it completes the unenroll process. If {fleet} doesn’t receive an acknowledgment, the status hangs at unenrolling.

You can unenroll an agent to invalidate all API keys related to the agent and change the status to inactive so that the agent no longer appears in {fleet}.

  1. In {fleet}, select Agents.

  2. Under Agents, choose Unenroll agent from the Actions menu next to the agent you want to unenroll.

  3. Click Force unenroll.

On {fleet-server} startup, ERROR seen with State changed to CRASHED: exited with code: 1

You may see this error message for a number of different reasons. A common reason is when attempting production-like usage and the ca.crt file passed in cannot be found. To verify whether this is the problem, bootstrap {fleet-server} without passing a ca.crt file. This implies that you would temporarily test any subsequent {agent} installs with {fleet-server}'s own self-signed certificate.

Tip
Be sure to pass in the full path to the ca.crt file. A relative path will not work.

You will know that your {fleet-server} is set up with its testing-oriented self-signed certificate when you see the following error during {agent} installs:

Error: fail to enroll: fail to execute request to fleet-server: x509: certificate signed by unknown authority
Error: enroll command failed with exit code: 1

To install or enroll an {agent} against a {fleet-server} that uses a self-signed certificate, add the --insecure option to the command:

sudo ./elastic-agent install --url=https://<fleet-server-ip>:8220 --enrollment-token=<token> --insecure

Uninstalling {elastic-endpoint} fails

When you uninstall {agent}, all the programs managed by {agent}, such as {elastic-endpoint}, are also removed. If uninstalling fails, {elastic-endpoint} might remain on your system.

To remove {elastic-endpoint}, run the following commands:
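
The exact commands are platform specific and are not reproduced here. As a rough, hedged sketch only, assuming the default install locations and an {elastic-endpoint} binary that supports the uninstall subcommand (verify the path and options for your platform and version):

# Linux (assumed default install path)
sudo /opt/Elastic/Endpoint/elastic-endpoint uninstall

# macOS (assumed default install path)
sudo /Library/Elastic/Endpoint/elastic-endpoint uninstall

# Windows, from an elevated prompt (assumed default install path)
"C:\Program Files\Elastic\Endpoint\elastic-endpoint.exe" uninstall

If tamper protection is enabled, the uninstall may additionally require an uninstall token.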

API key is unauthorized to send telemetry to .logs-endpoint.diagnostic.collection-* indices

By default, telemetry is turned on in the {stack} to help us learn about the features that our users are most interested in. This helps us to focus our efforts on making features even better.

If you’ve recently upgraded from version 7.10 to 7.11, you might see the following message when you view {elastic-defend} logs:

action [indices:admin/auto_create] is unauthorized for API key id [KbvCi3YB96EBa6C9k2Cm]
of user [fleet_enroll] on indices [.logs-endpoint.diagnostic.collection-default]

The above message indicates that {elastic-endpoint} does not have the correct permissions to send telemetry. This is a known problem in 7.11 that will be fixed in an upcoming patch release.

To remove this message from your logs, you can turn off telemetry for the {elastic-defend} integration until the next patch release is available.

  1. In {kib}, click Integrations, and then select the Manage tab.

  2. Click {elastic-defend}, and then select the Policies tab to view all the installed integrations.

  3. Click the integration to edit it.

  4. Under advanced settings, set windows.advanced.diagnostic.enabled to false, and then save the integration.

Hosted {agent} is offline

To scale the {fleet-server} deployment, {ecloud} starts new containers or shuts down old ones when hosted {agent}s are required or no longer needed. The old {agent}s will show in the Agents list for 24 hours then automatically disappear.

{agent} fails to enroll with {fleet-server} running on localhost.

If you’re testing {fleet-server} locally on a macOS system using localhost (https://127.0.0.1:8220) as the Host URL, you may encounter this error:

Error: fail to enroll: fail to execute request to fleet-server:
lookup My-MacBook-Pro.local: no such host

This can occur on newer macOS software. To resolve the problem, ensure that file sharing is enabled on your local system.

APM & {fleet} fails to upgrade to 8.x on {ecloud}

In some scenarios, upgrading APM & {fleet} to 8.x may fail if the {ecloud} agent policy was modified manually. The {fleet} app in {kib} may show a message like:

Unable to create package policy. Package 'apm' already exists on this agent policy

To work around this problem, you can reset the {ecloud} agent policy with an API call. Note that this will remove any custom integration policies that you’ve added to the policy, such as Synthetics monitors.

curl -u elastic:<password> --request POST \
  --url <kibana_url>/internal/fleet/reset_preconfigured_agent_policies/policy-elastic-agent-on-cloud \
  --header 'Content-Type: application/json' \
  --header 'kbn-xsrf: xyz'

Air-gapped {agent} upgrade can fail due to an inaccessible PGP key

In versions 8.9 and above, an {agent} upgrade may fail when the upgrader can’t access a PGP key required to verify the binary signature. For details and a workaround, refer to the PGP key download fails in an air-gapped environment known issue in the version 8.9.0 Release Notes or to the workaround documentation in the elastic-agent GitHub repository.

{agents} are unable to connect after removing the {fleet-server} integration

When you use {fleet}-managed {agent}s, at least one {agent} needs to be running the {fleet-server} integration. If the policy containing this integration is accidentally removed from that {agent}, all other agents can no longer be managed. However, the {agents} will continue to send data to their configured output.

There are two approaches to fixing this issue, depending on whether or not the {agent} that was running the {fleet-server} integration is still installed and healthy (but is now running another policy).

To recover the {agent}:

  1. In {fleet}, open the Agents tab and click Add agent.

  2. In the Add agent flyout, select an agent policy that contains the Fleet Server integration. On Elastic Cloud you can use the Elastic Cloud agent policy which includes the integration.

  3. Follow the instructions in the flyout, and stop before running the CLI commands.

  4. Depending on the state of the original {fleet-server} {agent}, do one of the following:

    • The original {fleet-server} {agent} is still running and healthy

      In this case, you only need to re-enroll the agent with {fleet}:

      1. Copy the elastic-agent install command from the {kib} UI.

      2. In the command, replace install with enroll.

      3. In the directory where {agent} is running (for example /opt/Elastic/Agent/ on Linux), run the command as root.

        For example, if {kib} gives you the command:

        sudo ./elastic-agent install --url=https://fleet-server:8220 --enrollment-token=bXktc3VwZXItc2VjcmV0LWVucm9sbWVudC10b2tlbg==

        Instead run:

        sudo ./elastic-agent enroll --url=https://fleet-server:8220 --enrollment-token=bXktc3VwZXItc2VjcmV0LWVucm9sbWVudC10b2tlbg==
    • The original {fleet-server} {agent} is no longer installed

      In this case, you need to install the agent again:

      1. Copy the commands from the {kib} UI. The commands don’t need to be changed.

      2. Run the commands in order. The first three commands will download a new {agent} install package, expand the archive, and change directories.

        The final command will install {agent}. For example:

        sudo ./elastic-agent install --url=https://fleet-server:8220 --enrollment-token=bXktc3VwZXItc2VjcmV0LWVucm9sbWVudC10b2tlbg==

After running these steps your {agents} should be able to connect with {fleet} again.

{agent} Out of Memory errors on Kubernetes

In a Kubernetes environment, {agent} may be terminated with reason OOMKilled due to inadequate available memory.

To detect the problem, run the kubectl describe pod command and check the results for the following content:

       Last State:   Terminated
       Reason:       OOMKilled
       Exit Code:    137

To resolve the problem, allocate additional memory to the agent and then restart it.
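
As a hedged example of one way to do this, assuming the agent runs as a DaemonSet named elastic-agent in the kube-system namespace (names vary by manifest), you can raise the memory request and limit on the pod template and wait for the rollout:

# Values below are illustrative; size them for your cluster
kubectl set resources daemonset elastic-agent -n kube-system \
  --requests=memory=700Mi --limits=memory=1400Mi

# The DaemonSet rolls out new Pods with the updated resources
kubectl rollout status daemonset elastic-agent -n kube-system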

Error when running {agent} commands with sudo

On Linux systems, when you install {agent} without administrative privileges, that is, using the --unprivileged flag, {agent} commands should not be run with sudo. Doing so may result in an error due to the agent not having the required privileges.

For example, when you run {agent} with the --unprivileged flag, running the elastic-agent inspect command will result in an error like the following:

Error: error loading agent config: error loading raw config: fail to read configuration /Library/Elastic/Agent/fleet.enc for the elastic-agent: fail to decode bytes: cipher: message authentication failed

To resolve this, either install {agent} without the --unprivileged flag so that it has administrative access, or run the {agent} commands without the sudo prefix.
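
For example, on an --unprivileged installation, run the commands directly as your regular user:

elastic-agent status
elastic-agent inspect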

Troubleshoot {agent} installation on Kubernetes, with Kustomize

Potential issues during {agent} installation on Kubernetes can be categorized into two main areas:

Problems related to the creation of objects within the manifest

When troubleshooting installations performed with Kustomize, it’s good practice to inspect the output of the rendered manifest. To do this, take the installation command provided by Kibana Onboarding and replace the final part, | kubectl apply -f-, with a redirection to a local file. This allows for easier analysis of the rendered output.

For example, the following command, originally provided by {kib} for an {agent} Standalone installation, has been modified to redirect the output for troubleshooting purposes:

kubectl kustomize https://github.com/elastic/elastic-agent/deploy/kubernetes/elastic-agent-kustomize/default/elastic-agent-standalone\?ref\=v8.15.3 | sed -e 's/JUFQSV9LRVkl/ZDAyNnZaSUJ3eWIwSUlCT0duRGs6Q1JfYmJoVFRUQktoN2dXTkd0FNMtdw==/g' -e "s/%ES_HOST%/https:\/\/7a912e8674a34086eacd0e3d615e6048.us-west2.gcp.elastic-cloud.com:443/g" -e "s/%ONBOARDING_ID%/db687358-2c1f-4ec9-86e0-8f1baa4912ed/g" -e "s/\(docker.elastic.co\/beats\/elastic-agent:\).*$/\18.15.3/g" -e "/{CA_TRUSTED}/c\ " > elastic_agent_installation_complete_manifest.yaml

The previous command generates a local file named elastic_agent_installation_complete_manifest.yaml, which you can use for further analysis. It contains the complete set of resources required for the {agent} installation, including:

  • RBAC objects (ServiceAccounts, Roles, etc.)

  • ConfigMaps and Secrets for {agent} configuration

  • {agent} Standalone deployed as a DaemonSet

  • Kube-state-metrics deployed as a Deployment

The content of this file is equivalent to what you’d obtain by following the [running-on-kubernetes-standalone] steps, with the exception that kube-state-metrics is not included in the standalone method.

Possible issues

  • If your user doesn’t have cluster-admin privileges, the RBAC resources creation might fail.

  • Some Kubernetes security mechanisms (like Pod Security Standards) could cause part of the manifest to be rejected, as hostNetwork access and hostPath volumes are required.

  • If you already have an installation of kube-state-metrics, it could cause part of the manifest installation to fail or to update your existing resources without notice.

Failures occurring within specific components after installation

If the installation is correct and all resources are deployed, but data is not flowing as expected (for example, you don’t see any data on the [Metrics Kubernetes] Cluster Overview dashboard), check the following items:

  1. Check resources status and ensure they are all in a Running state:

    kubectl get pods -n kube-system | grep elastic
    kubectl get pods -n kube-system | grep kube-state-metrics
    Note

    The default configuration assumes that both kube-state-metrics and the {agent} DaemonSet are deployed in the same namespace for communication purposes. If you change the namespace of any of the components, the agent configuration will need further policy updates.

  2. Describe the Pods if they are in a Pending state:

    kubectl describe -n kube-system <name_of_elastic_agent_pod>
  3. Check the logs of elastic-agents and kube-state-metrics, and look for errors or warnings:

    kubectl logs -n kube-system <name_of_elastic_agent_pod>
    kubectl logs -n kube-system <name_of_elastic_agent_pod> | grep -i error
    kubectl logs -n kube-system <name_of_elastic_agent_pod> | grep -i warn
    kubectl logs -n kube-system <name_of_kube-state-metrics_pod>

Possible issues

  • Connectivity, authorization, or authentication issues when connecting to {es}:

    Ensure that the API Key and the {es} destination endpoint used during the installation are correct and reachable from within the Pods.

    In an already installed system, the API Key is stored in a Secret named elastic-agent-creds-<hash>, and the endpoint is configured in the ConfigMap elastic-agent-configs-<hash>. See the inspection commands after this list.

  • Missing cluster-level metrics (provided by kube-state-metrics):

    As described in [running-on-kubernetes-standalone], the {agent} Pod acting as leader is responsible for retrieving cluster-level metrics from kube-state-metrics and delivering them to {ref}/data-streams.html[data streams] prefixed as metrics-kubernetes.state_<resource>. In order to troubleshoot a situation where these metrics are not appearing:

    1. Determine which Pod owns the leadership lease in the cluster, with:

      kubectl get lease -n kube-system elastic-agent-cluster-leader
    2. Check the logs of that Pod to see if there are errors when connecting to kube-state-metrics and if the state_* metrics are being sent to {es}.

      One way to check if state_* metrics are being delivered to {es} is to inspect log lines with the "Non-zero metrics in the last 30s" message and check the values of the state_* metrics within the line, with something like:

      kubectl logs -n kube-system elastic-agent-xxxx | grep "Non-zero metrics" | grep "state_"

      If the previous command returns "state_pod":{"events":213,"success":213} or similar for all state_* metrics, it means the metrics are being delivered.

    3. As a last resort, if you believe none of the Pods is acting as a leader, you can try deleting the lease to generate a new one:

      kubectl delete lease -n kube-system elastic-agent-cluster-leader
      # wait a few seconds and check for the lease again
      kubectl get lease -n kube-system elastic-agent-cluster-leader
  • Performance problems:

    Monitor the CPU and memory usage of the agent Pods and adjust the manifest requests and limits as needed. Refer to [scaling-on-kubernetes] for more details about the required resources.
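
To inspect the connection details mentioned in the connectivity item above, the following hedged commands can help; the Secret and ConfigMap names end in a hash, so list them first, and kube-system is the namespace used by the default manifest:

# List the generated Secret and ConfigMap (names end in a hash)
kubectl get secrets,configmaps -n kube-system | grep elastic-agent

# View the rendered agent configuration, including the {es} endpoint
kubectl get configmap -n kube-system elastic-agent-configs-<hash> -o yaml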

Extra resources for {agent} on Kubernetes troubleshooting and information:

Troubleshoot {agent} on Kubernetes seeing invalid api key to authenticate with fleet in logs

If an agent was unenrolled from a Kubernetes cluster, there might be data remaining in /var/lib/elastic-agent-managed/kube-system/state on the node(s). Reenrolling an agent later on the same nodes might then result in invalid api key to authenticate with fleet error messages.

To avoid these errors, make sure to delete this state-folder before enrolling a new agent.
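
As a hedged example, run the following on each affected node before re-enrolling; the path is the one mentioned above, so adjust it if your deployment uses a different host path:

sudo rm -rf /var/lib/elastic-agent-managed/kube-system/state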

For more information, refer to issue #3586.