
Agent does not apply an invalid proxy for fleet-server, but shows as unhealthy #4472

Open
AndersonQ opened this issue Mar 22, 2024 · 11 comments
Labels
bug (Something isn't working), Team:Elastic-Agent (Label for the Agent team)

Comments

@AndersonQ
Member

AndersonQ commented Mar 22, 2024

[Screenshot attached: 2024-04-22 at 4:29:51 PM]

Steps to Reproduce:

  • find the Elastic Cloud and artifacts API IPs:
nslookup ES_HOST
## address from fleetUI
nslookup 9a565b4629ba489e92c2d1ce5f829741.us-west2.gcp.elastic-cloud.com
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
9a565b4629ba489e92c2d1ce5f829741.us-west2.gcp.elastic-cloud.com	canonical name = proxy-production-us-west2.gcp.elastic-cloud.com.
proxy-production-us-west2.gcp.elastic-cloud.com	canonical name = proxy-production-us-west2-v2.gcp.elastic-cloud.com.
Name:	proxy-production-us-west2-v2.gcp.elastic-cloud.com
Address: 35.235.72.223

## address from cloud UI
nslookup my-deployment-cc39c3.es.us-west2.gcp.elastic-cloud.com
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
my-deployment-cc39c3.es.us-west2.gcp.elastic-cloud.com	canonical name = proxy-production-us-west2.gcp.elastic-cloud.com.
proxy-production-us-west2.gcp.elastic-cloud.com	canonical name = proxy-production-us-west2-v2.gcp.elastic-cloud.com.
Name:	proxy-production-us-west2-v2.gcp.elastic-cloud.com
Address: 35.235.72.223

## address from fleetUI
nslookup artifacts.elastic.co
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
artifacts.elastic.co	canonical name = infra-cdn.elastic.co.
Name:	infra-cdn.elastic.co
Address: 34.120.127.130
Name:	infra-cdn.elastic.co
Address: 2600:1901:0:1d7::
  • block the IPs
iptables -A INPUT -j DROP -d 34.120.127.130
iptables -A OUTPUT -j DROP -d 34.120.127.130
ip6tables -A OUTPUT -j DROP -d 2600:1901:0:1d7::
ip6tables -A INPUT -j DROP -d 2600:1901:0:1d7::
  • run a squid proxy (http://10.80.40.162:3128) on another VM with an allow-all config
  • add the proxy in Fleet UI for the ES output, fleet-server, and agent binary download
  • install the agent
./elastic-agent-8.13.0-linux-x86_64/elastic-agent install -nf --url=https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443 --enrollment-token=ENROLLMENT_TOKEN --proxy-url=http://10.80.40.162:3128
  • add an invalid proxy (http://10.40.80.1:8888) in Fleet settings
  • assign the invalid proxy to the fleet server
  • the agent status shows as failed:
Every 2.0s: /opt/Elastic/Agent/elastic-agent stat...  elastic-agent: Wed Mar 20 16:43:42 2024

┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (FAILED) Actions: fail to communicate with Fleet Server API client hosts: all hosts failed: 1 error occurred:
   │      * requester 0/1 to host https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/ errored: Get "https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/api/status?": context deadline exceeded
   │
   │
   ├─ info
   │  ├─ id: 287e45c6-635e-4461-8c85-4d58704172d2
   │  ├─ version: 8.13.0
   │  └─ commit: 533443d148f4cf71e7c3e8efb736eda8275c4f69
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41285'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41322'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41294'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ log-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41251'
   │  ├─ log-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ log-default-logfile-system-d223f8b3-cd86-4c37-8fed-a00e6fbbf3cd
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '41269'
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ system/metrics-default-system/metrics-system-d223f8b3-cd86-4c37-8fed-a00e6fbbf3cd

The status does eventually clear if you delete the incorrect proxy.

@AndersonQ added the bug and Team:Elastic-Agent labels on Mar 22, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@pierrehilbert
Contributor

Are we sure the config is not applied for the fleet-server part?

@AndersonQ
Member Author

I could check again, but yes, the agent was not applying the config. A simple test is to reproduce the issue, fix the proxy in the policy, and observe that the agent reports as healthy again.

@cmacknz
Member

cmacknz commented Mar 25, 2024

┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (FAILED) Actions: fail to communicate with Fleet Server API client hosts: all hosts failed: 1 error occurred:
   │      * requester 0/1 to host https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/ errored: Get "https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/api/status?": context deadline exceeded

Why is the Fleet status healthy but the agent status isn't? The reason we use a separate Fleet status in the first place was so we'd stop considering transient Fleet errors a reason why the agent would be unhealthy (and if the agent is offline, it can't report Fleet status anyway).

The error appears to be coming from:

resp, err := client.Send(ctx, http.MethodGet, "/api/status", nil, nil, nil)
if err != nil {
	return errors.New(
		err, "fail to communicate with Fleet Server API client hosts",
		errors.TypeNetwork, errors.M("hosts", h.config.Fleet.Client.Hosts))
}

I think that function might be globally setting the agent status regardless of where it was called from:

// Using the same lock that was used for sorting above
c.clientLock.Lock()
requester.SetLastError(err)
c.clientLock.Unlock()

@AndersonQ
Member Author

Why is the Fleet status healthy but the agent status isn't?

I thought it was a global-ish error state for the fleet client, but perhaps it isn't. As you pointed out, the fleet status is healthy, which is correct. Paying closer attention to the error, it starts with Actions:, which leads me to believe the error is set because the Policy Change action failed. That is indeed correct, but the way it's presented is confusing.

I had a quick look at the code, and I believe this is where the error is collected and set on the agent status:

} else if c.actionsErr != nil {
	s.State = agentclient.Failed
	s.Message = fmt.Sprintf("Actions: %s", c.actionsErr.Error())

@cmacknz
Member

cmacknz commented Mar 26, 2024

What clears that error once it is set? Another successful action?

@AndersonQ
Member Author

@cmacknz, IIRC, yes, a successful action would clear the error.
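
For illustration, a minimal sketch of that latch-style behaviour, assuming hypothetical names rather than the real coordinator fields (the actual elastic-agent code is more involved): the last action error is recorded on failure and only replaced by the outcome of the next action, so a single failed Policy Change keeps the agent reporting FAILED until a later action succeeds.

package main

import (
	"errors"
	"fmt"
)

// coordinator is a hypothetical stand-in for the component that tracks the
// last action error; only the set/clear behaviour discussed above is modelled.
type coordinator struct {
	actionsErr error
}

// handleActionResult records the outcome of every dispatched action: a failure
// latches the error, and the next successful action clears it again.
func (c *coordinator) handleActionResult(err error) {
	c.actionsErr = err
}

func (c *coordinator) status() string {
	if c.actionsErr != nil {
		return fmt.Sprintf("FAILED Actions: %s", c.actionsErr)
	}
	return "HEALTHY"
}

func main() {
	c := &coordinator{}
	c.handleActionResult(errors.New("fail to communicate with Fleet Server API client hosts"))
	fmt.Println(c.status()) // FAILED until another action succeeds
	c.handleActionResult(nil) // e.g. a later Policy Change that applies cleanly
	fmt.Println(c.status()) // HEALTHY again
}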

@pierrehilbert @cmacknz it's still relevant, right?

@nimarezainia
Contributor

nimarezainia commented Jun 6, 2024

I would say this is very relevant. Perhaps even related to this: https://github.com/elastic/ingest-dev/issues/3234
We do want to inform the user if there are proxy issues, ideally before the config is applied.

@AndersonQ
Member Author

@nimarezainia, what do you mean by informing the user before the config is applied?

I'm wondering if you mean somehow testing it before sending it to the agents.
The only way to be 100% sure the proxy config indeed works is to send it to the agent so the agent can test it. And it is per agent: the same config might be valid for one agent but invalid for another.

@amitkanfer
Contributor

I believe Nima is referring to a two-phase commit protocol, which I don't think we want to focus on right now: basically, all agents report back to Fleet Server that a new config is valid (the "prepare" phase), and only then does the "commit" phase happen, where all agents apply the new config.
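
A rough sketch of what that prepare/commit split could look like on the agent side; all names here (configChange, prepare, commit) are invented for illustration and are not part of the actual elastic-agent or fleet-server code, and a real prepare step would test connectivity rather than only checking the value.

package main

import (
	"errors"
	"fmt"
)

// configChange is a hypothetical config rollout handled in two phases.
type configChange struct {
	ID    string
	Proxy string
}

// prepare validates the change locally and reports the result back without
// applying anything; Fleet Server would abort the rollout on failure.
func prepare(c configChange) error {
	if c.Proxy == "" {
		return errors.New("empty proxy URL")
	}
	// a real agent would test the route through the proxy here
	return nil
}

// commit applies the change; it would only run after every targeted agent
// reported a successful prepare.
func commit(c configChange) {
	fmt.Printf("applying config %s with proxy %s\n", c.ID, c.Proxy)
}

func main() {
	change := configChange{ID: "policy-change-1", Proxy: "http://10.40.80.1:8888"}
	if err := prepare(change); err != nil {
		fmt.Println("prepare failed, keeping the old config:", err)
		return
	}
	commit(change)
}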

@nimarezainia
Copy link
Contributor

Yes, a two-phase commit would work. Many of these configs (as @AndersonQ stated) would need to be tested at the agent itself. I am thinking mainly of connectivity-related configurations, like the connection to Fleet Server, outputs, or the download source: before that config is applied, test whether you even have a route to the endpoint, then apply/commit the configuration. If the test fails, don't change the config and flag this.

We don't want a small mistake in the configuration to bring down the whole Fleet.
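
As an illustration of that kind of pre-flight check, a minimal Go sketch that tests whether a candidate proxy can actually reach the Fleet Server status endpoint before the config is applied; the URLs, the 10-second timeout, and the function name are placeholders taken from this issue's reproduction steps, not part of the actual elastic-agent code.

package main

import (
	"context"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// checkProxyRoute sends a single request to the Fleet Server status endpoint
// through the candidate proxy. If it fails, the caller can keep the previous
// config and flag the problem instead of applying the broken proxy.
func checkProxyRoute(ctx context.Context, proxyURL, fleetURL string) error {
	proxy, err := url.Parse(proxyURL)
	if err != nil {
		return fmt.Errorf("invalid proxy URL: %w", err)
	}
	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxy)},
		Timeout:   10 * time.Second,
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, fleetURL+"/api/status", nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("proxy %s cannot reach %s: %w", proxyURL, fleetURL, err)
	}
	resp.Body.Close()
	return nil
}

func main() {
	// placeholder values taken from the reproduction steps above
	err := checkProxyRoute(context.Background(),
		"http://10.40.80.1:8888",
		"https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443")
	if err != nil {
		fmt.Println("pre-flight check failed, keeping previous config:", err)
		return
	}
	fmt.Println("proxy route OK, safe to apply the new config")
}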
