
Agent does not apply an invalid proxy for fleet-server, but shows as unhealthy #4472

Open
AndersonQ opened this issue Mar 22, 2024 · 11 comments
Labels
bug (Something isn't working), Team:Elastic-Agent (Label for the Agent team)

Comments

@AndersonQ
Member

AndersonQ commented Mar 22, 2024

[Screenshot attached: 2024-04-22 at 4:29:51 PM]

Steps to Reproduce:

  • find the Elastic Cloud and artifacts API IPs:
nslookup ES_HOST
## address from fleetUI
nslookup 9a565b4629ba489e92c2d1ce5f829741.us-west2.gcp.elastic-cloud.com
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
9a565b4629ba489e92c2d1ce5f829741.us-west2.gcp.elastic-cloud.com	canonical name = proxy-production-us-west2.gcp.elastic-cloud.com.
proxy-production-us-west2.gcp.elastic-cloud.com	canonical name = proxy-production-us-west2-v2.gcp.elastic-cloud.com.
Name:	proxy-production-us-west2-v2.gcp.elastic-cloud.com
Address: 35.235.72.223

## address from cloud UI
nslookup my-deployment-cc39c3.es.us-west2.gcp.elastic-cloud.com
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
my-deployment-cc39c3.es.us-west2.gcp.elastic-cloud.com	canonical name = proxy-production-us-west2.gcp.elastic-cloud.com.
proxy-production-us-west2.gcp.elastic-cloud.com	canonical name = proxy-production-us-west2-v2.gcp.elastic-cloud.com.
Name:	proxy-production-us-west2-v2.gcp.elastic-cloud.com
Address: 35.235.72.223

## address from fleetUI
nslookup artifacts.elastic.co
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
artifacts.elastic.co	canonical name = infra-cdn.elastic.co.
Name:	infra-cdn.elastic.co
Address: 34.120.127.130
Name:	infra-cdn.elastic.co
Address: 2600:1901:0:1d7::
  • block the IPs
iptables -A INPUT -j DROP -d 34.120.127.130
iptables -A OUTPUT -j DROP -d 34.120.127.130
ip6tables -A OUTPUT -j DROP -d 2600:1901:0:1d7::
ip6tables -A INPUT -j DROP -d 2600:1901:0:1d7::
  • run a squid proxy (http://10.80.40.162:3128) on another VM with an allow-all config
  • add the proxy in Fleet UI for the ES output, fleet-server, and agent binary download
  • install the agent
./elastic-agent-8.13.0-linux-x86_64/elastic-agent install -nf --url=https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443 --enrollment-token=ENROLLMENT_TOKEN --proxy-url=http://10.80.40.162:3128
  • add an invalid proxy (http://10.40.80.1:8888) in Fleet settings
  • assign the invalid proxy to the fleet server
  • the agent status shows as failed:
Every 2.0s: /opt/Elastic/Agent/elastic-agent stat...  elastic-agent: Wed Mar 20 16:43:42 2024

┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (FAILED) Actions: fail to communicate with Fleet Server API client hosts: all hosts failed: 1 error occurred:
   │      * requester 0/1 to host https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/ errored: Get "https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/api/status?": context deadline exceeded
   │
   │
   ├─ info
   │  ├─ id: 287e45c6-635e-4461-8c85-4d58704172d2
   │  ├─ version: 8.13.0
   │  └─ commit: 533443d148f4cf71e7c3e8efb736eda8275c4f69
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41285'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41322'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41294'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ log-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '41251'
   │  ├─ log-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ log-default-logfile-system-d223f8b3-cd86-4c37-8fed-a00e6fbbf3cd
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '41269'
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ system/metrics-default-system/metrics-system-d223f8b3-cd86-4c37-8fed-a00e6fbbf3cd

The status does eventually clear if you delete the incorrect proxy.

@AndersonQ added the bug and Team:Elastic-Agent labels on Mar 22, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@pierrehilbert
Contributor

Are we sure the config is not applied for the fleet-server part?

@AndersonQ
Member Author

I could check again, but yes, the agent was not applying the config. A simple test is to reproduce the issue, fix the proxy in the policy, and observe that the agent reports as healthy again.

@cmacknz
Member

cmacknz commented Mar 25, 2024

┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (FAILED) Actions: fail to communicate with Fleet Server API client hosts: all hosts failed: 1 error occurred:
   │      * requester 0/1 to host https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/ errored: Get "https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443/api/status?": context deadline exceeded

Why is the Fleet status healthy but the agent status isn't? The reason we use a separate Fleet status in the first place was so we'd stop considering transient Fleet errors a reason why the agent would be unhealthy (and if the agent is offline, it can't report Fleet status anyway).

The error appears to be coming from:

resp, err := client.Send(ctx, http.MethodGet, "/api/status", nil, nil, nil)
if err != nil {
	return errors.New(
		err, "fail to communicate with Fleet Server API client hosts",
		errors.TypeNetwork, errors.M("hosts", h.config.Fleet.Client.Hosts))
}

I think that function might be globally setting the agent status regardless of where it was called from:

// Using the same lock that was used for sorting above
c.clientLock.Lock()
requester.SetLastError(err)
c.clientLock.Unlock()

@AndersonQ
Member Author

Why is the Fleet status healthy but the agent status isn't?

I thought it was a global-ish error state for the fleet client, but perhaps it isn't. As you pointed out, the fleet status is healthy, which is correct. Paying closer attention to the error, it starts with Actions:, which leads me to believe the error is set because the Policy Change action failed. That is indeed correct, but the way it's presented is confusing.

I had a quick look at the code, and I believe this is where the error is collected and set on the agent status:

} else if c.actionsErr != nil {
	s.State = agentclient.Failed
	s.Message = fmt.Sprintf("Actions: %s", c.actionsErr.Error())

@cmacknz
Member

cmacknz commented Mar 26, 2024

What clears that error once it is set? Another successful action?

@AndersonQ
Member Author

@cmacknz, IIRC, yes, a successful action would clear the error.
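
For illustration, a minimal sketch of that latch-style behaviour, assuming hypothetical names rather than the real coordinator fields (the actual elastic-agent code is more involved): the last action error is recorded on failure and only replaced by the outcome of the next action, so a single failed Policy Change keeps the agent reporting FAILED until a later action succeeds.

package main

import (
	"errors"
	"fmt"
)

// coordinator is a hypothetical stand-in for the component that tracks the
// last action error; only the set/clear behaviour discussed above is modelled.
type coordinator struct {
	actionsErr error
}

// handleActionResult records the outcome of every dispatched action: a failure
// latches the error, and the next successful action clears it again.
func (c *coordinator) handleActionResult(err error) {
	c.actionsErr = err
}

func (c *coordinator) status() string {
	if c.actionsErr != nil {
		return fmt.Sprintf("FAILED Actions: %s", c.actionsErr)
	}
	return "HEALTHY"
}

func main() {
	c := &coordinator{}
	c.handleActionResult(errors.New("fail to communicate with Fleet Server API client hosts"))
	fmt.Println(c.status()) // FAILED until another action succeeds
	c.handleActionResult(nil) // e.g. a later Policy Change that applies cleanly
	fmt.Println(c.status()) // HEALTHY again
}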

@pierrehilbert @cmacknz it's still relevant, right?

@nimarezainia
Contributor

nimarezainia commented Jun 6, 2024

I would say this is very relevant. Perhaps even related to this: https://github.com/elastic/ingest-dev/issues/3234
We do want to inform the user if there are proxy issues, ideally before the config is applied.

@AndersonQ
Member Author

@nimarezainia, what do you mean by informing the user before the config is applied?

I'm wondering if you mean somehow testing it before sending it to the agents.
The only way to be 100% sure the proxy config indeed works is to send it to the agent so the agent can test it. And it is per agent: the same config might be valid for one agent but invalid for another.

@amitkanfer
Contributor

I believe Nima is referring to a two-phase commit protocol, which I don't think we want to focus on right now: basically, all agents report back to Fleet Server that a new config is valid (the "prepare" phase), and only then does the "commit" phase happen, where all agents apply the new config.
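
A rough sketch of what that prepare/commit split could look like on the agent side; all names here (configChange, prepare, commit) are invented for illustration and are not part of the actual elastic-agent or fleet-server code, and a real prepare step would test connectivity rather than only checking the value.

package main

import (
	"errors"
	"fmt"
)

// configChange is a hypothetical config rollout handled in two phases.
type configChange struct {
	ID    string
	Proxy string
}

// prepare validates the change locally and reports the result back without
// applying anything; Fleet Server would abort the rollout on failure.
func prepare(c configChange) error {
	if c.Proxy == "" {
		return errors.New("empty proxy URL")
	}
	// a real agent would test the route through the proxy here
	return nil
}

// commit applies the change; it would only run after every targeted agent
// reported a successful prepare.
func commit(c configChange) {
	fmt.Printf("applying config %s with proxy %s\n", c.ID, c.Proxy)
}

func main() {
	change := configChange{ID: "policy-change-1", Proxy: "http://10.40.80.1:8888"}
	if err := prepare(change); err != nil {
		fmt.Println("prepare failed, keeping the old config:", err)
		return
	}
	commit(change)
}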

@nimarezainia
Copy link
Contributor

Yes, a two-phase commit would work. Many of these configs (as @AndersonQ stated) would need to be tested at the agent itself. I am thinking mainly of connectivity-related configurations, like the connection to Fleet Server, outputs, or the download source: before that config is applied, test whether you even have a route to the endpoint, then apply/commit the configuration. If the test fails, don't change the config and flag this.

We don't want a small mistake in the configuration to bring down the whole Fleet.
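
As an illustration of that kind of pre-flight check, a minimal Go sketch that tests whether a candidate proxy can actually reach the Fleet Server status endpoint before the config is applied; the URLs, the 10-second timeout, and the function name are placeholders taken from this issue's reproduction steps, not part of the actual elastic-agent code.

package main

import (
	"context"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// checkProxyRoute sends a single request to the Fleet Server status endpoint
// through the candidate proxy. If it fails, the caller can keep the previous
// config and flag the problem instead of applying the broken proxy.
func checkProxyRoute(ctx context.Context, proxyURL, fleetURL string) error {
	proxy, err := url.Parse(proxyURL)
	if err != nil {
		return fmt.Errorf("invalid proxy URL: %w", err)
	}
	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxy)},
		Timeout:   10 * time.Second,
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, fleetURL+"/api/status", nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("proxy %s cannot reach %s: %w", proxyURL, fleetURL, err)
	}
	resp.Body.Close()
	return nil
}

func main() {
	// placeholder values taken from the reproduction steps above
	err := checkProxyRoute(context.Background(),
		"http://10.40.80.1:8888",
		"https://cc39c3bc6adb42aaab83f73160a4b638.fleet.us-west2.gcp.elastic-cloud.com:443")
	if err != nil {
		fmt.Println("pre-flight check failed, keeping previous config:", err)
		return
	}
	fmt.Println("proxy route OK, safe to apply the new config")
}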
