[Flaky Test]: Multiple test cases fail because of the artifact API / CDN unavailability #4268

Closed
rdner opened this issue Feb 15, 2024 · 19 comments
Labels
flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team

Comments

@rdner
Member

rdner commented Feb 15, 2024

Any test case that downloads artifacts (e.g. TestStandaloneUpgrade*, TestUpgrade*, TestFleetManagedUpgrade*) may fail because the artifact API or CDN server is unavailable.

Here are a few examples (each timestamp is the one recorded on the Buildkite failure, so it's approximate):

Artifact API

2024-02-12T20:28:05.483Z

https://buildkite.com/elastic/elastic-agent/builds/7062#018d9ec4-9b4a-4877-b32e-4d9810e99d75

    fixture.go:632: >> running binary with: [/tmp/TestStandaloneUpgradeRetryDownload3070379000/001/elastic-agent-8.13.0-SNAPSHOT-linux-x86_64/elastic-agent version --binary-only --yaml]
    upgrade_standalone_retry_test.go:58: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_standalone_retry_test.go:58
        	Error:      	Received unexpected error:
        	            	failed to get a list of builds: 503: bad http status code
        	Test:       	TestStandaloneUpgradeRetryDownload

2024-02-13T21:24:31.816Z

https://buildkite.com/elastic/elastic-agent/builds/7141#018da41e-2c94-4883-a427-5120735369e5

    upgrade_broken_package_test.go:45: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_broken_package_test.go:45
        	Error:      	Received unexpected error:
        	            	error retrieving versions from Artifact API: getting versions: executing http request &{GET https://artifacts-api.elastic.co/v1/versions/ HTTP/1.1 1 1 map[]   0 [] false artifacts-api.elastic.co map[] map[]  map[]      0xc0000fb960}: Get "https://artifacts-api.elastic.co/v1/versions/": dial tcp 35.188.12.98:443: i/o timeout
        	Test:       	TestUpgradeBrokenPackageVersion

CDN

2024-02-13T16:11:14.713Z

https://buildkite.com/elastic/elastic-agent/builds/7119#018da2fc-1a7d-443b-9ed7-89566b637e75

fetcher_artifact.go:344: Downloading artifact progress 39.78%
    upgrade_standalone_test.go:77: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_standalone_test.go:77
        	            				/home/ubuntu/agent/testing/integration/upgrade_standalone_test.go:52
        	Error:      	Received unexpected error:
        	            	failed to prepare the startFixture: failed to download https://staging.elastic.co/7.17.18-bdb3703e/downloads/beats/elastic-agent/elastic-agent-7.17.18-linux-arm64.tar.gz: failed to write file /home/ubuntu/agent/.agent-testing/artifact/elastic-agent-7.17.18-linux-arm64.tar.gz: stream error: stream ID 1; INTERNAL_ERROR; received from peer
        	Test:       	TestStandaloneUpgrade/Upgrade_7.17.18_to_8.13.0-SNAPSHOT_(privileged)

2024-02-14T20:02:31.839Z

https://buildkite.com/elastic/elastic-agent/builds/7183#018da8f0-9957-45fc-9622-696d327bebf7

    fetcher_artifact.go:344: Downloading artifact progress 90.63%
    upgrade_fleet_test.go:406: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:406
        	            				/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:149
        	            				/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:127
        	Error:      	Received unexpected error:
        	            	failed to download https://snapshots.elastic.co/8.13.0-772867d3/downloads/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-linux-x86_64.tar.gz: failed to write file /tmp/TestFleetAirGappedUpgradePrivileged2800565667/002/downloads/beats/elastic-agent/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-linux-x86_64.tar.gz: read tcp 10.128.0.249:44254->34.120.127.130:443: read: connection reset by peer
        	Test:       	TestFleetAirGappedUpgradePrivileged
        	Messages:   	could not download agent 8.13.0-SNAPSHOT
--- FAIL: TestFleetAirGappedUpgradePrivileged (78.33s)
@rdner rdner added Team:Elastic-Agent Label for the Agent team flaky-test Unstable or unreliable test cases. labels Feb 15, 2024
@rdner rdner self-assigned this Feb 15, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@rdner rdner changed the title [Flaky Test]: Multiple test cases fail because of the artifact API unavailability [Flaky Test]: Multiple test cases fail because of the artifact API / CDN unavailability Feb 15, 2024
@rdner
Member Author

rdner commented Feb 15, 2024

failed to get a list of builds: 503: bad http status code

This error from the artifact API was a known outage that lasted for 4 minutes.

@rdner
Member Author

rdner commented Feb 20, 2024

Another failure due to a timeout on GET https://artifacts-api.elastic.co/v1/versions/8.13.0-SNAPSHOT/builds/

upgrade_uninstall_test.go:60: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_uninstall_test.go:60
        	Error:      	Received unexpected error:
        	            	failed to get a list of builds: getting builds for version 8.13.0-SNAPSHOT: executing http request &{GET https://artifacts-api.elastic.co/v1/versions/8.13.0-SNAPSHOT/builds/ HTTP/1.1 1 1 map[]   0 [] false artifacts-api.elastic.co map[] map[]  map[]      0x40002ae000}: Get "https://artifacts-api.elastic.co/v1/versions/8.13.0-SNAPSHOT/builds/": dial tcp 35.188.12.98:443: i/o timeout
        	Test:       	TestStandaloneUpgradeUninstallKillWatcher
--- FAIL: TestStandaloneUpgradeUninstallKillWatcher (43.64s)

https://buildkite.com/elastic/elastic-agent/builds/7268#018dc3d2-0f6e-438c-8510-835b6e4334af

@rdner
Member Author

rdner commented Feb 20, 2024

I think we should implement retries in our artifact API client to mitigate this problem.

@rdner
Member Author

rdner commented Feb 20, 2024

Another failure on GET https://artifacts-api.elastic.co/v1/versions/ in https://buildkite.com/elastic/elastic-agent/builds/7275#018dc65b-8d69-4b23-be5f-81d07defc3fd

=== RUN   TestStandaloneUpgradeRollback
    upgrade_rollback_test.go:55: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_rollback_test.go:55
        	Error:      	Received unexpected error:
        	            	error retrieving versions from Artifact API: getting versions: executing http request &{GET https://artifacts-api.elastic.co/v1/versions/ HTTP/1.1 1 1 map[]   0 [] false artifacts-api.elastic.co map[] map[]  map[]      0x4000258000}: Get "https://artifacts-api.elastic.co/v1/versions/": dial tcp 35.188.12.98:443: i/o timeout
        	Test:       	TestStandaloneUpgradeRollback
--- FAIL: TestStandaloneUpgradeRollback (30.01s)

@rdner
Member Author

rdner commented Feb 21, 2024

Another failure in nearly all the tests that use the artifact API; this time it returned 503: https://buildkite.com/elastic/elastic-agent/builds/7305#018dc85a-fece-4c24-b9b5-ce18d1435ddc

Error sample:

 upgrade_fleet_test.go:90: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:90
        	            				/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:49
        	Error:      	Received unexpected error:
        	            	failed to find snapshot URI for version 8.13.0-SNAPSHOT: failed to find package URL: https://artifacts-api.elastic.co/v1/search/8.13.0-SNAPSHOT/elastic-agent; bad status: 503 Service Unavailable
        	Test:       	TestFleetManagedUpgradeUnprivileged

@rdner
Member Author

rdner commented Feb 21, 2024

Another failure: Get "https://artifacts-api.elastic.co/v1/versions/": dial tcp 35.188.12.98:443: i/o timeout in https://buildkite.com/elastic/elastic-agent/builds/7339#018dcc31-3f59-49f1-a027-dc8e8e94c7ba

@rdner
Member Author

rdner commented Feb 26, 2024

Another failure here https://buildkite.com/elastic/elastic-agent/builds/7480#018de2b8-47e5-4e01-843a-c294f7e3b56b

    fetcher_artifact.go:344: Downloading artifact progress 73.09%
    upgrade_fleet_test.go:406: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:406
        	            				/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:149
        	            				/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:127
        	Error:      	Received unexpected error:
        	            	failed to download https://snapshots.elastic.co/8.14.0-d950d4db/downloads/beats/elastic-agent/elastic-agent-8.14.0-SNAPSHOT-linux-arm64.tar.gz: failed to write file /tmp/TestFleetAirGappedUpgradePrivileged1764917254/002/downloads/beats/elastic-agent/beats/elastic-agent/elastic-agent-8.14.0-SNAPSHOT-linux-arm64.tar.gz: stream error: stream ID 7; INTERNAL_ERROR; received from peer
        	Test:       	TestFleetAirGappedUpgradePrivileged
        	Messages:   	could not download agent 8.14.0-SNAPSHOT
--- FAIL: TestFleetAirGappedUpgradePrivileged (653.81s)

@rdner
Member Author

rdner commented Mar 6, 2024

We added retries for the artifact API in #4348.

@rdner
Member Author

rdner commented Mar 6, 2024

Download got stuck at 50.99%

fetcher_artifact.go:344: Downloading artifact progress 50.99%
    upgrade_fleet_test.go:406: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:406
        	            				/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:149
        	            				/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:115
        	Error:      	Received unexpected error:
        	            	failed to download https://snapshots.elastic.co/8.14.0-f84a1ada/downloads/beats/elastic-agent/elastic-agent-8.14.0-SNAPSHOT-linux-arm64.tar.gz: failed to write file /tmp/TestFleetAirGappedUpgradeUnprivileged3011941863/002/downloads/beats/elastic-agent/beats/elastic-agent/elastic-agent-8.14.0-SNAPSHOT-linux-arm64.tar.gz: stream error: stream ID 7; INTERNAL_ERROR; received from peer
        	Test:       	TestFleetAirGappedUpgradeUnprivileged
        	Messages:   	could not download agent 8.14.0-SNAPSHOT

https://buildkite.com/elastic/elastic-agent/builds/7573#018e12ba-1e06-4cc8-9bfd-39a99763a99a

@rdner
Member Author

rdner commented Mar 6, 2024

Another download failed:

fetcher_artifact.go:344: Downloading artifact progress 2.83%
    upgrade_fleet_test.go:90: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:90
        	            				/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:49
        	Error:      	Received unexpected error:
        	            	failed to download https://snapshots.elastic.co/8.14.0-78df864a/downloads/beats/elastic-agent/elastic-agent-8.14.0-SNAPSHOT-linux-arm64.tar.gz: failed to write file /home/ubuntu/agent/.agent-testing/artifact/elastic-agent-8.14.0-SNAPSHOT-linux-arm64.tar.gz: unexpected EOF
        	Test:       	TestFleetManagedUpgradeUnprivileged

https://buildkite.com/elastic/elastic-agent/builds/7588#018e1528-f7da-4e02-b81d-0165e3b5d780

@rdner
Member Author

rdner commented Mar 7, 2024

Another failure of the artifact API:

upgrade_fleet_test.go:403: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:403
        	            				/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:149
        	            				/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:127
        	Error:      	Received unexpected error:
        	            	failed to find snapshot URI for version 8.14.0-SNAPSHOT: failed to find package URL: Get "https://artifacts-api.elastic.co/v1/search/8.14.0-SNAPSHOT/elastic-agent": read tcp 10.128.0.232:53734->35.188.12.98:443: read: connection reset by peer
        	Test:       	TestFleetAirGappedUpgradePrivileged
        	Messages:   	could not prepare fetcher to download agent 8.14.0-SNAPSHOT

https://buildkite.com/elastic/elastic-agent/builds/7588#018e159c-ee97-4fd6-899b-e378390a88e9

@rdner
Member Author

rdner commented Mar 13, 2024

Another failure on the artifact API in https://buildkite.com/elastic/elastic-agent/builds/7753#018e3765-0a82-4e78-b265-4f8cc9024146

=== RUN   TestStandaloneUpgrade/Upgrade_8.14.0-SNAPSHOT_to_8.14.0-SNAPSHOT_(privileged)
    upgrade_standalone_test.go:74: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_standalone_test.go:74
        	            				/home/ubuntu/agent/testing/integration/upgrade_standalone_test.go:49
        	Error:      	Received unexpected error:
        	            	failed to prepare before exec: failed to find snapshot URI for version 8.14.0-SNAPSHOT: failed to find package URL: Get "https://artifacts-api.elastic.co/v1/search/8.14.0-SNAPSHOT/elastic-agent": dial tcp 35.188.12.98:443: i/o timeout
        	Test:       	TestStandaloneUpgrade/Upgrade_8.14.0-SNAPSHOT_to_8.14.0-SNAPSHOT_(privileged)
--- FAIL: TestStandaloneUpgrade/Upgrade_8.14.0-SNAPSHOT_to_8.14.0-SNAPSHOT_(privileged) (30.02s)

@rdner
Member Author

rdner commented Mar 13, 2024

Lots of 503 responses from the artifact API in https://buildkite.com/elastic/elastic-agent/builds/7763#018e3881-3ee9-4e60-942d-51a1d0daad51

@rdner
Member Author

rdner commented Mar 13, 2024

Another failure with 503 from the artifact API https://buildkite.com/elastic/elastic-agent/builds/7761#018e3870-a7be-4f6c-a28d-c9c7c54a4a90

@rdner
Member Author

rdner commented Mar 21, 2024

@rdner
Member Author

rdner commented Mar 22, 2024

The artifact API failed in https://buildkite.com/elastic/elastic-agent/builds/7936#018e667e-1530-4a04-ab27-9e6bc0d60d25

failed to prepare before exec: failed to find snapshot URI for version 8.12.1: failed to find package URL: Get "https://artifacts-api.elastic.co/v1/search/8.12.1/elastic-agent": dial tcp 35.188.12.98:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

@rdner
Member Author

rdner commented Apr 26, 2024

On 2024-04-22 at 16:00 UTC we rolled out a new CDN proxy, and it seems to have solved the stuck downloads.

The artifact API part of the issue will be addressed by #4458.

@rdner rdner closed this as completed Apr 26, 2024
@amitkanfer
Contributor

Well done @rdner
