Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add wait for watcher #4229

Conversation

pchila
Copy link
Member

@pchila pchila commented Feb 8, 2024

What does this PR do?

This PR adds a wait after launching the upgrade watcher to make sure that the process is up and running before restarting agent.

Why is it important?

We want to avoid race conditions that may happen in case of downgrade where after restart an older watcher that does not understand the new update marker format may break the whole agent installation.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

Sorry, something went wrong.

Verified

This commit was signed with the committer’s verified signature.
tl-sl TimL
@pchila pchila added the bug Something isn't working label Feb 8, 2024
@pchila pchila requested a review from cmacknz February 8, 2024 18:03
@pchila pchila self-assigned this Feb 8, 2024
@pchila pchila requested a review from a team as a code owner February 8, 2024 18:03
@pchila pchila requested review from AndersonQ, ycombinator and blakerouse and removed request for a team February 8, 2024 18:03
@cmacknz cmacknz added the Team:Elastic-Agent Label for the Agent team label Feb 8, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Comment on lines 353 to 355
case <-watcherContext.Done():
log.Error("upgrade watcher did not start watching within %s or context has expired", waitTime)
return ErrWatcherNotStarted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you capture the error you should get different errors, which will explain if it was a deadline exceeded or a context cancelled.

Suggested change
case <-watcherContext.Done():
log.Error("upgrade watcher did not start watching within %s or context has expired", waitTime)
return ErrWatcherNotStarted
case err := <-watcherContext.Done():
log.Errorf("upgrade watcher did not start watching within %s or context has expired: %w", waitTime, err)
return ErrWatcherNotStarted

@pchila pchila requested a review from AndersonQ February 9, 2024 08:47
@rdner
Copy link
Member

rdner commented Feb 9, 2024

I think there is a legit failure on Mac in one of the unit tests:

=== Failed
=== FAIL: internal/pkg/agent/application/upgrade TestWaitForWatcher/Runaround_path:_marker_is_jumping_around_and_landing_on_watching (0.10s)
    upgrade_test.go:856:
        	Error Trace:	/Users/admin/builds/bk-agent-prod-orka-1707490299318330142/elastic/elastic-agent/internal/pkg/agent/application/upgrade/upgrade_test.go:856
        	Error:      	Received unexpected error:
        	            	watcher did not start in time
        	            	context deadline exceeded
        	Test:       	TestWaitForWatcher/Runaround_path:_marker_is_jumping_around_and_landing_on_watching
        	Messages:   	waitForWatcher /tmp/TestWaitForWatcherRunaround_path_marker_is_jumping_around_and_landing_on_watching3782740199/001/.update-marker, [UPG_REQUESTED UPG_SCHEDULED UPG_DOWNLOADING UPG_EXTRACTING UPG_REPLACING UPG_RESTARTING UPG_WATCHING], 1ms, 100ms)
=== FAIL: internal/pkg/agent/application/upgrade TestWaitForWatcher (0.29s)

@pchila
Copy link
Member Author

pchila commented Feb 9, 2024

I think there is a legit failure on Mac in one of the unit tests:

=== Failed
=== FAIL: internal/pkg/agent/application/upgrade TestWaitForWatcher/Runaround_path:_marker_is_jumping_around_and_landing_on_watching (0.10s)
    upgrade_test.go:856:
        	Error Trace:	/Users/admin/builds/bk-agent-prod-orka-1707490299318330142/elastic/elastic-agent/internal/pkg/agent/application/upgrade/upgrade_test.go:856
        	Error:      	Received unexpected error:
        	            	watcher did not start in time
        	            	context deadline exceeded
        	Test:       	TestWaitForWatcher/Runaround_path:_marker_is_jumping_around_and_landing_on_watching
        	Messages:   	waitForWatcher /tmp/TestWaitForWatcherRunaround_path_marker_is_jumping_around_and_landing_on_watching3782740199/001/.update-marker, [UPG_REQUESTED UPG_SCHEDULED UPG_DOWNLOADING UPG_EXTRACTING UPG_REPLACING UPG_RESTARTING UPG_WATCHING], 1ms, 100ms)
=== FAIL: internal/pkg/agent/application/upgrade TestWaitForWatcher (0.29s)

Yeah, I already slowed the test down for MacOS and Windows but it seems that it is still a bit too fast...

@rdner rdner added the backport-v8.12.0 Automated backport with mergify label Feb 12, 2024
@pchila pchila removed the backport-v8.12.0 Automated backport with mergify label Feb 12, 2024
@pchila
Copy link
Member Author

pchila commented Feb 12, 2024

@rdner I removed the backport v8.12.0 label from this PR in favor of #4238 which contains only the fix for the GRPC server shutdown.
I will still wait for https://buildkite.com/elastic/elastic-agent/builds/7004 to complete to verify that this fixed the problem as it was more apparent on this branch and then I will remove the commit with the GRPC fix from this branch

@rdner
Copy link
Member

rdner commented Feb 12, 2024

@pchila the backport label was for the watcher fix, we need to backport it because it's failing on 8.12 too, see one of the recent builds https://buildkite.com/elastic/elastic-agent/builds/7000#018d9c5d-a923-4f6a-90f5-218885912d91

Unless you think there is a strong reason why we should not backport this fix to 8.12, please set the label back.

@pchila
Copy link
Member Author

pchila commented Feb 12, 2024

I am not sure we can backport the fix for the watcher wait because it is built upon major changes contained in #4193

I will add the label back as a reminder but I doubt it will be a straightforward backport

@pchila pchila added the backport-v8.12.0 Automated backport with mergify label Feb 12, 2024
@pchila pchila force-pushed the add-wait-for-watcher branch from 8b4fe94 to aa2561c Compare February 12, 2024 12:07
Copy link

Quality Gate failed Quality Gate failed

Failed conditions

37.5% 37.5% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube

@pchila
Copy link
Member Author

pchila commented Feb 12, 2024

https://buildkite.com/elastic/elastic-agent/builds/7034 shows 4 failures:

Merging to feature branch

@pchila pchila merged commit 2813e1f into elastic:feature/add-agent-version-to-installation-directory-name Feb 12, 2024
pchila added a commit that referenced this pull request Feb 12, 2024
* Fix rollbackInstall order of execution and handle errors
* Add wait for watcher up and running
mergify bot pushed a commit that referenced this pull request Feb 12, 2024
* Fix rollbackInstall order of execution and handle errors
* Add wait for watcher up and running

(cherry picked from commit 2813e1f)

# Conflicts:
#	internal/pkg/agent/application/upgrade/rollback.go
#	internal/pkg/agent/application/upgrade/upgrade.go
#	internal/pkg/agent/application/upgrade/upgrade_test.go
#	internal/pkg/agent/cmd/run.go
pchila added a commit that referenced this pull request Feb 12, 2024
* Fix rollbackInstall order of execution and handle errors
* Add wait for watcher up and running
cmacknz added a commit that referenced this pull request Feb 12, 2024
* Use package manifest for install (#4173)

* use version in path for rpm and deb
* install elastic-agent remapping paths

* Use package manifests for upgrade (#4174)

* map paths when upgrading using .tar.gz packages
* use structured output from unpack to identify old and new agents directories
* copy state directories and rotate symlink correctly

* Support version in watcher cleanup rollback (#4176)

* extract directory paths to allow unit testing (instead of nested paths.* calls)
* introduce tests for watcher cleanup and rollback
* close data directory after cleanup

* Reintroduce upgrade version check (#4177)

* Fix shutdownCallback() directories
* reintroduce version check in upgrade

* Use watcher that supports paths in update marker (#4178)

* Invoke agent watcher capable of handling new paths

* Adapt TestShutdownCallback to run multiple testcases

* Fix shutdownCallback to work with version in path

* Fix MacOS relink step

* Add commit hash to manifest

* document manifest-aware upgrade
---------

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>

* small fixes based on PR feedback

* Add wait for watcher (#4229)

* Fix rollbackInstall order of execution and handle errors
* Add wait for watcher up and running

---------

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.12.0 Automated backport with mergify bug Something isn't working skip-changelog Team:Elastic-Agent Label for the Agent team
Projects
None yet
6 participants