Add wait for watcher #4229
Conversation
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
```go
case <-watcherContext.Done():
	log.Error("upgrade watcher did not start watching within %s or context has expired", waitTime)
	return ErrWatcherNotStarted
```
If you capture the error, you should get different errors, which will explain whether it was a deadline exceeded or a context cancellation.
Suggested change:

```diff
 case <-watcherContext.Done():
-	log.Error("upgrade watcher did not start watching within %s or context has expired", waitTime)
+	log.Errorf("upgrade watcher did not start watching within %s or context has expired: %v", waitTime, watcherContext.Err())
 	return ErrWatcherNotStarted
```
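For reference, here is a minimal self-contained sketch of that idea — telling a deadline apart from a cancellation when the wait fails. The names (`waitForWatcherStarted`, the `started` channel) are illustrative, not the PR's actual API:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

var ErrWatcherNotStarted = errors.New("upgrade watcher did not start watching")

// waitForWatcherStarted blocks until `started` is closed or the context
// ends. On failure it inspects ctx.Err() so the log line says whether the
// wait timed out or was cancelled upstream.
func waitForWatcherStarted(ctx context.Context, started <-chan struct{}, waitTime time.Duration) error {
	select {
	case <-started:
		return nil
	case <-ctx.Done():
		err := ctx.Err()
		if errors.Is(err, context.DeadlineExceeded) {
			log.Printf("upgrade watcher did not start watching within %s: %v", waitTime, err)
		} else { // context.Canceled
			log.Printf("upgrade watcher wait was cancelled: %v", err)
		}
		return ErrWatcherNotStarted
	}
}

func main() {
	waitTime := 100 * time.Millisecond
	ctx, cancel := context.WithTimeout(context.Background(), waitTime)
	defer cancel()
	// Nothing ever closes `started` here, so this logs a deadline-exceeded error.
	if err := waitForWatcherStarted(ctx, make(chan struct{}), waitTime); err != nil {
		log.Println(err)
	}
}
```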
I think there is a legit failure on Mac in one of the unit tests:
Yeah, I already slowed the test down for macOS and Windows, but it seems that it is still a bit too fast...
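For context, one common way to give slower CI platforms more headroom in timing-sensitive tests is a per-OS budget. This is a sketch with made-up values, not the actual test change from this PR:

```go
package upgradetest

import (
	"runtime"
	"time"
)

// watcherStartTimeout returns how long a test waits for the watcher to come
// up. The durations are illustrative; the point is the per-GOOS split.
func watcherStartTimeout() time.Duration {
	switch runtime.GOOS {
	case "darwin", "windows": // process start-up tends to be slower on these CI runners
		return 2 * time.Second
	default:
		return 500 * time.Millisecond
	}
}
```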
@rdner I removed the backport v8.12.0 label from this PR in favor of #4238, which contains only the fix for the gRPC server shutdown.
@pchila the backport label was for the watcher fix; we need to backport it because it's failing on 8.12 too, see one of the recent builds: https://buildkite.com/elastic/elastic-agent/builds/7000#018d9c5d-a923-4f6a-90f5-218885912d91. Unless you think there is a strong reason why we should not backport this fix to 8.12, please set the label back.
I am not sure we can backport the fix for the watcher wait because it is built upon major changes contained in #4193. I will add the label back as a reminder, but I doubt it will be a straightforward backport.
8b4fe94 to aa2561c
https://buildkite.com/elastic/elastic-agent/builds/7034 shows 4 failures:
Merging to feature branch
2813e1f into elastic:feature/add-agent-version-to-installation-directory-name
* Fix rollbackInstall order of execution and handle errors
* Add wait for watcher up and running
* Fix rollbackInstall order of execution and handle errors
* Add wait for watcher up and running

(cherry picked from commit 2813e1f)

Conflicts:
- internal/pkg/agent/application/upgrade/rollback.go
- internal/pkg/agent/application/upgrade/upgrade.go
- internal/pkg/agent/application/upgrade/upgrade_test.go
- internal/pkg/agent/cmd/run.go
* Fix rollbackInstall order of execution and handle errors
* Add wait for watcher up and running
* Use package manifest for install (#4173)
  * use version in path for rpm and deb
  * install elastic-agent remapping paths
* Use package manifests for upgrade (#4174)
  * map paths when upgrading using .tar.gz packages
  * use structured output from unpack to identify old and new agents directories
  * copy state directories and rotate symlink correctly
* Support version in watcher cleanup rollback (#4176)
  * extract directory paths to allow unit testing (instead of nested paths.* calls)
  * introduce tests for watcher cleanup and rollback
  * close data directory after cleanup
* Reintroduce upgrade version check (#4177)
  * Fix shutdownCallback() directories
  * reintroduce version check in upgrade
* Use watcher that supports paths in update marker (#4178)
  * Invoke agent watcher capable of handling new paths
  * Adapt TestShutdownCallback to run multiple testcases
  * Fix shutdownCallback to work with version in path
* Fix MacOS relink step
* Add commit hash to manifest
* document manifest-aware upgrade

---------

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>

* small fixes based on PR feedback
* Add wait for watcher (#4229)
  * Fix rollbackInstall order of execution and handle errors
  * Add wait for watcher up and running

---------

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
What does this PR do?
This PR adds a wait after launching the upgrade watcher to make sure that the process is up and running before restarting the agent.
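In outline, the sequence looks like the sketch below. Everything in it is hypothetical (the ready-file readiness signal, `readyPath`, the polling interval); it only illustrates the launch-then-wait-then-restart ordering, not the PR's actual mechanism:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"time"
)

var ErrWatcherNotStarted = errors.New("upgrade watcher did not start watching")

// waitForWatcher polls for evidence that the watcher process is up — here a
// hypothetical ready file — until the context expires.
func waitForWatcher(ctx context.Context, readyPath string) error {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("%w: %v", ErrWatcherNotStarted, ctx.Err())
		case <-ticker.C:
			if _, err := os.Stat(readyPath); err == nil {
				return nil // watcher is up; only now is it safe to restart the agent
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	// 1. launch the watcher process (elided)
	// 2. wait until it is confirmed up and running
	if err := waitForWatcher(ctx, "/tmp/watcher.ready"); err != nil {
		fmt.Println("not restarting the agent:", err)
		return
	}
	// 3. restart the agent (elided)
}
```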
Why is it important?
We want to avoid a race condition that can occur during a downgrade: after the restart, an older watcher that does not understand the new update marker format could break the whole agent installation.
Checklist
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding changes to the default configuration files
- [ ] I have added an entry in ./changelog/fragments using the changelog tool
- [ ] I have added an integration test or an E2E test

Author's Checklist
How to test this PR locally
Related issues
Use cases
Screenshots
Logs
Questions to ask yourself