Problem

One of the last steps we take in the upgrade process before re-executing the newly installed version is invoking the upgrade watcher:

elastic-agent/internal/pkg/agent/application/upgrade/upgrade.go (lines 209 to 215 in d8f5134)

Similarly, when the agent first starts up after a restart, including the restart for an upgrade, the watcher is also invoked:

elastic-agent/internal/pkg/agent/cmd/run.go (lines 228 to 232 in d8f5134)

The implementation of InvokeWatcher is as follows:

elastic-agent/internal/pkg/agent/application/upgrade/rollback.go (lines 110 to 136 in d8f5134)

To ensure only one version of the watcher is running at a time, we rely on a file lock:

elastic-agent/internal/pkg/agent/cmd/watch.go (lines 79 to 91 in d8f5134)

There is a problem here: our InvokeWatcher function only confirms that the watcher we attempted to launch from the upgrade was assigned a PID. It does not confirm that the version of the watcher launched as part of the upgrade is the one that got the lock.
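To make the race concrete, the pattern looks roughly like the minimal sketch below. It uses the github.com/gofrs/flock package and a made-up lock path purely for illustration; the agent's actual locking helper and lock file are different.

```go
package main

import (
	"fmt"
	"os"

	"github.com/gofrs/flock"
)

func main() {
	// Hypothetical lock path; the real watcher keeps its own lock file
	// under the agent's top-level directory.
	lock := flock.New("/opt/Elastic/Agent/data/watcher.lock")

	// TryLock is non-blocking: only one watcher process wins the lock,
	// regardless of which watcher version was launched first.
	locked, err := lock.TryLock()
	if err != nil {
		fmt.Fprintf(os.Stderr, "acquiring watcher lock: %v\n", err)
		os.Exit(1)
	}
	if !locked {
		// Another watcher (possibly a different version) already holds
		// the lock; this process exits without watching.
		fmt.Println("another watcher is already running, exiting")
		return
	}
	defer lock.Unlock()

	// ... watch the upgraded agent ...
}
```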
If an upgrade is actually a downgrade, which is the case in many of our upgrade tests, this can result in the "old" version of the watcher getting the lock instead of the intended "new" version.
This exact situation has been confirmed as the source of a test failure in #4193 where the structure of the Elastic Agent paths has changed. See the logs captured for the TestStandaloneDowngradeToSpecificSnapshotBuild failure in https://buildkite.com/elastic/elastic-agent/builds/6911#018d8596-1f53-44e7-93ef-857bc1b4ff31. The "old" watcher does not understand the new path structure of the agent and so unconditionally removes both versions of the agent as part of cleanup, unintentionally uninstalling the agent.
In the logs linked above, we can see the current agent launch its watcher with hash b04c7c and PID 24769. The agent then re-execs the version we are downgrading to, which launches its version of the watcher with hash 8dcab2 and PID 24780.
The final logs we see for the running watcher are from this later PID 24780 running and then removing both versions of the agent:
{"log.level":"info","@timestamp":"2024-02-07T22:45:55.401Z","log.origin":{"file.name":"cmd/watch.go","file.line":67},"message":"Upgrade Watcher started","process.pid":24780,"agent.version":"8.13.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:45:55.406Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":66},"message":"Agent watcher started","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:10.407Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":133},"message":"Trying to connect to agent","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:10.408Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":164},"message":"Connected to agent","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:10.411Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":197},"message":"Communicating with PID 24739","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:54.901Z","log.origin":{"file.name":"cmd/watch.go","file.line":176},"message":"Grace period passed, not watching","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-02-07T22:46:54.901Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":184},"message":"Lost connection: failed reading next state: rpc error: code = Canceled desc = context canceled","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:54.906Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":59},"message":"Cleaning up upgrade","hash":"8dcab2","remove_marker":true,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:56.907Z","log.origin":{"file.name":"upgrade/step_mark.go","file.line":156},"message":"Removing marker file","file.path":"/opt/Elastic/Agent/data/.update-marker","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:56.907Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":82},"message":"Removing previous symlink path","file.path":"/opt/Elastic/Agent/elastic-agent.prev","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:56.907Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":97},"message":"Removing hashed data directory","file.path":"/opt/Elastic/Agent/data/elastic-agent-8.13.0-SNAPSHOT-b04c7c","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:57.198Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":97},"message":"Removing hashed data directory","file.path":"/opt/Elastic/Agent/data/elastic-agent-8.13.0-SNAPSHOT-8dcab2","ecs.version":"1.6.0"}
This happens because the agent upgrade used the new path structure elastic-agent-$version-$hash instead of the old path structure elastic-agent-$hash, which fails to match the path check in the "old" watcher, leading to both versions of the agent being deleted:

elastic-agent/internal/pkg/agent/application/upgrade/rollback.go (lines 80 to 90 in d8f5134)
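A simplified illustration of the mismatch, using the directory names from the logs above; the keepDir helper is hypothetical and only mimics the shape of the old cleanup decision, it is not the actual watcher code:

```go
package main

import "fmt"

// keepDir mimics the old watcher's cleanup decision: keep only a
// directory that exactly matches the old elastic-agent-<hash> naming
// for the hash it wants to preserve. Illustrative only.
func keepDir(dirName, keepHash string) bool {
	return dirName == "elastic-agent-"+keepHash
}

func main() {
	dirs := []string{
		"elastic-agent-8.13.0-SNAPSHOT-b04c7c", // new layout, version upgraded from
		"elastic-agent-8.13.0-SNAPSHOT-8dcab2", // new layout, version downgraded to
	}
	for _, d := range dirs {
		// The old watcher wants to keep the 8dcab2 install, but neither
		// directory matches "elastic-agent-8dcab2", so both are removed.
		fmt.Printf("%s -> keep=%v\n", d, keepDir(d, "8dcab2"))
	}
}
```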
The watcher version that ends up running is always non-deterministic, but it only results in a fatal error like this when downgrading to a version before 8.13.0.
Proposed Solution
When the watcher gets the lock it writes the UPG_WATCHING state to the upgrade marker:

elastic-agent/internal/pkg/agent/cmd/watch.go (lines 107 to 110 in d8f5134)

elastic-agent/internal/pkg/agent/cmd/watch.go (lines 238 to 242 in d8f5134)
The code that calls InvokeWatcher as part of the upgrade can poll the upgrade marker for this UPG_WATCHING state update to confirm that the new watcher executed far enough to get the lock.
We must do this polling with a timeout in case the update never occurs; an initial timeout of 30s is proposed.
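A rough sketch of what this wait could look like. The readMarkerState helper and the marker schema are assumptions here; the real implementation would reuse the agent's existing upgrade-marker loading code.

```go
package upgrade

import (
	"context"
	"fmt"
	"time"
)

// markerWatchingState is the state the watcher writes once it holds the lock.
const markerWatchingState = "UPG_WATCHING"

// readMarkerState is a hypothetical helper that loads the upgrade marker
// file and returns the state recorded in it. The real code would call the
// existing marker-loading functions instead.
func readMarkerState(markerPath string) (string, error) {
	// ... parse the marker file and return its state ...
	return "", fmt.Errorf("not implemented in this sketch")
}

// waitForWatcher polls the upgrade marker until it reports UPG_WATCHING,
// or gives up when the context deadline (proposed: 30s) expires.
func waitForWatcher(ctx context.Context, markerPath string, pollInterval time.Duration) error {
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("watcher did not reach %s state: %w", markerWatchingState, ctx.Err())
		case <-ticker.C:
			state, err := readMarkerState(markerPath)
			if err != nil {
				// The marker may not exist yet or may be mid-write; keep polling.
				continue
			}
			if state == markerWatchingState {
				return nil
			}
		}
	}
}
```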
If we hit the timeout, we need to abort the upgrade and roll back to the state we were in before the upgrade started. This requires two things to happen (both steps are sketched below):
1. Killing the watcher PID to ensure it doesn't accidentally roll us back. If we can convince ourselves that leaving the watcher running is fine, we can skip this step.
2. Rolling back the upgrade and putting the symlink back to what it was before. This can be done by calling the rollback install function already in place to do this:

elastic-agent/internal/pkg/agent/application/upgrade/upgrade.go (lines 272 to 275 in d8f5134)
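Continuing the sketch above, the abort path after a timeout might look roughly like this. The rollbackInstall stub stands in for the existing rollback function referenced above, and its signature is assumed rather than the real one:

```go
package upgrade

import (
	"context"
	"os"
	"time"
)

// rollbackInstall stands in for the rollback helper referenced above;
// the real signature may differ.
func rollbackInstall(ctx context.Context) error {
	// ... restore the previous symlink, then remove the new install ...
	return nil
}

// abortUpgrade sketches the timeout handling: if the new watcher never
// reports UPG_WATCHING, kill the watcher we launched and roll back.
func abortUpgrade(watcherProc *os.Process, markerPath string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := waitForWatcher(ctx, markerPath, time.Second); err == nil {
		return nil // the watcher took over as expected; continue the upgrade
	}

	// Step 1: kill the watcher PID so the wrong watcher version cannot
	// roll the agent back (or remove it) later.
	if watcherProc != nil {
		_ = watcherProc.Kill()
	}

	// Step 2: roll back the upgrade and restore the previous symlink.
	return rollbackInstall(context.Background())
}
```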
Note that there is a separate bug in this function: we must rotate the symlink first, and then remove the "new" agent version. If we remove the "new" agent version first and the symlink rotation then fails, or the agent host is restarted, the agent will fail to start up. We need to reverse the order of operations in this function.
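A sketch of the corrected ordering; the paths and helper name are illustrative, not the actual rollback code:

```go
package upgrade

import (
	"os"
	"path/filepath"
)

// restorePreviousInstall illustrates the safe ordering: point the symlink
// back at the previous version first, and only then delete the new
// version's files. If the process dies in between, the symlink still
// targets a working install.
func restorePreviousInstall(topDir, prevVersionedHome, newVersionedHome string) error {
	symlink := filepath.Join(topDir, "elastic-agent")

	// Step 1: rotate the symlink back to the previous install. Creating a
	// temporary link and renaming it over the old one keeps the switch atomic.
	tmpLink := symlink + ".tmp"
	_ = os.Remove(tmpLink)
	if err := os.Symlink(filepath.Join(prevVersionedHome, "elastic-agent"), tmpLink); err != nil {
		return err
	}
	if err := os.Rename(tmpLink, symlink); err != nil {
		return err
	}

	// Step 2: only after the symlink is safe, remove the "new" version.
	return os.RemoveAll(newVersionedHome)
}
```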
Implemented correct order of install rollback and added wait for watcher to set WATCHING state in update marker.
Fixed by merging #4229 and #4193 on main.