
The version of the watcher that gets the watcher lock is not deterministic. #4228

Closed
cmacknz opened this issue Feb 8, 2024 · 2 comments · Fixed by #4193 or #4229
cmacknz commented Feb 8, 2024

## Problem

One of the last steps we take in the upgrade process before re-executing the newly installed version is invoking the upgrade watcher:

```go
if err := InvokeWatcher(u.log); err != nil {
	u.log.Errorw("Rolling back: starting watcher failed", "error.message", err)
	rollbackInstall(ctx, u.log, newHash)
	return nil, err
}
cb := shutdownCallback(u.log, paths.Home(), release.Version(), version, release.TrimCommit(newHash))
```

Similarly, when the agent first starts up after a restart, including the restart for an upgrade, the watcher is also invoked:

```go
// initiate agent watcher
if err := upgrade.InvokeWatcher(l); err != nil {
	// we should not fail because watcher is not working
	l.Error(errors.New(err, "failed to invoke rollback watcher"))
}
```

The implementation of InvokeWatcher is as follows:

```go
// InvokeWatcher invokes an agent instance using watcher argument for watching behavior of
// agent during upgrade period.
func InvokeWatcher(log *logger.Logger) error {
	if !IsUpgradeable() {
		log.Info("agent is not upgradable, not starting watcher")
		return nil
	}
	cmd := invokeCmd()
	defer func() {
		if cmd.Process != nil {
			log.Infof("releasing watcher %v", cmd.Process.Pid)
			_ = cmd.Process.Release()
		}
	}()
	log.Infow("Starting upgrade watcher", "path", cmd.Path, "args", cmd.Args, "env", cmd.Env, "dir", cmd.Dir)
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("failed to start Upgrade Watcher: %w", err)
	}
	upgradeWatcherPID := cmd.Process.Pid
	agentPID := os.Getpid()
	log.Infow("Upgrade Watcher invoked", "agent.upgrade.watcher.process.pid", upgradeWatcherPID, "agent.process.pid", agentPID)
	return nil
}
```

To ensure only one version of the watcher is running at a time, we rely on a file lock:

```go
locker := filelock.NewAppLocker(paths.Top(), watcherLockFile)
if err := locker.TryLock(); err != nil {
	if errors.Is(err, filelock.ErrAppAlreadyRunning) {
		log.Info("exiting, lock already exists")
		return nil
	}
	log.Error("failed to acquire lock", err)
	return err
}
defer func() {
	_ = locker.Unlock()
}()
```

The problem is that InvokeWatcher only confirms that the watcher process we attempted to launch was assigned a PID. It does not confirm that the version of the watcher launched as part of the upgrade is the one that acquired the lock.

If an upgrade is actually a downgrade, which is the case in many of our upgrade tests, this can result in the "old" version of the watcher getting the lock instead of the intended "new" version.

This exact situation has been confirmed as the source of a test failure in #4193 where the structure of the Elastic Agent paths has changed. See the logs captured for the TestStandaloneDowngradeToSpecificSnapshotBuild failure in https://buildkite.com/elastic/elastic-agent/builds/6911#018d8596-1f53-44e7-93ef-857bc1b4ff31. The "old" watcher does not understand the new path structure of the agent and so unconditionally removes both versions of the agent as part of cleanup, unintentionally uninstalling the agent.

In the logs linked above, we can see the current agent launch its watcher with hash b04c7c and PID 24769:

```json
{"log.level":"info","@timestamp":"2024-02-07T22:45:54.896Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":155},"message":"Starting upgrade watcher","log":{"source":"elastic-agent"},"path":"/opt/Elastic/Agent/data/elastic-agent-8.13.0-SNAPSHOT-b04c7c/elastic-agent","args":["/opt/Elastic/Agent/data/elastic-agent-8.13.0-SNAPSHOT-b04c7c/elastic-agent","watch","--path.config","/opt/Elastic/Agent","--path.home","/opt/Elastic/Agent"],"env":[],"dir":"","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:45:54.896Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":522},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.13.0-SNAPSHOT","state":"UPG_REPLACING","metadata":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:45:54.896Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":522},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.13.0-SNAPSHOT","state":"UPG_REPLACING","metadata":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:45:54.905Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":162},"message":"Upgrade Watcher invoked","log":{"source":"elastic-agent"},"agent.upgrade.watcher.process.pid":24769,"agent.process.pid":24739,"ecs.version":"1.6.0"}
```

The agent then re-execs the version we are downgrading to, which launches its version of the watcher with hash 8dcab2 and PID 24780.

```json
{"log.level":"info","@timestamp":"2024-02-07T22:45:55.105Z","log.origin":{"file.name":"cmd/run.go","file.line":163},"message":"Elastic Agent started","log":{"source":"elastic-agent"},"process.pid":24739,"agent.version":"8.13.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:45:55.323Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":126},"message":"Starting upgrade watcher","log":{"source":"elastic-agent"},"path":"/opt/Elastic/Agent/elastic-agent","args":["/opt/Elastic/Agent/elastic-agent","watch","--path.config","/opt/Elastic/Agent","--path.home","/opt/Elastic/Agent"],"env":[],"dir":"","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:45:55.324Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":133},"message":"Upgrade Watcher invoked","log":{"source":"elastic-agent"},"agent.upgrade.watcher.process.pid":24780,"agent.process.pid":24739,"ecs.version":"1.6.0"}
```

The final logs we see for the running watcher are from this later PID 24780 running and then removing both versions of the agent:

```json
{"log.level":"info","@timestamp":"2024-02-07T22:45:55.401Z","log.origin":{"file.name":"cmd/watch.go","file.line":67},"message":"Upgrade Watcher started","process.pid":24780,"agent.version":"8.13.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:45:55.406Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":66},"message":"Agent watcher started","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:10.407Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":133},"message":"Trying to connect to agent","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:10.408Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":164},"message":"Connected to agent","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:10.411Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":197},"message":"Communicating with PID 24739","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:54.901Z","log.origin":{"file.name":"cmd/watch.go","file.line":176},"message":"Grace period passed, not watching","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-02-07T22:46:54.901Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":184},"message":"Lost connection: failed reading next state: rpc error: code = Canceled desc = context canceled","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:54.906Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":59},"message":"Cleaning up upgrade","hash":"8dcab2","remove_marker":true,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:56.907Z","log.origin":{"file.name":"upgrade/step_mark.go","file.line":156},"message":"Removing marker file","file.path":"/opt/Elastic/Agent/data/.update-marker","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:56.907Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":82},"message":"Removing previous symlink path","file.path":"/opt/Elastic/Agent/elastic-agent.prev","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:56.907Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":97},"message":"Removing hashed data directory","file.path":"/opt/Elastic/Agent/data/elastic-agent-8.13.0-SNAPSHOT-b04c7c","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:57.198Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":97},"message":"Removing hashed data directory","file.path":"/opt/Elastic/Agent/data/elastic-agent-8.13.0-SNAPSHOT-8dcab2","ecs.version":"1.6.0"}
```

This happens because the upgraded agent was installed using the new path structure elastic-agent-$version-$hash rather than the old path structure elastic-agent-$hash. The upgraded agent's directory therefore fails the path check in the "old" watcher, and both versions of the agent are deleted.

```go
// remove symlink to avoid upgrade failures, ignore error
prevSymlink := prevSymlinkPath()
log.Infow("Removing previous symlink path", "file.path", prevSymlinkPath())
_ = os.Remove(prevSymlink)
dirPrefix := fmt.Sprintf("%s-", agentName)
currentDir := fmt.Sprintf("%s-%s", agentName, currentHash)
for _, dir := range subdirs {
	if dir == currentDir {
		continue
	}
```
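To see why the check can never succeed for the new layout, here is a minimal sketch; `isCurrentDir` is a hypothetical stand-in for the `dir == currentDir` comparison above:

```go
package main

import "fmt"

// isCurrentDir reproduces the "old" watcher's cleanup check: a data directory
// is kept only if its name is exactly "elastic-agent-<hash>". Directories
// created with the newer "elastic-agent-<version>-<hash>" layout can never
// match, so the old watcher removes them during cleanup.
func isCurrentDir(dir, agentName, currentHash string) bool {
	return dir == fmt.Sprintf("%s-%s", agentName, currentHash)
}

func main() {
	// Old layout: the name matches and the directory is kept.
	fmt.Println(isCurrentDir("elastic-agent-8dcab2", "elastic-agent", "8dcab2"))
	// New layout: the version segment breaks the equality check,
	// so cleanup deletes the directory.
	fmt.Println(isCurrentDir("elastic-agent-8.13.0-SNAPSHOT-8dcab2", "elastic-agent", "8dcab2"))
}
```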

The race over which watcher version starts is always present, but it only results in a fatal error like this when downgrading to a version before 8.13.0.

## Proposed Solution

When the watcher gets the lock it writes the UPG_WATCHING state to the upgrade marker:

```go
// About to start watching the upgrade. Initialize upgrade details and save them in the
// upgrade marker.
upgradeDetails := initUpgradeDetails(marker, upgrade.SaveMarker, log)
```

```go
func initUpgradeDetails(marker *upgrade.UpdateMarker, saveMarker func(*upgrade.UpdateMarker, bool) error, log *logp.Logger) *details.Details {
	upgradeDetails := details.NewDetails(version.GetAgentPackageVersion(), details.StateWatching, marker.GetActionID())
	upgradeDetails.RegisterObserver(func(details *details.Details) {
		marker.Details = details
		if err := saveMarker(marker, true); err != nil {
```

The code that calls InvokeWatcher as part of the upgrade can poll the upgrade marker to watch for this UPG_WATCHING state update to confirm it executed far enough to get the lock.

This polling must run with a timeout in case the update never occurs; an initial timeout of 30s is proposed.

If we hit the timeout, we need to abort the upgrade and roll back to the state we were in before the upgrade started. This requires two things to happen:

  1. Killing the watcher PID to ensure it doesn't accidentally roll us back. If we can convince ourselves leaving the watcher running is fine we can skip this step.
  2. Rolling back the upgrade and putting the symlink back to what it was before. This can be done by calling the rollback install function already in place to do this:

     ```go
     func rollbackInstall(ctx context.Context, log *logger.Logger, hash string) {
     	os.RemoveAll(filepath.Join(paths.Data(), fmt.Sprintf("%s-%s", agentName, hash)))
     	_ = ChangeSymlink(ctx, log, release.ShortCommit())
     }
     ```

Note that there is a separate bug in this function: we must rotate the symlink first, and only then remove the "new" agent version. If we remove the "new" agent version first and the symlink rotation then fails, or the agent host is restarted, the agent will fail to start up. The order of operations in this function needs to be reversed.

@cmacknz cmacknz added the Team:Elastic-Agent Label for the Agent team label Feb 8, 2024
@elasticmachine

Pinging @elastic/elastic-agent (Team:Elastic-Agent)


pchila commented Feb 13, 2024

Implemented the correct order of install rollback and added a wait for the watcher to set the WATCHING state in the update marker.
Fixed by merging #4229 and #4193 on main.

@pchila pchila closed this as completed Feb 13, 2024