Problem

One of the last steps we take in the upgrade process before re-executing the newly installed version is invoking the upgrade watcher:

elastic-agent/internal/pkg/agent/application/upgrade/upgrade.go (lines 209 to 215 in d8f5134)

Similarly, when the agent first starts up after a restart, including the restart for an upgrade, the watcher is also invoked:

elastic-agent/internal/pkg/agent/cmd/run.go (lines 228 to 232 in d8f5134)

The implementation of InvokeWatcher is as follows:

elastic-agent/internal/pkg/agent/application/upgrade/rollback.go (lines 110 to 136 in d8f5134)

To ensure only one version of the watcher is running at a time, we rely on a file lock:

elastic-agent/internal/pkg/agent/cmd/watch.go (lines 79 to 91 in d8f5134)

There is a problem here: our InvokeWatcher function only confirms that the watcher we attempted to launch from the upgrade was assigned a PID. It does not confirm that the version of the watcher launched as part of the upgrade is the one that got the lock.
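To make the race concrete, the pattern looks roughly like the minimal sketch below. It uses the github.com/gofrs/flock package and a made-up lock path purely for illustration; the agent's actual locking helper and lock file are different.

```go
package main

import (
	"fmt"
	"os"

	"github.com/gofrs/flock"
)

func main() {
	// Hypothetical lock path; the real watcher keeps its own lock file
	// under the agent's top-level directory.
	lock := flock.New("/opt/Elastic/Agent/data/watcher.lock")

	// TryLock is non-blocking: only one watcher process wins the lock,
	// regardless of which watcher version was launched first.
	locked, err := lock.TryLock()
	if err != nil {
		fmt.Fprintf(os.Stderr, "acquiring watcher lock: %v\n", err)
		os.Exit(1)
	}
	if !locked {
		// Another watcher (possibly a different version) already holds
		// the lock; this process exits without watching.
		fmt.Println("another watcher is already running, exiting")
		return
	}
	defer lock.Unlock()

	// ... watch the upgraded agent ...
}
```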
If an upgrade is actually a downgrade, which is the case in many of our upgrade tests, this can result in the "old" version of the watcher getting the lock instead of the intended "new" version.
This exact situation has been confirmed as the source of a test failure in #4193 where the structure of the Elastic Agent paths has changed. See the logs captured for the TestStandaloneDowngradeToSpecificSnapshotBuild failure in https://buildkite.com/elastic/elastic-agent/builds/6911#018d8596-1f53-44e7-93ef-857bc1b4ff31. The "old" watcher does not understand the new path structure of the agent and so unconditionally removes both versions of the agent as part of cleanup, unintentionally uninstalling the agent.
In the logs linked above, we can see the current agent launch its watcher with hash b04c7c and PID 24769. The agent then re-execs the version we are downgrading to, which launches its version of the watcher with hash 8dcab2 and PID 24780.
The final logs we see for the running watcher are from this later PID 24780 running and then removing both versions of the agent:
{"log.level":"info","@timestamp":"2024-02-07T22:45:55.401Z","log.origin":{"file.name":"cmd/watch.go","file.line":67},"message":"Upgrade Watcher started","process.pid":24780,"agent.version":"8.13.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:45:55.406Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":66},"message":"Agent watcher started","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:10.407Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":133},"message":"Trying to connect to agent","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:10.408Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":164},"message":"Connected to agent","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:10.411Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":197},"message":"Communicating with PID 24739","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:54.901Z","log.origin":{"file.name":"cmd/watch.go","file.line":176},"message":"Grace period passed, not watching","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-02-07T22:46:54.901Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":184},"message":"Lost connection: failed reading next state: rpc error: code = Canceled desc = context canceled","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:54.906Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":59},"message":"Cleaning up upgrade","hash":"8dcab2","remove_marker":true,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:56.907Z","log.origin":{"file.name":"upgrade/step_mark.go","file.line":156},"message":"Removing marker file","file.path":"/opt/Elastic/Agent/data/.update-marker","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:56.907Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":82},"message":"Removing previous symlink path","file.path":"/opt/Elastic/Agent/elastic-agent.prev","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:56.907Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":97},"message":"Removing hashed data directory","file.path":"/opt/Elastic/Agent/data/elastic-agent-8.13.0-SNAPSHOT-b04c7c","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-02-07T22:46:57.198Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":97},"message":"Removing hashed data directory","file.path":"/opt/Elastic/Agent/data/elastic-agent-8.13.0-SNAPSHOT-8dcab2","ecs.version":"1.6.0"}
This happens because the agent upgrade used the new path structure elastic-agent-$version-$hash instead of the old path structure elastic-agent-$hash, which fails to match the path check in the "old" watcher, leading to both versions of the agent being deleted:

elastic-agent/internal/pkg/agent/application/upgrade/rollback.go (lines 80 to 90 in d8f5134)
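A simplified illustration of the mismatch, using the directory names from the logs above; the keepDir helper is hypothetical and only mimics the shape of the old cleanup decision, it is not the actual watcher code:

```go
package main

import "fmt"

// keepDir mimics the old watcher's cleanup decision: keep only a
// directory that exactly matches the old elastic-agent-<hash> naming
// for the hash it wants to preserve. Illustrative only.
func keepDir(dirName, keepHash string) bool {
	return dirName == "elastic-agent-"+keepHash
}

func main() {
	dirs := []string{
		"elastic-agent-8.13.0-SNAPSHOT-b04c7c", // new layout, version upgraded from
		"elastic-agent-8.13.0-SNAPSHOT-8dcab2", // new layout, version downgraded to
	}
	for _, d := range dirs {
		// The old watcher wants to keep the 8dcab2 install, but neither
		// directory matches "elastic-agent-8dcab2", so both are removed.
		fmt.Printf("%s -> keep=%v\n", d, keepDir(d, "8dcab2"))
	}
}
```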
The watcher version that ends up running is always non-deterministic, but it only results in a fatal error like this when downgrading to a version before 8.13.0.
Proposed Solution
When the watcher gets the lock it writes the UPG_WATCHING state to the upgrade marker:

elastic-agent/internal/pkg/agent/cmd/watch.go (lines 107 to 110 in d8f5134)

elastic-agent/internal/pkg/agent/cmd/watch.go (lines 238 to 242 in d8f5134)
The code that calls InvokeWatcher as part of the upgrade can poll the upgrade marker for this UPG_WATCHING state update to confirm that the new watcher executed far enough to get the lock.
We must do this polling with a timeout in case the update never occurs; an initial timeout of 30s is proposed.
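A rough sketch of what this wait could look like. The readMarkerState helper and the marker schema are assumptions here; the real implementation would reuse the agent's existing upgrade-marker loading code.

```go
package upgrade

import (
	"context"
	"fmt"
	"time"
)

// markerWatchingState is the state the watcher writes once it holds the lock.
const markerWatchingState = "UPG_WATCHING"

// readMarkerState is a hypothetical helper that loads the upgrade marker
// file and returns the state recorded in it. The real code would call the
// existing marker-loading functions instead.
func readMarkerState(markerPath string) (string, error) {
	// ... parse the marker file and return its state ...
	return "", fmt.Errorf("not implemented in this sketch")
}

// waitForWatcher polls the upgrade marker until it reports UPG_WATCHING,
// or gives up when the context deadline (proposed: 30s) expires.
func waitForWatcher(ctx context.Context, markerPath string, pollInterval time.Duration) error {
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("watcher did not reach %s state: %w", markerWatchingState, ctx.Err())
		case <-ticker.C:
			state, err := readMarkerState(markerPath)
			if err != nil {
				// The marker may not exist yet or may be mid-write; keep polling.
				continue
			}
			if state == markerWatchingState {
				return nil
			}
		}
	}
}
```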
If we hit the timeout, we need to abort the upgrade and roll back to the state we were in before the upgrade started. This requires two things to happen (both steps are sketched below):
1. Killing the watcher PID to ensure it doesn't accidentally roll us back. If we can convince ourselves that leaving the watcher running is fine, we can skip this step.
2. Rolling back the upgrade and putting the symlink back to what it was before. This can be done by calling the rollback install function already in place to do this:

elastic-agent/internal/pkg/agent/application/upgrade/upgrade.go (lines 272 to 275 in d8f5134)
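Continuing the sketch above, the abort path after a timeout might look roughly like this. The rollbackInstall stub stands in for the existing rollback function referenced above, and its signature is assumed rather than the real one:

```go
package upgrade

import (
	"context"
	"os"
	"time"
)

// rollbackInstall stands in for the rollback helper referenced above;
// the real signature may differ.
func rollbackInstall(ctx context.Context) error {
	// ... restore the previous symlink, then remove the new install ...
	return nil
}

// abortUpgrade sketches the timeout handling: if the new watcher never
// reports UPG_WATCHING, kill the watcher we launched and roll back.
func abortUpgrade(watcherProc *os.Process, markerPath string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := waitForWatcher(ctx, markerPath, time.Second); err == nil {
		return nil // the watcher took over as expected; continue the upgrade
	}

	// Step 1: kill the watcher PID so the wrong watcher version cannot
	// roll the agent back (or remove it) later.
	if watcherProc != nil {
		_ = watcherProc.Kill()
	}

	// Step 2: roll back the upgrade and restore the previous symlink.
	return rollbackInstall(context.Background())
}
```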
Note that there is a separate bug in this function: we must rotate the symlink first, and then remove the "new" agent version. If we remove the "new" agent version first and the symlink rotation then fails, or the agent host is restarted, the agent will fail to start up. We need to reverse the order of operations in this function.
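A sketch of the corrected ordering; the paths and helper name are illustrative, not the actual rollback code:

```go
package upgrade

import (
	"os"
	"path/filepath"
)

// restorePreviousInstall illustrates the safe ordering: point the symlink
// back at the previous version first, and only then delete the new
// version's files. If the process dies in between, the symlink still
// targets a working install.
func restorePreviousInstall(topDir, prevVersionedHome, newVersionedHome string) error {
	symlink := filepath.Join(topDir, "elastic-agent")

	// Step 1: rotate the symlink back to the previous install. Creating a
	// temporary link and renaming it over the old one keeps the switch atomic.
	tmpLink := symlink + ".tmp"
	_ = os.Remove(tmpLink)
	if err := os.Symlink(filepath.Join(prevVersionedHome, "elastic-agent"), tmpLink); err != nil {
		return err
	}
	if err := os.Rename(tmpLink, symlink); err != nil {
		return err
	}

	// Step 2: only after the symlink is safe, remove the "new" version.
	return os.RemoveAll(newVersionedHome)
}
```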
Implemented correct order of install rollback and added wait for watcher to set WATCHING state in update marker.
Fixed by merging #4229 and #4193 on main.