Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Flaky Test]: TestContainerCMD – TempDir RemoveAll cleanup: unlinkat .../data: directory not empty #4246

Closed
pchila opened this issue Feb 12, 2024 · 9 comments · Fixed by #4695
Assignees
Labels
flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team

Comments

@pchila
Copy link
Member

pchila commented Feb 12, 2024

Failing test case

TestContainerCMD

Error message

One of the directories created with t.TempDir() cannot be cleaned up

Build

https://buildkite.com/elastic/elastic-agent/builds/7054#018d9e30-e528-45a6-b191-402d9e7a6b89

OS

Linux

Stacktrace and notes

=== RUN   TestContainerCMD
    container_cmd_test.go:70: Creating enrollment API key...
    fixture.go:262: Extracting artifact elastic-agent-8.13.0-SNAPSHOT-linux-x86_64.tar.gz to /tmp/TestContainerCMD148746147/001
    fixture.go:275: Completed extraction of artifact elastic-agent-8.13.0-SNAPSHOT-linux-x86_64.tar.gz to /tmp/TestContainerCMD148746147/001
    fixture.go:865: Components were not modified from the fetched artifact
    container_cmd_test.go:113: >> running binary with: [/tmp/TestContainerCMD148746147/001/elastic-agent-8.13.0-SNAPSHOT-linux-x86_64/elastic-agent container]
    fixture.go:632: >> running binary with: [/tmp/TestContainerCMD148746147/001/elastic-agent-8.13.0-SNAPSHOT-linux-x86_64/elastic-agent status --output json]
    fixture.go:632: >> running binary with: [/tmp/TestContainerCMD148746147/001/elastic-agent-8.13.0-SNAPSHOT-linux-x86_64/elastic-agent status --output json]
    fixture.go:632: >> running binary with: [/tmp/TestContainerCMD148746147/001/elastic-agent-8.13.0-SNAPSHOT-linux-x86_64/elastic-agent status --output json]
    fixture.go:632: >> running binary with: [/tmp/TestContainerCMD148746147/001/elastic-agent-8.13.0-SNAPSHOT-linux-x86_64/elastic-agent status --output json]
    fixture.go:632: >> running binary with: [/tmp/TestContainerCMD148746147/001/elastic-agent-8.13.0-SNAPSHOT-linux-x86_64/elastic-agent status --output json]
    container_cmd_test.go:90: >> cleaning up: killing the Elastic-Agent process
    testing.go:1225: TempDir RemoveAll cleanup: unlinkat /tmp/TestContainerCMD148746147/001/elastic-agent-8.13.0-SNAPSHOT-linux-x86_64/data: directory not empty
--- FAIL: TestContainerCMD (28.62s)

It happened once only on the specific build, still logging it here for tracking
@pchila pchila added Team:Elastic-Agent Label for the Agent team flaky-test Unstable or unreliable test cases. labels Feb 12, 2024
@pchila pchila self-assigned this Feb 12, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@AndersonQ
Copy link
Member

AndersonQ commented Mar 4, 2024

It happened once only on the specific build, still logging it here for tracking

it happened on my PR as well: https://buildkite.com/elastic/elastic-agent/builds/7454#018dd49c-2570-483c-b055-2a64a7a20ada

@rdner
Copy link
Member

rdner commented Mar 22, 2024

@rdner
Copy link
Member

rdner commented May 6, 2024

@cmacknz
Copy link
Member

cmacknz commented May 6, 2024

CC @belimawr

The test cleanup is in cmd.Process.Kill?

t.Cleanup(func() {
if cmd.Process != nil {
t.Log(">> cleaning up: killing the Elastic-Agent process")
if err := cmd.Process.Kill(); err != nil {
t.Fatalf("could not kill Elastic-Agent process: %s", err)
}
return
}
t.Log(">> cleaning up: no process to kill")
})

The command is started via

cmd := exec.CommandContext(ctx, f.binaryPath(), args...)
for _, o := range opts {
if err := o(cmd); err != nil {
return nil, fmt.Errorf("error adding opts to Exec: %w", err)
}
}

Cancelling the context should kill the process and let us stop using the cleanup function, we could also gracefully call Stop before the test ends.

I do think the actual problem is that something is still accessing the STATE_PATH at the time when we try to remove the temporary directory, so removing use of Cleanup might just change the timing. The work dir is created with TempDir here

workDir := f.t.TempDir()

See golang/go#43547

Possibly the real fix is to poll for the elastic-agent container command to have actually stopped when we kill it.

@belimawr
Copy link
Contributor

belimawr commented May 7, 2024

@cmacknz

The test cleanup is in cmd.Process.Kill?

What do you mean?

I call cmd.Process.Kill in t.Cleanup to ensure the Elastic-Agent has exited before the test ends. We had multiple issues Elastic-Agent process left behind and affecting other tests, so I wanted to be sure we were not leaking any Elastic-Agent process.

Possibly the real fix is to poll for the elastic-agent container command to have actually stopped when we kill it.

I agree. You mean in the Cleanup function, after we call cmd.Process.Kill() we poll to make sure the process has actually exited, is that it?

I checked the docs for os.Process and it clearly states that Kill does not wait for the process to exit 🤦‍♂️

Kill causes the Process to exit immediately. Kill does not wait until the Process has actually exited. This only kills the Process itself, not any other processes it may have started.

It seems we have to call Wait right after calling Kill.

@cmacknz
Copy link
Member

cmacknz commented May 7, 2024

What do you mean?

I mean why do we call cmd.Process.Kill when cancelling the context does the same thing, but apparently just wrote a bunch of nonsense instead :p

It looks like the cancel() in CommandContext doesn't Wait either, so you nailed it we need to call Wait() after calling Start() here.

@belimawr
Copy link
Contributor

belimawr commented May 7, 2024

I'l make a quick PR to fix that.

@belimawr
Copy link
Contributor

belimawr commented May 7, 2024

Here is the fix: #4695

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants