
Add configurable numbness for component restarts #2003

Merged: 14 commits into elastic:main on Dec 29, 2022

Conversation

@michalpristas (Contributor) commented Dec 23, 2022

What does this PR do?

This PR reuses the token bucket implementation from v1 so I don't introduce a new component that may turn out to be buggy. The token bucket leases N tokens per period; when a caller cannot lease a token, I treat that as a sign that too many restarts have happened and report a failure.

Tested locally; behavior is correct for frequent restarts, no restarts, and restarts within limits.

For spec files of processes that have restart_on_output_change set, this PR sets:

      restart_monitoring_period: 15s
      maximum_restarts_per_period: 3

@cmacknz I took this from your example; personally I think this is too lenient and we should be stricter.
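
For illustration, here is a minimal channel-based token bucket in the spirit of that approach (a sketch only; the names and details do not match the actual v1 code):

package tokenbucket // illustrative sketch, not the v1 implementation

import "time"

// Bucket holds up to size tokens; a background ticker frees one slot
// every dropRate, so roughly size restarts are tolerated per period.
type Bucket struct {
	tokens chan struct{}
}

func New(size int, dropRate time.Duration) *Bucket {
	b := &Bucket{tokens: make(chan struct{}, size)}
	go func() {
		for range time.Tick(dropRate) {
			select {
			case <-b.tokens: // free one slot
			default: // bucket already empty
			}
		}
	}()
	return b
}

// Add tries to lease a token. A false return means the bucket is full:
// too many restarts happened within the period, so the failure should
// be reported.
func (b *Bucket) Add() bool {
	select {
	case b.tokens <- struct{}{}:
		return true
	default:
		return false
	}
}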

Why is it important?

It avoids an Unhealthy status for expected behavior: today the agent turns unhealthy when a component restarts to apply an output change.

Checklist

  • My code follows the style guidelines of this project

  • I have commented my code, particularly in hard-to-understand areas

  • I have made corresponding changes to the documentation

  • I have made corresponding changes to the default configuration files

  • I have added tests that prove my fix is effective or that my feature works

  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

  • Closes Add specification file option to ignore unexpected process exits  #1993

@michalpristas added labels: bug (Something isn't working), Team:Elastic-Agent-Control-Plane, Team:Elastic-Agent, skip-changelog, backport-v8.6.0 (automated backport with mergify) on Dec 23, 2022
@michalpristas michalpristas self-assigned this Dec 23, 2022
@michalpristas michalpristas requested a review from a team as a code owner December 23, 2022 16:47
@michalpristas michalpristas requested review from AndersonQ and blakerouse and removed request for a team December 23, 2022 16:47
// or more contributor license agreements. Licensed under the Elastic License;
// you may not use this file except in compliance with the Elastic License.

package tokenbucket
Member:

Could we use a preexisting implementation? https://pkg.go.dev/golang.org/x/time/rate perhaps?

https://pkg.go.dev/golang.org/x/time/rate#Limiter specifically is a token bucket rate limiter.

Disclaimer that I've never used this so it might be unsuitable.
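
For example, a minimal sketch (not the PR's code) of modeling the restart budget with rate.Limiter; the monitoring period and restart count map onto the limiter's refill rate and burst size:

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	period := 15 * time.Second // restart_monitoring_period
	maxRestarts := 3           // maximum_restarts_per_period

	// Burst of maxRestarts tokens, refilling one every period/maxRestarts.
	limiter := rate.NewLimiter(rate.Every(period/time.Duration(maxRestarts)), maxRestarts)

	for i := 0; i < 5; i++ {
		if limiter.Allow() {
			fmt.Printf("restart %d: within budget, suppress FAILED\n", i)
		} else {
			fmt.Printf("restart %d: budget exhausted, report failure\n", i)
		}
	}
}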

Contributor (Author):

I don't have experience with that package either, but it seems a bit like overkill.

Member:

The API is more complex, but the advantage is that we don't have to maintain it, and since it is a golang.org/x package we can trust it to be generally correct and kept up to date.

Member:

I think it is worth trying to replace your implementation with this one; I suspect the maintenance cost will be lower over time, especially if there's any chance we start adding features to this tokenbucket package in the future.

@elasticmachine (Contributor) commented Dec 23, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-12-29T07:55:17.131+0000

  • Duration: 16 min 57 sec

Test stats 🧪

Test Results
  Failed:  0
  Passed:  4777
  Skipped: 13
  Total:   4790

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@michalpristas (Contributor, Author)

Will fix the unit tests on Tuesday.

reportFailure := true
if c.restartBucket != nil {
	// report failure only if the bucket is full of restart events
	reportFailure = tryAddToBucket(c.restartBucket)
}
Member:

I think we may want to log when we are suppressing the failed state as well so we can observe this working properly.

Member:

This would also let us notice when a restart we should not have suppressed gets suppressed.

@cmacknz (Member) commented Dec 23, 2022

The logs are reasonably verbose when this restart happens:

#1913 (comment)

I will be curious to see what the logs and state transitions look like when we suppress the transition to the failed state.

@elasticmachine (Contributor) commented Dec 27, 2022

🌐 Coverage report

Name          Metrics % (covered/total)   Diff
Packages      98.305%  (58/59)            👍
Files         69.268%  (142/205)          👍
Classes       69.231%  (270/390)          👍
Methods       54.158%  (827/1527)         👍 0.03
Lines         39.305%  (8989/22870)       👍 0.05
Conditionals  100.0%   (0/0)              💚

go.mod (outdated)
@@ -50,6 +50,7 @@ require (
 	golang.org/x/lint v0.0.0-20210508222113-6edffad5e616
 	golang.org/x/sync v0.0.0-20210220032951-036812b2e83c
 	golang.org/x/sys v0.0.0-20220715151400-c0bba94af5f8
+	golang.org/x/time v0.0.0-20210723032227-1f47c861a9ac
@cmacknz (Member) commented Dec 28, 2022

This commit is behind the head of the package (https://cs.opensource.google/go/x/time;bpv=1): it's from 2021-07-22 23:22, while the latest is from 2022-11-16 10:19.

Pin this to v0.3.0, which is also the latest commit in the package.
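
With standard Go module tooling, the pin would look something like:

go get golang.org/x/time@v0.3.0
go mod tidy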


for name, tc := range testCases {
	t.Run(name, func(t *testing.T) {
		dropRate := 500 * time.Millisecond
Member:

Consider using shorter intervals, this entire test suite takes 6s to run which seems excessive:

ok      github.com/elastic/elastic-agent/pkg/component/runtime  6.292s

Dividing everything by 10 works (so does 100 but maybe more chance of test flakiness in CI with some 100 microsecond intervals):

diff --git a/pkg/component/runtime/command_test.go b/pkg/component/runtime/command_test.go
index ef361b33b..ca4ee184c 100644
--- a/pkg/component/runtime/command_test.go
+++ b/pkg/component/runtime/command_test.go
@@ -18,22 +18,22 @@ func TestAddToBucket(t *testing.T) {
                addSleep    time.Duration
                shouldBlock bool
        }{
-               "no error":           {1, 0, 10 * time.Millisecond, false},
-               "error within limit": {1, 1, 10 * time.Millisecond, false},
-               "errors > than limit but across timespans":                {1, 2, 800 * time.Millisecond, false},
-               "errors > than limit within timespans, exact bucket size": {2, 2, 20 * time.Millisecond, false},
-               "errors > than limit within timespans, off by one":        {2, 3, 20 * time.Millisecond, true},
-               "errors > than limit within timespans":                    {2, 4, 20 * time.Millisecond, true},
+               "no error":           {1, 0, 1 * time.Millisecond, false},
+               "error within limit": {1, 1, 1 * time.Millisecond, false},
+               "errors > than limit but across timespans":                {1, 2, 80 * time.Millisecond, false},
+               "errors > than limit within timespans, exact bucket size": {2, 2, 2 * time.Millisecond, false},
+               "errors > than limit within timespans, off by one":        {2, 3, 2 * time.Millisecond, true},
+               "errors > than limit within timespans":                    {2, 4, 2 * time.Millisecond, true},
        }

        for name, tc := range testCases {
                t.Run(name, func(t *testing.T) {
-                       dropRate := 500 * time.Millisecond
+                       dropRate := 50 * time.Millisecond
                        b := newRateLimiter(dropRate, tc.bucketSize)

                        blocked := false
                        b.Allow()
-                       <-time.After(dropRate + 200*time.Millisecond) // init ticker
+                       <-time.After(dropRate + 20*time.Millisecond) // init ticker

                        for i := 0; i < tc.add; i++ {
                                wasBlocked := !b.Allow()

This gives:

ok      github.com/elastic/elastic-agent/pkg/component/runtime  0.909s

@cmacknz (Member) left a review:

Looks good. I would recommend at least updating the version of the rate package before merging, though.

Comment on lines +21 to +22:

      restart_monitoring_period: 5s
      maximum_restarts_per_period: 1
Member:

I've noticed there is a restart period we can configure that controls how long the process stays in the failed state before it is restarted. We should lower this for Beats that expect to restart on their own.

Defined here, defaults to 10s:

// CommandTimeoutSpec is the timeout specification for subprocess.
type CommandTimeoutSpec struct {
	Checkin time.Duration `config:"checkin" yaml:"checkin"`
	Restart time.Duration `config:"restart" yaml:"restart"`
	Stop    time.Duration `config:"stop" yaml:"stop"`
}

Used here:

case ps := <-c.procCh:
	// ignores old processes
	if ps.proc == c.proc {
		c.proc = nil
		if c.handleProc(ps.state) {
			// start again after restart period
			t.Reset(restartPeriod)
		}
	}

I would suggest lowering this to 1s (or maybe even lower) for all of the Beats, otherwise there is a 10s period where there is no monitoring.
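
A hypothetical spec excerpt (the placement of the timeouts block is an assumption based on the struct tags above, not verified against the spec schema):

command:
  timeouts:
    # assumption: restart the process ~1s after an expected exit
    # instead of the 10s default
    restart: 1s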

Member:

In the original implementation it looks like the restart on output change was immediate: https://github.com/elastic/beats/pull/24907/files#diff-4ef060299d18c75dcda518842dfc08e553e892a4fc1e9b2ff4e4a03640be3fff

I would make this restart period as low as possible to minimize the time where the Beat isn't actually running when it should be running.


if reportFailure {
	stopMsg := fmt.Sprintf("Failed: pid '%d' exited with code '%d'", state.Pid(), state.ExitCode())
	c.forceCompState(client.UnitStateFailed, stopMsg)
Member:

Unrelated to this PR, but more of an FYI. When the process restarts I don't think the agent clears any of the previous component state.

This causes the agent to send an initial CheckinExpected with non-zero config indexes and nil UnitExpectedConfigs, since from the agent's perspective nothing has changed; but the new process actually has no configuration yet, and its own config indexes have reset, so it will try to apply the nil UnitExpectedConfigs.

For the Beats I believe this is the cause of the nil pointer dereferences that happen after an output change: elastic/beats#34137 (comment)

Member:

This additional panic and exit is going to count as a second restart in the 5s period, probably defeating the point of these changes on an intermittent basis.

	stopMsg := fmt.Sprintf("Failed: pid '%d' exited with code '%d'", state.Pid(), state.ExitCode())
	c.forceCompState(client.UnitStateFailed, stopMsg)
} else {
	c.logger.Infof("Suppressing FAILED state due to restart: pid '%d' exited with code '%d'", state.Pid(), state.ExitCode())
@cmacknz (Member) commented Dec 29, 2022

I think it might be better to transition the unit to stopped here so the state is at least accurate without being an error, or, per my comment above (https://github.com/elastic/elastic-agent/pull/2003/files#r1058695336), add some logic that will clear out the current unit states so the agent is in sync with the Beats.

Possibly a modification to or a new variant of this function:

func (s *ComponentState) forceState(state client.UnitState, msg string) bool {

@michalpristas (Contributor, Author)

Just a note: decreasing the restart timeout made the whole transition smoother.

@michalpristas michalpristas merged commit 464f9f7 into elastic:main Dec 29, 2022
mergify bot pushed a commit that referenced this pull request Dec 29, 2022
Add configurable numbness for component restarts (#2003)

(cherry picked from commit 464f9f7)

# Conflicts:
#	specs/cloudbeat.spec.yml
#	specs/filebeat.spec.yml
michalpristas added a commit that referenced this pull request Dec 29, 2022
#2025)

* Add configurable numbness for component restarts (#2003)

Add configurable numbness for component restarts (#2003)

(cherry picked from commit 464f9f7)

# Conflicts:
#	specs/cloudbeat.spec.yml
#	specs/filebeat.spec.yml

* conflicts

Co-authored-by: Michal Pristas <michal.pristas@gmail.com>