
Add configurable numbness for component restarts #2003

Merged: 14 commits into elastic:main on Dec 29, 2022

Conversation

@michalpristas (Contributor) commented Dec 23, 2022

What does this PR do?

This PR reuses the token bucket implementation from v1 so I don't introduce a new component that may turn out to be buggy. The token bucket leases N tokens per period; when a caller cannot lease a token, I treat that as a sign that too many restarts have happened and report a failure.

Tested locally; behavior is correct for frequent restarts, no restarts, and restarts within limits.

For spec files of processes that have restart_on_output_change set, this PR sets:

      restart_monitoring_period: 15s
      maximum_restarts_per_period: 3

@cmacknz I took this from your example; personally I think this is too lenient and we should be stricter.
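
For illustration, here is a minimal channel-based token bucket in the spirit of that approach (a sketch only; the names and details do not match the actual v1 code):

package tokenbucket // illustrative sketch, not the v1 implementation

import "time"

// Bucket holds up to size tokens; a background ticker frees one slot
// every dropRate, so roughly size restarts are tolerated per period.
type Bucket struct {
	tokens chan struct{}
}

func New(size int, dropRate time.Duration) *Bucket {
	b := &Bucket{tokens: make(chan struct{}, size)}
	go func() {
		for range time.Tick(dropRate) {
			select {
			case <-b.tokens: // free one slot
			default: // bucket already empty
			}
		}
	}()
	return b
}

// Add tries to lease a token. A false return means the bucket is full:
// too many restarts happened within the period, so the failure should
// be reported.
func (b *Bucket) Add() bool {
	select {
	case b.tokens <- struct{}{}:
		return true
	default:
		return false
	}
}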

Why is it important?

It avoids an Unhealthy status for expected behavior: today the agent turns unhealthy when a component restarts to apply an output change.

Checklist

  • My code follows the style guidelines of this project

  • I have commented my code, particularly in hard-to-understand areas

  • I have made corresponding changes to the documentation

  • I have made corresponding changes to the default configuration files

  • I have added tests that prove my fix is effective or that my feature works

  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

  • Closes Add specification file option to ignore unexpected process exits  #1993

@michalpristas added labels: bug (Something isn't working), Team:Elastic-Agent-Control-Plane, Team:Elastic-Agent, skip-changelog, backport-v8.6.0 (automated backport with mergify) on Dec 23, 2022
@michalpristas michalpristas self-assigned this Dec 23, 2022
@michalpristas michalpristas requested a review from a team as a code owner December 23, 2022 16:47
@michalpristas michalpristas requested review from AndersonQ and blakerouse and removed request for a team December 23, 2022 16:47
// or more contributor license agreements. Licensed under the Elastic License;
// you may not use this file except in compliance with the Elastic License.

package tokenbucket
Member:

Could we use a preexisting implementation? https://pkg.go.dev/golang.org/x/time/rate perhaps?

https://pkg.go.dev/golang.org/x/time/rate#Limiter specifically is a token bucket rate limiter.

Disclaimer that I've never used this so it might be unsuitable.
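
For example, a minimal sketch (not the PR's code) of modeling the restart budget with rate.Limiter; the monitoring period and restart count map onto the limiter's refill rate and burst size:

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	period := 15 * time.Second // restart_monitoring_period
	maxRestarts := 3           // maximum_restarts_per_period

	// Burst of maxRestarts tokens, refilling one every period/maxRestarts.
	limiter := rate.NewLimiter(rate.Every(period/time.Duration(maxRestarts)), maxRestarts)

	for i := 0; i < 5; i++ {
		if limiter.Allow() {
			fmt.Printf("restart %d: within budget, suppress FAILED\n", i)
		} else {
			fmt.Printf("restart %d: budget exhausted, report failure\n", i)
		}
	}
}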

Contributor (Author):

I don't have experience with that package either, but it seems a bit like overkill.

Member:

The API is more complex, but the advantage is that we don't have to maintain it, and since it is a golang.org/x package we can trust it to be generally correct and kept up to date.

Member:

I think it is worth trying to replace your implementation with this one; I suspect the maintenance cost will be lower over time, especially if there's any chance we start adding features to this tokenbucket package in the future.

@elasticmachine (Contributor) commented Dec 23, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-12-29T07:55:17.131+0000

  • Duration: 16 min 57 sec

Test stats 🧪

Test Results
  Failed:  0
  Passed:  4777
  Skipped: 13
  Total:   4790

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@michalpristas (Contributor, Author)

Will fix the unit tests on Tuesday.

reportFailure := true
if c.restartBucket != nil {
	// report failure only if the bucket is full of restart events
	reportFailure = tryAddToBucket(c.restartBucket)
}
Member:

I think we may want to log when we are suppressing the failed state as well so we can observe this working properly.

Member:

This would also let us notice when a restart we should not have suppressed gets suppressed.

@cmacknz (Member) commented Dec 23, 2022

The logs are reasonably verbose when this restart happens:

#1913 (comment)

I will be curious to see what the logs and state transitions look like when we suppress the transition to the failed state.

@elasticmachine (Contributor) commented Dec 27, 2022

🌐 Coverage report

Name          Metrics % (covered/total)   Diff
Packages      98.305%  (58/59)            👍
Files         69.268%  (142/205)          👍
Classes       69.231%  (270/390)          👍
Methods       54.158%  (827/1527)         👍 0.03
Lines         39.305%  (8989/22870)       👍 0.05
Conditionals  100.0%   (0/0)              💚

go.mod (outdated)
@@ -50,6 +50,7 @@ require (
 	golang.org/x/lint v0.0.0-20210508222113-6edffad5e616
 	golang.org/x/sync v0.0.0-20210220032951-036812b2e83c
 	golang.org/x/sys v0.0.0-20220715151400-c0bba94af5f8
+	golang.org/x/time v0.0.0-20210723032227-1f47c861a9ac
@cmacknz (Member) commented Dec 28, 2022

This commit is behind the head of the package (https://cs.opensource.google/go/x/time;bpv=1): it's from 2021-07-22 23:22, while the latest is from 2022-11-16 10:19.

Pin this to v0.3.0, which is also the latest commit in the package.
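
With standard Go module tooling, the pin would look something like:

go get golang.org/x/time@v0.3.0
go mod tidy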


for name, tc := range testCases {
	t.Run(name, func(t *testing.T) {
		dropRate := 500 * time.Millisecond
Member:

Consider using shorter intervals, this entire test suite takes 6s to run which seems excessive:

ok      github.com/elastic/elastic-agent/pkg/component/runtime  6.292s

Dividing everything by 10 works (so does 100 but maybe more chance of test flakiness in CI with some 100 microsecond intervals):

diff --git a/pkg/component/runtime/command_test.go b/pkg/component/runtime/command_test.go
index ef361b33b..ca4ee184c 100644
--- a/pkg/component/runtime/command_test.go
+++ b/pkg/component/runtime/command_test.go
@@ -18,22 +18,22 @@ func TestAddToBucket(t *testing.T) {
                addSleep    time.Duration
                shouldBlock bool
        }{
-               "no error":           {1, 0, 10 * time.Millisecond, false},
-               "error within limit": {1, 1, 10 * time.Millisecond, false},
-               "errors > than limit but across timespans":                {1, 2, 800 * time.Millisecond, false},
-               "errors > than limit within timespans, exact bucket size": {2, 2, 20 * time.Millisecond, false},
-               "errors > than limit within timespans, off by one":        {2, 3, 20 * time.Millisecond, true},
-               "errors > than limit within timespans":                    {2, 4, 20 * time.Millisecond, true},
+               "no error":           {1, 0, 1 * time.Millisecond, false},
+               "error within limit": {1, 1, 1 * time.Millisecond, false},
+               "errors > than limit but across timespans":                {1, 2, 80 * time.Millisecond, false},
+               "errors > than limit within timespans, exact bucket size": {2, 2, 2 * time.Millisecond, false},
+               "errors > than limit within timespans, off by one":        {2, 3, 2 * time.Millisecond, true},
+               "errors > than limit within timespans":                    {2, 4, 2 * time.Millisecond, true},
        }

        for name, tc := range testCases {
                t.Run(name, func(t *testing.T) {
-                       dropRate := 500 * time.Millisecond
+                       dropRate := 50 * time.Millisecond
                        b := newRateLimiter(dropRate, tc.bucketSize)

                        blocked := false
                        b.Allow()
-                       <-time.After(dropRate + 200*time.Millisecond) // init ticker
+                       <-time.After(dropRate + 20*time.Millisecond) // init ticker

                        for i := 0; i < tc.add; i++ {
                                wasBlocked := !b.Allow()

This gives:

ok      github.com/elastic/elastic-agent/pkg/component/runtime  0.909s

@cmacknz (Member) left a review:

Looks good. I would recommend at least updating the version of the rate package before merging, though.

Comment on lines +21 to +22:

      restart_monitoring_period: 5s
      maximum_restarts_per_period: 1
Member:

I've noticed there is a restart period we can configure that controls how long the process stays in the failed state before it is restarted. We should lower this for Beats that expect to restart on their own.

Defined here, defaults to 10s:

// CommandTimeoutSpec is the timeout specification for subprocess.
type CommandTimeoutSpec struct {
	Checkin time.Duration `config:"checkin" yaml:"checkin"`
	Restart time.Duration `config:"restart" yaml:"restart"`
	Stop    time.Duration `config:"stop" yaml:"stop"`
}

Used here:

case ps := <-c.procCh:
	// ignores old processes
	if ps.proc == c.proc {
		c.proc = nil
		if c.handleProc(ps.state) {
			// start again after restart period
			t.Reset(restartPeriod)
		}
	}

I would suggest lowering this to 1s (or maybe even lower) for all of the Beats, otherwise there is a 10s period where there is no monitoring.
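
A hypothetical spec excerpt (the placement of the timeouts block is an assumption based on the struct tags above, not verified against the spec schema):

command:
  timeouts:
    # assumption: restart the process ~1s after an expected exit
    # instead of the 10s default
    restart: 1s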

Member:

In the original implementation it looks like the restart on output change was immediate: https://github.com/elastic/beats/pull/24907/files#diff-4ef060299d18c75dcda518842dfc08e553e892a4fc1e9b2ff4e4a03640be3fff

I would make this restart period as low as possible to minimize the time where the Beat isn't actually running when it should be running.


if reportFailure {
	stopMsg := fmt.Sprintf("Failed: pid '%d' exited with code '%d'", state.Pid(), state.ExitCode())
	c.forceCompState(client.UnitStateFailed, stopMsg)
Member:

Unrelated to this PR, but more of an FYI. When the process restarts I don't think the agent clears any of the previous component state.

This causes the agent to send an initial CheckinExpected with non-zero config indexes and nil UnitExpectedConfigs, since from the agent's perspective nothing has changed; but the new process actually has no configuration yet, and its own config indexes have reset, so it will try to apply the nil UnitExpectedConfigs.

For the Beats I believe this is the cause of the nil pointer dereferences that happen after an output change: elastic/beats#34137 (comment)

Member:

This additional panic and exit is going to count as a second restart in the 5s period, probably defeating the point of these changes on an intermittent basis.

	stopMsg := fmt.Sprintf("Failed: pid '%d' exited with code '%d'", state.Pid(), state.ExitCode())
	c.forceCompState(client.UnitStateFailed, stopMsg)
} else {
	c.logger.Infof("Suppressing FAILED state due to restart: pid '%d' exited with code '%d'", state.Pid(), state.ExitCode())
@cmacknz (Member) commented Dec 29, 2022

I think it might be better to transition the unit to stopped here so the state is at least accurate without being an error, or, per my comment above (https://github.com/elastic/elastic-agent/pull/2003/files#r1058695336), add some logic that will clear out the current unit states so the agent is in sync with the Beats.

Possibly a modification to or a new variant of this function:

func (s *ComponentState) forceState(state client.UnitState, msg string) bool {

@michalpristas (Contributor, Author)

Just a note: decreasing the restart timeout made the whole transition smoother.

@michalpristas michalpristas merged commit 464f9f7 into elastic:main Dec 29, 2022
mergify bot pushed a commit that referenced this pull request Dec 29, 2022
Add configurable numbness for component restarts (#2003)

(cherry picked from commit 464f9f7)

# Conflicts:
#	specs/cloudbeat.spec.yml
#	specs/filebeat.spec.yml
michalpristas added a commit that referenced this pull request Dec 29, 2022
#2025)

* Add configurable numbness for component restarts (#2003)

Add configurable numbness for component restarts (#2003)

(cherry picked from commit 464f9f7)

# Conflicts:
#	specs/cloudbeat.spec.yml
#	specs/filebeat.spec.yml

* conflicts

Co-authored-by: Michal Pristas <michal.pristas@gmail.com>