Add configurable numbness for component restarts #2003
Conversation
// or more contributor license agreements. Licensed under the Elastic License;
// you may not use this file except in compliance with the Elastic License.

package tokenbucket
Could we use a preexisting implementation? https://pkg.go.dev/golang.org/x/time/rate perhaps?
https://pkg.go.dev/golang.org/x/time/rate#Limiter specifically is a token bucket rate limiter.
Disclaimer that I've never used this so it might be unsuitable.
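For reference, a minimal, untested sketch of using rate.Limiter that way (the 5s period and burst of 1 mirror the restart_monitoring_period and maximum_restarts_per_period values proposed in this PR):

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Refill one token every 5s with a burst of 1: at most one restart per
	// monitoring period is tolerated before Allow() starts returning false.
	limiter := rate.NewLimiter(rate.Every(5*time.Second), 1)

	for i := 0; i < 3; i++ {
		if limiter.Allow() {
			fmt.Println("restart within budget, suppress FAILED state")
		} else {
			fmt.Println("too many restarts, report FAILED")
		}
	}
}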
I don't have any experience with that package either, but it seems a bit like overkill.
The API is more complex, but the advantage is that we don't have to maintain it, and given it is a golang.org/x package we can trust it to generally be correct and kept up to date.
I think it is worth trying to replace your implementation with this one; I suspect its maintenance cost will be lower over time if there's any chance we start adding features to this tokenbucket package in the future.
Will fix the UT on Tuesday.
pkg/component/runtime/command.go
Outdated
reportFailure := true
if c.restartBucket != nil {
	// report failure only if bucket is full of restart events
	reportFailure = tryAddToBucket(c.restartBucket)
I think we may want to log when we are suppressing the failed state as well so we can observe this working properly.
This would also let us notice when a restart we should not have suppressed gets suppressed.
The logs are reasonably verbose when this restart happens: I will be curious to see what the logs and state transitions look like when we suppress the transition to the failed state.
🌐 Coverage report
go.mod
Outdated
@@ -50,6 +50,7 @@ require (
	golang.org/x/lint v0.0.0-20210508222113-6edffad5e616
	golang.org/x/sync v0.0.0-20210220032951-036812b2e83c
	golang.org/x/sys v0.0.0-20220715151400-c0bba94af5f8
	golang.org/x/time v0.0.0-20210723032227-1f47c861a9ac
This commit is behind the head of the package: https://cs.opensource.google/go/x/time;bpv=1. It's from 2021-07-22 23:22; the latest is from 2022-11-16 10:19.
Pin this to v0.3.0, which is also the latest commit in the package.
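For example, from the repository root the usual module workflow should do it:

go get golang.org/x/time@v0.3.0

which updates the require entry to golang.org/x/time v0.3.0.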
for name, tc := range testCases {
	t.Run(name, func(t *testing.T) {
		dropRate := 500 * time.Millisecond
Consider using shorter intervals; this entire test suite takes 6s to run, which seems excessive:
ok github.com/elastic/elastic-agent/pkg/component/runtime 6.292s
Dividing everything by 10 works (so does 100, but with some 100-microsecond intervals there may be more chance of test flakiness in CI):
diff --git a/pkg/component/runtime/command_test.go b/pkg/component/runtime/command_test.go
index ef361b33b..ca4ee184c 100644
--- a/pkg/component/runtime/command_test.go
+++ b/pkg/component/runtime/command_test.go
@@ -18,22 +18,22 @@ func TestAddToBucket(t *testing.T) {
addSleep time.Duration
shouldBlock bool
}{
- "no error": {1, 0, 10 * time.Millisecond, false},
- "error within limit": {1, 1, 10 * time.Millisecond, false},
- "errors > than limit but across timespans": {1, 2, 800 * time.Millisecond, false},
- "errors > than limit within timespans, exact bucket size": {2, 2, 20 * time.Millisecond, false},
- "errors > than limit within timespans, off by one": {2, 3, 20 * time.Millisecond, true},
- "errors > than limit within timespans": {2, 4, 20 * time.Millisecond, true},
+ "no error": {1, 0, 1 * time.Millisecond, false},
+ "error within limit": {1, 1, 1 * time.Millisecond, false},
+ "errors > than limit but across timespans": {1, 2, 80 * time.Millisecond, false},
+ "errors > than limit within timespans, exact bucket size": {2, 2, 2 * time.Millisecond, false},
+ "errors > than limit within timespans, off by one": {2, 3, 2 * time.Millisecond, true},
+ "errors > than limit within timespans": {2, 4, 2 * time.Millisecond, true},
}
for name, tc := range testCases {
t.Run(name, func(t *testing.T) {
- dropRate := 500 * time.Millisecond
+ dropRate := 50 * time.Millisecond
b := newRateLimiter(dropRate, tc.bucketSize)
blocked := false
b.Allow()
- <-time.After(dropRate + 200*time.Millisecond) // init ticker
+ <-time.After(dropRate + 20*time.Millisecond) // init ticker
for i := 0; i < tc.add; i++ {
wasBlocked := !b.Allow()
This gives: ok github.com/elastic/elastic-agent/pkg/component/runtime 0.909s
Looks good. I would recommend at least updating the version of the rate package before merging, though.
restart_monitoring_period: 5s
maximum_restarts_per_period: 1
I've noticed there is a restart period we can configure that controls how long the process stays in the failed state before it is restarted. We should lower this for Beats that expect to restart on their own.
Defined here, defaults to 10s:
elastic-agent/pkg/component/spec.go
Lines 90 to 95 in 668fa22
// CommandTimeoutSpec is the timeout specification for subprocess.
type CommandTimeoutSpec struct {
	Checkin time.Duration `config:"checkin" yaml:"checkin"`
	Restart time.Duration `config:"restart" yaml:"restart"`
	Stop    time.Duration `config:"stop" yaml:"stop"`
}
Used here:
elastic-agent/pkg/component/runtime/command.go
Lines 134 to 142 in 668fa22
case ps := <-c.procCh:
	// ignores old processes
	if ps.proc == c.proc {
		c.proc = nil
		if c.handleProc(ps.state) {
			// start again after restart period
			t.Reset(restartPeriod)
		}
	}
I would suggest lowering this to 1s (or maybe even lower) for all of the Beats; otherwise there is a 10s period where there is no monitoring.
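For illustration, a rough sketch of how that could look in a Beat's spec file, assuming the timeouts block nests under command as suggested by the config tags in CommandTimeoutSpec (the surrounding keys are only there for context, not taken from this PR):

command:
  restart_monitoring_period: 5s
  maximum_restarts_per_period: 1
  timeouts:
    # lowered from the 10s default so the Beat is restarted (and monitored
    # again) almost immediately after it exits to apply an output change
    restart: 1s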
In the original implementation it looks like the restart on output change was immediate: https://github.com/elastic/beats/pull/24907/files#diff-4ef060299d18c75dcda518842dfc08e553e892a4fc1e9b2ff4e4a03640be3fff
I would make this restart period as low as possible to minimize the time where the Beat isn't actually running when it should be running.
if reportFailure {
	stopMsg := fmt.Sprintf("Failed: pid '%d' exited with code '%d'", state.Pid(), state.ExitCode())
	c.forceCompState(client.UnitStateFailed, stopMsg)
Unrelated to this PR, but more of an FYI: when the process restarts, I don't think the agent clears any of the previous component state.
This causes the agent to send an initial CheckinExpected with non-zero config indexes and nil UnitExpectedConfigs, since from the agent's perspective nothing has changed; but the new process actually has no configuration yet and its own config indexes have reset, so it will try to apply the nil UnitExpectedConfigs.
For the Beats I believe this is the cause of the nil pointer dereferences that happen after an output change: elastic/beats#34137 (comment)
This additional panic and exit is going to count as a second restart in the 5s period, probably defeating the point of these changes on an intermittent basis.
pkg/component/runtime/command.go
Outdated
	stopMsg := fmt.Sprintf("Failed: pid '%d' exited with code '%d'", state.Pid(), state.ExitCode())
	c.forceCompState(client.UnitStateFailed, stopMsg)
} else {
	c.logger.Info("Suppressing FAILED state due to restart for '%d' exited with code '%d'", state.Pid(), state.ExitCode())
I think it might be better to transition the unit to stopped here so it is at least accurate without being an error, or, per my comment above in https://github.com/elastic/elastic-agent/pull/2003/files#r1058695336, add some logic that will clear out the current unit states so the agent is in sync with the Beats.
Possibly a modification to or a new variant of this function:
elastic-agent/pkg/component/runtime/state.go
Line 377 in 668fa22
func (s *ComponentState) forceState(state client.UnitState, msg string) bool {
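As a rough, untested sketch of the first option (it reuses forceCompState and the logger from the diff above; whether STOPPED is the right transient state, versus clearing unit state entirely, is exactly what this comment leaves open):

if reportFailure {
	stopMsg := fmt.Sprintf("Failed: pid '%d' exited with code '%d'", state.Pid(), state.ExitCode())
	c.forceCompState(client.UnitStateFailed, stopMsg)
} else {
	// Hypothetical variant: record why the FAILED state is suppressed and move
	// the component to STOPPED rather than leaving the previous state in place
	// while the process restarts.
	msg := fmt.Sprintf("Suppressing FAILED state: pid '%d' exited with code '%d' within the restart budget", state.Pid(), state.ExitCode())
	c.logger.Info(msg)
	c.forceCompState(client.UnitStateStopped, msg)
}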
Just a note: decreasing the restart timeout made the whole transition smoother.
What does this PR do?
This uses the token bucket implementation from v1 so I don't introduce a new component that may turn out to be buggy.
The token bucket leases N tokens per period. When the caller cannot lease a token, I treat that as a sign that too many restarts have happened and report a failure.
Tested locally; works for frequent restarts, no restarts, and restarts within limits.
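To make that decision concrete, a hypothetical sketch of the check (assuming the x/time/rate limiter suggested in the review; the actual tryAddToBucket may differ):

// Returns true when no token can be leased, i.e. the component has already
// restarted maximum_restarts_per_period times within restart_monitoring_period,
// so this exit should be reported as FAILED rather than suppressed.
func tryAddToBucket(b *rate.Limiter) bool {
	return !b.Allow()
}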
For spec files of processes that have restart_on_output_change set, this PR sets restart_monitoring_period: 5s and maximum_restarts_per_period: 1. @cmacknz, I took this from your example; personally I think this is too benevolent and we should be more strict.
Why is it important?
This avoids an Unhealthy status for expected behavior.
The agent turns unhealthy when a component restarts to apply an output change.
Checklist
My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc

Closes Add specification file option to ignore unexpected process exits #1993