[Kubernetes leader election] Run leader elector at all times #4542
If there is a cluster error resulting in the leader continuously losing the lease, will this result in attempts to acquire it as quickly as possible with no rate limit?
The implementation of Run that I see is:
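(Roughly, from k8s.io/client-go/tools/leaderelection — paraphrased from memory rather than quoted from the exact client-go version in use, but the shape is the important part:)

```go
// Paraphrase of client-go's LeaderElector.Run; not a verbatim copy.
func (le *LeaderElector) Run(ctx context.Context) {
	defer runtime.HandleCrash()
	defer le.config.Callbacks.OnStoppedLeading()

	// acquire blocks until the lease is obtained or ctx is cancelled;
	// its retries are paced internally by RetryPeriod.
	if !le.acquire(ctx) {
		return
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	go le.config.Callbacks.OnStartedLeading(ctx)
	// renew keeps refreshing the lease and returns once the lease is lost.
	le.renew(ctx)
}
```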
There have been several escalations showing that failing to appropriately rate limit k8s control plane API calls like leader election can destabilize clusters.
What testing have we done to ensure this change won't cause issues like this?
We have already been doing this. This change will not affect the number of calls; it just makes sure that at least one agent will be reporting metrics. The problem with the current implementation is that run goes like this: the leader elector is started once, and when the lease is lost it returns, so that instance never tries to acquire the lease again. All the while, all the other instances were trying to acquire the lease already.
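For illustration only (this is not the provider's actual code): a minimal, self-contained sketch of the "run the elector at all times" pattern with client-go. The lease name, namespace, identity source, and timing values are made up. The point is that after the lease is lost the instance simply re-enters the election, and acquisition attempts are still paced by RetryPeriod, so looping does not add unthrottled API calls.

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runForever keeps this agent in the election at all times: RunOrDie returns
// only after the lease is lost (or ctx is cancelled), so we loop and re-enter
// the election instead of staying out of it until the pod restarts.
func runForever(ctx context.Context, client kubernetes.Interface, identity string) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "elastic-agent-cluster-leader", // hypothetical lease name
			Namespace: "kube-system",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: identity},
	}

	for ctx.Err() == nil {
		leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
			Lock:          lock,
			LeaseDuration: 15 * time.Second, // illustrative values only
			RenewDeadline: 10 * time.Second,
			RetryPeriod:   2 * time.Second,
			Callbacks: leaderelection.LeaderCallbacks{
				OnStartedLeading: func(ctx context.Context) { /* start reporting cluster metrics */ },
				OnStoppedLeading: func() { /* stop reporting cluster metrics */ },
			},
		})
		// Lease lost: loop and try to acquire it again; attempts inside
		// RunOrDie are rate limited by RetryPeriod.
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	runForever(context.Background(), client, os.Getenv("POD_NAME"))
}
```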
We do not have many SDHs on this bug, which leads me to believe that it is rare for an agent to lose the lease. But we do have problems when agents stop reporting metrics, and the only way we knew to restart that was to make the pod run again, that is, to force run() to run again.
There are some parameters in the config regarding the lease:

elastic-agent/internal/pkg/composable/providers/kubernetesleaderelection/config.go, lines 17 to 19 in 43cb148

These can be tuned if it becomes necessary to reduce the number of times an agent tries to acquire it.
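The referenced lines are not reproduced here; as a rough idea of the kind of knobs involved (field names, config keys, and defaults below are assumptions, not a copy of the actual file), leader-election timing config typically looks like:

```go
// Illustrative only: names and comments are assumptions, not the actual
// contents of the referenced config.go lines.
package kubernetesleaderelection

import "time"

type Config struct {
	// How long a lease is valid before other candidates can take it over.
	LeaseDuration time.Duration `config:"leader_leaseduration"`
	// How long the current leader keeps trying to renew before giving up leadership.
	RenewDeadline time.Duration `config:"leader_renewdeadline"`
	// How long each candidate waits between acquire/renew attempts; raising this
	// directly reduces how often every agent hits the API server.
	RetryPeriod time.Duration `config:"leader_retryperiod"`
}
```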