Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add stop functionality for output config changes #34049

Closed

Conversation

fearful-symmetry
Copy link
Contributor

@fearful-symmetry fearful-symmetry commented Dec 14, 2022

What does this PR do?

Fixes elastic/elastic-agent#1913

This adds a reloader component, set by a output_restart config flag, that will stop the beat when it receives a config change to the output.

This will require some other component changes before it works properly

  • The agent needs to understand that the beat will restart on output change. Currently this will just result in the beats being marked as FAILED.
  • For beats where we want this enabled, we need to add -E management.output_restart=true to the beat specfile.

Testing this is easy. Just add the aforementioned line to the specfile, run the beat, then change the config. If there's a change to the output, the beats will shut down, agent will restart with the updated config.

Why is it important?

We still want output changes to force a reload of the beat.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@fearful-symmetry fearful-symmetry added Team:Elastic-Agent Label for the Agent team backport-v8.6.0 Automated backport with mergify labels Dec 14, 2022
@fearful-symmetry fearful-symmetry self-assigned this Dec 14, 2022
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner December 14, 2022 19:28
@fearful-symmetry fearful-symmetry requested review from belimawr and cmacknz and removed request for a team December 14, 2022 19:28
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Dec 14, 2022
@cmacknz cmacknz requested a review from blakerouse December 14, 2022 19:43
@cmacknz
Copy link
Member

cmacknz commented Dec 14, 2022

The agent needs to understand that the beat will restart on output change. Currently this will just result in the beats being marked as FAILED.

Does it eventually recover from this? Or does it just stay in the failed state?

Blacklist ConfigBlacklistSettings `config:"blacklist" yaml:"blacklist"`
Enabled bool `config:"enabled" yaml:"enabled"`
Blacklist ConfigBlacklistSettings `config:"blacklist" yaml:"blacklist"`
OutputRestart bool `config:"output_restart" yaml:"output_restart"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
OutputRestart bool `config:"output_restart" yaml:"output_restart"`
RestartOnOutputChange bool `config:"restart_on_output_change" yaml:"output_restart"`

nit: Let's use the exact same name that was used in the 8.5 spec files so it is clear we just changed the implementation when comparing the two: https://github.com/elastic/elastic-agent/blob/f3d39629cc740ce7cf868a228da9e2dec1a83dec/internal/spec/filebeat.yml#L12

@@ -338,6 +346,16 @@ func (cm *BeatV2Manager) handleOutputReload(unit *client.Unit) {
}
cm.logger.Debugf("Got Output unit config '%s'", rawConfig.GetId())

// if needed, stop the beat, let agent restart
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are sure this only triggers when the actual output configuration changes? And it won't restart for changes in the unit that are not the actual output configuration, like the log level? https://github.com/elastic/elastic-agent-client/blob/15881a8e64ef95f32f688c933319cd926ff500b0/elastic-agent-client.proto#L280

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't think of that! Should probably add some kind of check to diff the actual config.

@elasticmachine
Copy link
Collaborator

elasticmachine commented Dec 14, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-12-15T22:21:57.641+0000

  • Duration: 60 min 27 sec

Test stats 🧪

Test Results
Failed 0
Passed 7362
Skipped 338
Total 7700

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Copy link
Contributor

@belimawr belimawr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I believe we should implement the change suggested by @cmacknz, I believe it also makes the overall code a bit more readable.

Aside that I added minor suggestions.

@@ -87,12 +91,16 @@ func NewV2AgentManager(config *conf.C, registry *reload.Registry, _ uuid.UUID) (
// NewV2AgentManagerWithClient actually creates the manager instance used by the rest of the beats.
func NewV2AgentManagerWithClient(config *Config, registry *reload.Registry, agentClient client.V2) (lbmanagement.Manager, error) {
log := logp.NewLogger(lbmanagement.DebugK)
if config.OutputRestart {
log.Infof("Output reload is enabled, the beat will restart as needed on change of output config")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: No need to use Infof here, just Info as you're not formatting anything.

Suggested change
log.Infof("Output reload is enabled, the beat will restart as needed on change of output config")
log.Info("Output reload is enabled, the beat will restart as needed on change of output config")

@@ -338,6 +346,16 @@ func (cm *BeatV2Manager) handleOutputReload(unit *client.Unit) {
}
cm.logger.Debugf("Got Output unit config '%s'", rawConfig.GetId())

// if needed, stop the beat, let agent restart
if cm.stopOnOutputReload && cm.outputIsConfigured {
cm.logger.Infof("beat will now stop for output reload")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same thing here

Suggested change
cm.logger.Infof("beat will now stop for output reload")
cm.logger.Info("beat will now stop for output reload")

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@fearful-symmetry
Copy link
Contributor Author

for anyone reading: holding off on this until #34066 is merged.

@cmacknz
Copy link
Member

cmacknz commented Dec 20, 2022

The agent needs to understand that the beat will restart on output change. Currently this will just result in the beats being marked as FAILED.

Does it eventually recover from this? Or does it just stay in the failed state?

Spoke with Alex about this. The FAILED state lasts a second or two while the agent detects the process is stopped and restarts. If this happens right before the next checkin with fleet server we may have the agent reported unhealthy unnecessarily, both in that the behaviour causing the failed state is intentional and that the status won't clear until the next checkin up to 5 minutes later.

To avoid this we need to debounce the process state locally as described in elastic/elastic-agent#1946

@cmacknz
Copy link
Member

cmacknz commented Dec 21, 2022

These changes have been merged into #34066, closing.

@cmacknz cmacknz closed this Dec 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.6.0 Automated backport with mergify Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Beats no longer restart automatically when the output configuration changes.
4 participants