
Shut down GRPC server gracefully #4238

Closed
pchila wants to merge 3 commits

Conversation

@pchila (Member) commented Feb 12, 2024

What does this PR do?

This PR changes the gRPC server shutdown call from Stop() to GracefulStop().

Why is it important?

This change is needed for cases where there is a race between returning the result of an action and a shutdown/restart of the agent. We have at least one such case with the elastic-agent upgrade command, where the CLI sometimes reports an error such as EOF even though the upgrade succeeds: the agent shuts down the gRPC server too quickly while restarting, and the connection with the client gets closed before the RPC result can be sent back (see #3890).
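For background, a minimal sketch of the difference between the two calls against the public google.golang.org/grpc API (the listener setup and addresses are illustrative, not code from this repo):

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", "localhost:0") // illustrative address
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	go func() {
		if err := srv.Serve(lis); err != nil {
			log.Printf("serve returned: %v", err)
		}
	}()

	// Stop() would tear down listeners and open connections immediately, so a
	// response still in flight (e.g. the upgrade action result) can be lost
	// and the client sees an error such as EOF.
	// GracefulStop() refuses new RPCs but blocks until pending RPCs have
	// returned, letting the result reach the client first.
	srv.GracefulStop()
}
```

In other words, GracefulStop gives a response that is already in flight, like the upgrade result above, a chance to reach the client before the server goes away.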

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [x] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test


@pchila added the bug label Feb 12, 2024
@pchila self-assigned this Feb 12, 2024
@pchila requested a review from a team as a code owner February 12, 2024 10:37
@pchila added the Team:Elastic-Agent-Control-Plane, Team:Elastic-Agent, and backport-v8.12.0 labels Feb 12, 2024
@elasticmachine (Contributor)

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@elasticmachine (Contributor)

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@pchila requested a review from rdner February 12, 2024 10:38
@pchila (Member, Author) commented Feb 12, 2024

This is being tested on top of the PR for #4228 and #2579 (https://buildkite.com/elastic/elastic-agent/builds/7004), as that branch seems to bring out the issue pretty consistently.

Cancelled the build on the other PR because it was slowing down the CI run too much 😞

@pchila mentioned this pull request Feb 12, 2024
@pchila (Member, Author) commented Feb 12, 2024

A simple change from Stop() to GracefulStop() seems to cause slowdowns and failures in our integration tests. We probably need to fix other parts of the agent to make sure that we still stop in time, but gracefully.
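For context on why a graceful stop can be slow (a hedged sketch, not code from this repo): GracefulStop blocks until every in-flight RPC has returned, and it does not cancel the contexts of those RPCs. A long-lived streaming handler therefore holds shutdown open unless something else makes it return:

```go
package server // illustrative package, not from this repo

import (
	"time"

	"google.golang.org/grpc"
)

// watch is a hypothetical long-lived streaming handler, written against the
// generic grpc.ServerStream interface so the sketch stays self-contained.
// GracefulStop does not cancel the contexts of in-flight RPCs, so a handler
// like this keeps shutdown blocked until the client disconnects or the
// application signals it to return.
func watch(shutdown <-chan struct{}, stream grpc.ServerStream) error {
	for {
		select {
		case <-stream.Context().Done(): // client went away, or a hard Stop()
			return stream.Context().Err()
		case <-shutdown: // hypothetical application-level shutdown signal
			return nil
		case <-time.After(time.Second):
			// produce and send the next event here, e.g. stream.SendMsg(...)
		}
	}
}
```

If the agent keeps streams like this open, a bare GracefulStop would wait on all of them, which could explain the slowdowns seen in the integration tests.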

@pchila marked this pull request as draft February 12, 2024 13:13
@pchila force-pushed the stop-grpc-server-gracefully branch from ffafeeb to e26b61e February 13, 2024 07:25
@pchila force-pushed the stop-grpc-server-gracefully branch from e26b61e to 85ed162 February 13, 2024 09:55

Comment on lines +28 to +32

```yaml
#pr: https://github.com/owner/repo/1234

# Issue URL; optional; the GitHub issue related to this changeset (either closes or is part of).
# If not present is automatically filled by the tooling with the issue linked to the PR number.
#issue: https://github.com/owner/repo/1234
```

Please add either the PR or the issue

```diff
@@ -112,10 +112,18 @@ func (s *Server) Start() error {
 // Stop stops the GRPC endpoint.
 func (s *Server) Stop() {
 	if s.server != nil {
-		s.server.Stop()
+		s.logger.Info("Stopping GRPC server...")
+		s.server.GracefulStop()
```

Perhaps there could be a timeout here, just to make sure we don't hang here forever, unless there is a timeout in an upper layer. I would just like to avoid introducing something that might leave the agent hanging during shutdown.
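One common way to implement that suggestion is to run GracefulStop in a goroutine and fall back to a hard Stop once a deadline expires. A sketch (the helper name and the timeout handling are illustrative, not part of this PR):

```go
package server // illustrative package, not from this repo

import (
	"time"

	"google.golang.org/grpc"
)

// stopWithTimeout tries a graceful stop first, but falls back to a hard stop
// if pending RPCs don't drain within the timeout, so shutdown can never hang
// forever. Names and the timeout policy are illustrative only.
func stopWithTimeout(srv *grpc.Server, timeout time.Duration) {
	done := make(chan struct{})
	go func() {
		srv.GracefulStop() // waits for in-flight RPCs to finish
		close(done)
	}()
	select {
	case <-done:
		// pending RPCs drained within the deadline
	case <-time.After(timeout):
		srv.Stop() // force-close the remaining connections
	}
}
```

Stop is safe to call while GracefulStop is still waiting: it force-closes the remaining connections, so both calls return and shutdown is bounded by the timeout.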

@pchila marked this pull request as ready for review February 22, 2024 16:46
@pchila marked this pull request as draft February 22, 2024 16:46
@rdner (Member) commented Mar 22, 2024

@pchila it's been more than a month, any updates?

@pchila (Member, Author) commented Mar 22, 2024

> @pchila it's been more than a month, any updates?

No, not really, and this is not among the priorities for the foreseeable future. I am going to close this and remove my assignment from the related issue so someone else can have a go.

@pchila closed this Mar 22, 2024
Labels
backport-v8.12.0 (Automated backport with mergify) · bug (Something isn't working) · Team:Elastic-Agent (Label for the Agent team) · Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)
Development

Successfully merging this pull request may close these issues.

The elastic-agent upgrade command can fail even though the upgrade succeeds
4 participants