
Retry Attempts are not available in client-stats #8299


Open
davinci26 opened this issue May 7, 2025 · 12 comments · May be fixed by #8342
Labels
Area: Observability · P1 · Type: Bug

Comments

@davinci26

Clients can configure non-transparent retries either via xDS or via ServiceConfig; however, the stats handler has no good way to determine whether a particular RPC attempt is a retry or not.

This type of observability is quite useful for determining whether retry storms (due to poor configuration) are causing incidents and/or whether retries are actually effective.

I thought of looking at the retry metadata header "grpc-previous-rpc-attempts", but that one is added by the http2_client as a header after the HandleRPC method is called, so it didn't yield any results.
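For reference, a rough sketch of what I tried (the handler type and package name are illustrative, not anything in grpc-go):

package retrystats

import (
	"context"
	"log"

	"google.golang.org/grpc/stats"
)

// retryAwareHandler is an illustrative custom stats.Handler that tries to
// read the retry header from the outgoing request headers.
type retryAwareHandler struct{}

func (retryAwareHandler) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context   { return ctx }
func (retryAwareHandler) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context { return ctx }
func (retryAwareHandler) HandleConn(context.Context, stats.ConnStats)                       {}

func (retryAwareHandler) HandleRPC(ctx context.Context, s stats.RPCStats) {
	if out, ok := s.(*stats.OutHeader); ok {
		// This stays empty in practice: the transport appends
		// "grpc-previous-rpc-attempts" to the wire headers only after
		// HandleRPC has already been called for OutHeader.
		if vals := out.Header.Get("grpc-previous-rpc-attempts"); len(vals) > 0 {
			log.Printf("attempt header %q for %s", vals[0], out.FullMethod)
		}
	}
}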

My first question is: Is this actually true or is there a way to get o11y on the non-transparent retries from the client?

My second question is: if it is not available, numRetries is known at the point where the stats.Begin struct is constructed, so if you are open to extending stats.Begin, we could add it there.

Looking forward to your thoughts, and I'm also happy to implement the feature if it is not available.

davinci26 added the Type: Feature label May 7, 2025
arjan-bal added the Area: Observability label May 8, 2025
@purnesh42H
Contributor

purnesh42H commented May 8, 2025

My first question is: Is this actually true or is there a way to get o11y on the non-transparent retries from the client?

We don't have retry metrics yet. There is a proposal in progress: https://github.com/grpc/proposal/pull/488/files. A metric grpc.client.call.retries is proposed which has a label grpc.retry_type that can take one of three values: "retry", "hedge", or "transparent". Please take a look at the proposal to see if it has everything you need. Tagging @yashykt in case you have more questions regarding the proposal.

In tracing, though, we have a way to identify whether an RPC attempt was a transparent retry or not. When the span is created, an attribute with key transparent-retry is set to a boolean value indicating whether the stream is undergoing a transparent retry. Tracing is still experimental. You can enable tracing through dial options: https://github.com/grpc/grpc-go/blob/master/examples/features/opentelemetry/client/main.go#L65

@davinci26
Author

In tracing, though, we have a way to identify whether an RPC attempt was a transparent retry or not.

Seems reasonable; I have to check our implementation of tracing to see whether we include this or not, but thanks for the pointer.

We don't have retry metrics yet. There is a proposal in progress https://github.com/grpc/proposal/pull/488/files. A metric grpc.client.call.retries is proposed which has a label grpc.retry_type that can take one of three values: "retry", "hedge", or "transparent".

Cool proposal, you guys are one step ahead :D. Everything in the proposal makes sense.

I haven't thought this through fully, so take it with a grain of salt, but given the proposal above, and the fact that the OTel implementation in grpc-go is a wrapper around StatsHandler, would it make sense to extend stats.Begin with:

  1. numRetries
  2. Whether an RPC is a hedge or not

That way consumers who have an implementation of the stats handler can implement similar metrics as well.
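Concretely, something along these lines; the two added fields are purely hypothetical and the names are just a suggestion:

// In package stats; existing fields abbreviated, hypothetical additions marked.
type Begin struct {
	Client                    bool
	BeginTime                 time.Time
	FailFast                  bool
	IsTransparentRetryAttempt bool
	// ... other existing fields unchanged ...

	// NumRetries (proposed, not in grpc-go today): number of previous
	// non-transparent attempts for this RPC, i.e. clientStream.numRetries.
	NumRetries int
	// IsHedgingAttempt (proposed, not in grpc-go today): whether this
	// attempt was issued by a hedging policy, if/when hedging lands.
	IsHedgingAttempt bool
}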

@purnesh42H
Contributor

@davinci26 The OpenTelemetry (OTel) plugin provides trace attributes for RPC attempts that help in tracking retries. As you can see in https://github.com/grpc/grpc-go/blob/master/stats/opentelemetry/trace.go#L50-L51 (in the populateSpan function for *stats.Begin), these include:

  • previous-rpc-attempts: This attribute indeed indicates the number of prior attempts for the RPC. The OTel stats.Handler currently calculates this value internally (using ai.previousRPCAttempts).
  • transparent-retry: This boolean flag is directly sourced from stats.Begin.IsTransparentRetryAttempt and clearly signals if the attempt is a gRPC-handled transparent retry.

And yes, for retry-specific "metrics", there's a proposal in progress.

stats.Handler is experimental and for common observability requirements, it's not recommended as the primary solution for most users. The preferred path is to identify what's missing in the official plugins (like the OTel or OpenCensus ones) and work to add those features if they benefit a broad range of users. For users who need very specific, custom metrics that aren't covered by standard tooling, the stats.Handler interface is precisely the mechanism gRPC-Go provides for such advanced customization. Information like numRetries and the type of retry (e.g., standard vs. hedged) are common requirements. The goal is to make such essential data easily accessible, ideally out-of-the-box via the OTel plugin.
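For illustration, recording those two attributes with the OpenTelemetry Go API looks roughly like this (a sketch, not the actual plugin code; recordRetryAttributes is an illustrative helper operating on an already-started attempt span):

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// recordRetryAttributes mirrors the two attempt-span attributes described
// in gRFC A72: the number of prior attempts and the transparent-retry flag.
func recordRetryAttributes(span trace.Span, previousAttempts uint32, transparent bool) {
	span.SetAttributes(
		attribute.Int64("previous-rpc-attempts", int64(previousAttempts)),
		attribute.Bool("transparent-retry", transparent),
	)
}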

@davinci26
Author

previous-rpc-attempts: This attribute indeed indicates the number of prior attempts for the RPC. The OTel stats.Handler currently calculates this value internally (using ai.previousRPCAttempts).

Does this include service-level retries? I might be missing something super obvious here, but I see that:

ai := &attemptInfo{
    ....
}

is added to the ctx when the RPC is tagged, so retries would have a different ctx. Basically, I can't understand how

atomic.AddUint32(&ai.previousRPCAttempts, 1)
can be called twice for the same RPC for non-transparent retries.

stats.Handler is experimental and for common observability requirements, it's not recommended as the primary solution for most users.

100% agree, stats.Handler is a pretty low-level interface that only makes sense for a limited subset of folks, and 99% of users would be better served by the standard OTel plugin. But I am not sure whether we agree on the next bit: should we add numRetries to the stats.Begin struct? I can see this being helpful both for the OTel plugin, since it would simplify the code, and for the few of us who work with stats.Handler directly.

@purnesh42H
Contributor

purnesh42H commented May 16, 2025

@davinci26 you are right about the problem with attemptInfo. Each RPC attempt gets a new attemptInfo, so it is wrong to maintain the counter of previous RPC attempts inside it. This is a bug. I think the right way to do this is to funnel clientStream.numRetries all the way to HandleRPC and record the value in "previous-rpc-attempts". clientStream.numRetries exclusively counts non-transparent retries.

Once we do that, we don't have to make it separately available in stats.Begin. Although I guess it's fine even if we add numRetries to stats.Begin as an exported field? @dfawley do you see any issue with exposing numRetries in stats.Begin?

purnesh42H added the Type: Bug and P1 labels and removed the Type: Feature label May 16, 2025
@dfawley
Member

dfawley commented May 16, 2025

Retry attempts can be counted by storing data in the context in a per-call interceptor, and incrementing it in TagRPC.

@purnesh42H
Contributor

Retry attempts can be counted by storing data in the context in a per-call interceptor, and incrementing it in TagRPC.

What is the issue with funneling cs.numRetries? We also need to make sure to exclude transparent retries: https://github.com/grpc/proposal/blob/master/A72-open-telemetry-tracing.md#tracing-information. clientStream.numRetries already has what we want.

@dfawley
Member

dfawley commented May 16, 2025

Sorry, I missed the bit about excluding transparent retries. You can wait to increment until stats.Begin happens, then, and check that struct's IsTransparentRetryAttempt field.

What is the issue with funneling cs.numRetries?

Generally, adding things requires proof that it is either necessary or sufficiently convenient. So the question should be the other way around: what is the issue with doing it the way that is already possible?

Also... we don't support hedging yet, but when/if we do, those attempts likely wouldn't be included in clientStream.numRetries, and it would be wrong not to count hedging attempts in our tracing instrumentation.
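Putting the two suggestions above together, a minimal sketch of that approach (all names are illustrative, not existing grpc-go API; error handling and a matching stream interceptor for streaming calls are omitted):

package retrycount

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/stats"
)

type ctxKey struct{}

// attemptCounter counts non-transparent attempts seen for a single call.
type attemptCounter struct {
	n atomic.Uint32
}

// UnaryCallInterceptor runs once per call (not per attempt) and plants a
// counter in the context that every attempt of that call will inherit.
func UnaryCallInterceptor(ctx context.Context, method string, req, reply any,
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	ctx = context.WithValue(ctx, ctxKey{}, &attemptCounter{})
	return invoker(ctx, method, req, reply, cc, opts...)
}

// Handler increments the counter once per non-transparent attempt.
type Handler struct{}

func (Handler) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context   { return ctx }
func (Handler) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context { return ctx }
func (Handler) HandleConn(context.Context, stats.ConnStats)                       {}

func (Handler) HandleRPC(ctx context.Context, s stats.RPCStats) {
	b, ok := s.(*stats.Begin)
	if !ok || !b.Client {
		return
	}
	c, ok := ctx.Value(ctxKey{}).(*attemptCounter)
	if !ok {
		return // the call did not go through the interceptor
	}
	if b.IsTransparentRetryAttempt {
		return // per A72, transparent retries are excluded from the count
	}
	// Attempts counted before this one == number of non-transparent retries.
	previousAttempts := c.n.Add(1) - 1
	_ = previousAttempts // record as a metric or span attribute here
}

Wired up with grpc.WithChainUnaryInterceptor(UnaryCallInterceptor) and grpc.WithStatsHandler(Handler{}), HandleRPC then sees every attempt of a call with the same counter in its context.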

@davinci26
Author

davinci26 commented May 16, 2025

Generally, adding things requires proof that it is either necessary or sufficiently convenient. So the question should be the other way around: what is the issue with doing it the way that is already possible?

Let me code up the suggestion above, so take this statement with 75% confidence, but where my head is at is that this is fairly complicated/non-ergonomic. I appreciate the aversion to extending the API surface, but "retry attempts can be counted by storing data in the context in a per-call interceptor, and incrementing it in TagRPC", plus "wait to increment when stats.Begin happens and check that struct's IsTransparentRetryAttempt field", seems fairly easy to get wrong for something that is pretty useful o11y for service owners. Am I retrying too much?

Edit: I think where I am coming from is that I already got this wrong once and couldn't think of a way to count the retries, hence my question; and then we realized that even the grpc-go implementation didn't get this 100% right, so maybe we can do better and make it easier for folks to implement this.

@dfawley
Member

dfawley commented May 16, 2025

The stats package's API is mainly intended for internal usage or for power users. Optimal usability is not really a goal (it already has a ton of rough edges).

Our goal is that 99% of users should be able to use the OpenCensus or OpenTelemetry plugins, instead, and that those should be easy to configure and use.

@davinci26
Author

Makes sense, @dfawley, thanks for the feedback and the pointer!

I still have my reservations:

  • Recommending this context-based approach to power users is going to create an implicit API for grpc-go that is much harder to test and modify on the OSS side. But I think this tradeoff is yours to decide; I am happy to support either outcome.
  • I think it would simplify the implementation of both the OpenCensus/OpenTelemetry plugins and the handlers of power users.

I think the main advantage on my side is that this method is available right now with the current release, so I ended up implementing it. It works well so far.

(The issue is closed on my side, but I think it is being used to track the bug in the OTel plugin, so I am leaving it open.)

@vinothkumarr227
Contributor

@purnesh42H I'm looking into this. Please assign the issue to me.
