
Retry Attempts are not available in client-stats #8299


Open
davinci26 opened this issue May 7, 2025 · 12 comments · May be fixed by #8342
Labels
Area: Observability · P1 · Type: Bug

Comments

@davinci26

Clients can configure non-transparent retries either via xDS or via ServiceConfig; however, the stats handler has no good way to determine whether a particular RPC attempt is a retry or not.

This type of observability is quite useful for determining whether retry storms (due to poor configuration) are causing incidents and/or whether retries are actually effective.

I thought of looking at the retry metadata header "grpc-previous-rpc-attempts", but that one is added by the http2_client as a header after the HandleRPC method is called, so it didn't yield any results.
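For reference, a rough sketch of what I tried (the handler type and package name are illustrative, not anything in grpc-go):

package retrystats

import (
	"context"
	"log"

	"google.golang.org/grpc/stats"
)

// retryAwareHandler is an illustrative custom stats.Handler that tries to
// read the retry header from the outgoing request headers.
type retryAwareHandler struct{}

func (retryAwareHandler) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context   { return ctx }
func (retryAwareHandler) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context { return ctx }
func (retryAwareHandler) HandleConn(context.Context, stats.ConnStats)                       {}

func (retryAwareHandler) HandleRPC(ctx context.Context, s stats.RPCStats) {
	if out, ok := s.(*stats.OutHeader); ok {
		// This stays empty in practice: the transport appends
		// "grpc-previous-rpc-attempts" to the wire headers only after
		// HandleRPC has already been called for OutHeader.
		if vals := out.Header.Get("grpc-previous-rpc-attempts"); len(vals) > 0 {
			log.Printf("attempt header %q for %s", vals[0], out.FullMethod)
		}
	}
}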

My first question is: Is this actually true or is there a way to get o11y on the non-transparent retries from the client?

My second question is: if it is not available, numRetries is known at the point where the stats.Begin struct is constructed, so if you are open to extending stats.Begin, we could add it there.

Looking forward to your thoughts, and I'm also happy to implement the feature if it is not available.

davinci26 added the Type: Feature label May 7, 2025
arjan-bal added the Area: Observability label May 8, 2025
@purnesh42H
Contributor

purnesh42H commented May 8, 2025

My first question is: Is this actually true or is there a way to get o11y on the non-transparent retries from the client?

We don't have retry metrics yet. There is a proposal in progress: https://github.com/grpc/proposal/pull/488/files. A metric grpc.client.call.retries is proposed which has a label grpc.retry_type that can take one of three values: "retry", "hedge", or "transparent". Please take a look at the proposal to see if it has everything you need. Tagging @yashykt in case you have more questions regarding the proposal.

In tracing, though, we have a way to identify whether an RPC attempt was a transparent retry or not. When the span is created, an attribute with key transparent-retry is set to a boolean value indicating whether the stream is undergoing a transparent retry. Tracing is still experimental. You can enable tracing through dial options: https://github.com/grpc/grpc-go/blob/master/examples/features/opentelemetry/client/main.go#L65

@davinci26
Author

In tracing, though, we have a way to identify whether an RPC attempt was a transparent retry or not.

Seems reasonable; I have to check our implementation of tracing to see whether we include this or not, but thanks for the pointer.

We don't have retry metrics yet. There is a proposal in progress https://github.com/grpc/proposal/pull/488/files. A metric grpc.client.call.retries is proposed which has a label grpc.retry_type that can take one of three values: "retry", "hedge", or "transparent".

Cool proposal, you guys are one step ahead :D. Everything in the proposal makes sense.

I haven't thought this through fully, so take it with a grain of salt, but given the proposal above, and the fact that the OTel implementation in grpc-go is a wrapper around StatsHandler, would it make sense to extend stats.Begin with:

  1. numRetries
  2. Whether an RPC is a hedge or not

That way consumers who have an implementation of the stats handler can implement similar metrics as well.
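Concretely, something along these lines; the two added fields are purely hypothetical and the names are just a suggestion:

// In package stats; existing fields abbreviated, hypothetical additions marked.
type Begin struct {
	Client                    bool
	BeginTime                 time.Time
	FailFast                  bool
	IsTransparentRetryAttempt bool
	// ... other existing fields unchanged ...

	// NumRetries (proposed, not in grpc-go today): number of previous
	// non-transparent attempts for this RPC, i.e. clientStream.numRetries.
	NumRetries int
	// IsHedgingAttempt (proposed, not in grpc-go today): whether this
	// attempt was issued by a hedging policy, if/when hedging lands.
	IsHedgingAttempt bool
}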

@purnesh42H
Contributor

@davinci26 The OpenTelemetry (OTel) plugin provides trace attributes for RPC attempts that help in tracking retries. As you can see in https://github.com/grpc/grpc-go/blob/master/stats/opentelemetry/trace.go#L50-L51 (in the populateSpan function for *stats.Begin), these include:

  • previous-rpc-attempts: This attribute indeed indicates the number of prior attempts for the RPC. The OTel stats.Handler currently calculates this value internally (using ai.previousRPCAttempts).
  • transparent-retry: This boolean flag is directly sourced from stats.Begin.IsTransparentRetryAttempt and clearly signals if the attempt is a gRPC-handled transparent retry.

And yes, for retry-specific "metrics", there's a proposal in progress.

stats.Handler is experimental and for common observability requirements, it's not recommended as the primary solution for most users. The preferred path is to identify what's missing in the official plugins (like the OTel or OpenCensus ones) and work to add those features if they benefit a broad range of users. For users who need very specific, custom metrics that aren't covered by standard tooling, the stats.Handler interface is precisely the mechanism gRPC-Go provides for such advanced customization. Information like numRetries and the type of retry (e.g., standard vs. hedged) are common requirements. The goal is to make such essential data easily accessible, ideally out-of-the-box via the OTel plugin.
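For illustration, recording those two attributes with the OpenTelemetry Go API looks roughly like this (a sketch, not the actual plugin code; recordRetryAttributes is an illustrative helper operating on an already-started attempt span):

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// recordRetryAttributes mirrors the two attempt-span attributes described
// in gRFC A72: the number of prior attempts and the transparent-retry flag.
func recordRetryAttributes(span trace.Span, previousAttempts uint32, transparent bool) {
	span.SetAttributes(
		attribute.Int64("previous-rpc-attempts", int64(previousAttempts)),
		attribute.Bool("transparent-retry", transparent),
	)
}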

@davinci26
Author

previous-rpc-attempts: This attribute indeed indicates the number of prior attempts for the RPC. The OTel stats.Handler currently calculates this value internally (using ai.previousRPCAttempts).

Does this include service-level retries? I might be missing something super obvious here, but I see that:

ai := &attemptInfo{
    ....
}

is added to the ctx when the RPC is tagged, so retries would have a different ctx. Basically, I can't understand how

atomic.AddUint32(&ai.previousRPCAttempts, 1)
can be called twice for the same RPC for non-transparent retries.

stats.Handler is experimental and for common observability requirements, it's not recommended as the primary solution for most users.

100% agree, stats.Handler is a pretty low-level interface that only makes sense for a limited subset of folks, and 99% of users would be better served by the standard OTel plugin. But I am not sure whether we agree on the next bit: should we add numRetries to the stats.Begin struct? I can see this being helpful both for the OTel plugin, since it would simplify the code, and for the few of us who work with stats.Handler directly.

@purnesh42H
Contributor

purnesh42H commented May 16, 2025

@davinci26 you are right about the problem with attemptInfo. Each RPC attempt gets a new attemptInfo, so it is wrong to maintain the counter of previous RPC attempts inside it. This is a bug. I think the right way to do this is to funnel clientStream.numRetries all the way to HandleRPC and record the value in "previous-rpc-attempts". clientStream.numRetries exclusively counts non-transparent retries.

Once we do that, we don't have to make it separately available in stats.Begin. Although I guess it's fine even if we add numRetries to stats.Begin as an exported field? @dfawley do you see any issue with exposing numRetries in stats.Begin?

purnesh42H added the Type: Bug and P1 labels and removed the Type: Feature label May 16, 2025
@dfawley
Member

dfawley commented May 16, 2025

Retry attempts can be counted by storing data in the context in a per-call interceptor, and incrementing it in TagRPC.

@purnesh42H
Contributor

Retry attempts can be counted by storing data in the context in a per-call interceptor, and incrementing it in TagRPC.

What is the issue with funneling cs.numRetries? We also need to make sure to exclude transparent retries: https://github.com/grpc/proposal/blob/master/A72-open-telemetry-tracing.md#tracing-information. clientStream.numRetries already has what we want.

@dfawley
Member

dfawley commented May 16, 2025

Sorry, I missed the bit about excluding transparent retries. You can wait to increment until stats.Begin happens, then, and check that struct's IsTransparentRetryAttempt field.

What is the issue with funneling cs.numRetries?

Generally, adding things requires proof that it is either necessary or sufficiently convenient. So the question should be the other way around: what is the issue with doing it the way that is already possible?

Also... we don't support hedging yet, but when/if we do, those attempts likely wouldn't be included in clientStream.numRetries, and it would be wrong not to count hedging attempts in our tracing instrumentation.
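Putting the two suggestions above together, a minimal sketch of that approach (all names are illustrative, not existing grpc-go API; error handling and a matching stream interceptor for streaming calls are omitted):

package retrycount

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/stats"
)

type ctxKey struct{}

// attemptCounter counts non-transparent attempts seen for a single call.
type attemptCounter struct {
	n atomic.Uint32
}

// UnaryCallInterceptor runs once per call (not per attempt) and plants a
// counter in the context that every attempt of that call will inherit.
func UnaryCallInterceptor(ctx context.Context, method string, req, reply any,
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	ctx = context.WithValue(ctx, ctxKey{}, &attemptCounter{})
	return invoker(ctx, method, req, reply, cc, opts...)
}

// Handler increments the counter once per non-transparent attempt.
type Handler struct{}

func (Handler) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context   { return ctx }
func (Handler) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context { return ctx }
func (Handler) HandleConn(context.Context, stats.ConnStats)                       {}

func (Handler) HandleRPC(ctx context.Context, s stats.RPCStats) {
	b, ok := s.(*stats.Begin)
	if !ok || !b.Client {
		return
	}
	c, ok := ctx.Value(ctxKey{}).(*attemptCounter)
	if !ok {
		return // the call did not go through the interceptor
	}
	if b.IsTransparentRetryAttempt {
		return // per A72, transparent retries are excluded from the count
	}
	// Attempts counted before this one == number of non-transparent retries.
	previousAttempts := c.n.Add(1) - 1
	_ = previousAttempts // record as a metric or span attribute here
}

Wired up with grpc.WithChainUnaryInterceptor(UnaryCallInterceptor) and grpc.WithStatsHandler(Handler{}), HandleRPC then sees every attempt of a call with the same counter in its context.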

@davinci26
Author

davinci26 commented May 16, 2025

Generally, adding things requires proof that it is either necessary or sufficiently convenient. So the question should be the other way around: what is the issue with doing it the way that is already possible?

Let me code up the suggestion above, so take this statement with 75% confidence, but where my head is at is that this is fairly complicated/non-ergonomic. I appreciate the aversion to extending the API surface, but "retry attempts can be counted by storing data in the context in a per-call interceptor, and incrementing it in TagRPC", plus "wait to increment when stats.Begin happens and check that struct's IsTransparentRetryAttempt field", seems fairly easy to get wrong for something that is pretty useful o11y for service owners. Am I retrying too much?

Edit: I think where I am coming from is that I already got this wrong once and couldn't think of a way to count the retries, hence my question; and then we realized that even the grpc-go implementation didn't get this 100% right, so maybe we can do better and make it easier for folks to implement this.

@dfawley
Member

dfawley commented May 16, 2025

The stats package's API is mainly intended for internal usage or for power users. Optimal usability is not really a goal (it already has a ton of rough edges).

Our goal is that 99% of users should be able to use the OpenCensus or OpenTelemetry plugins, instead, and that those should be easy to configure and use.

@davinci26
Author

Makes sense, @dfawley, thanks for the feedback and the pointer!

I still have my reservations:

  • Recommending this context-based approach to power users is going to create an implicit API for grpc-go that is much harder to test and modify on the OSS side. But I think this tradeoff is yours to decide; I am happy to support either outcome.
  • I think it would simplify the implementation of both the OpenCensus/OpenTelemetry plugins and the handlers of power users.

I think the main advantage on my side is that this method is available right now with the current release, so I ended up implementing it. It works well so far.

(The issue is closed on my side, but I think it is being used to track the bug in the OTel plugin, so I am leaving it open.)

@vinothkumarr227
Contributor

@purnesh42H I'm looking into this. Please assign the issue to me.
