
xdsclient: race around resource subscriptions and unsubscriptions #8125


Open

easwars opened this issue Feb 25, 2025 · 5 comments
Assignees: easwars
Labels: Area: xDS (includes everything xDS related, including LB policies used with xDS), Type: Bug

Comments

easwars (Contributor) commented Feb 25, 2025

We are seeing a race in the xDS client when an application repeatedly creates and shuts down multiple gRPC channels at the same time. One possible way to trigger the race is for an application to open multiple gRPC channels to the same target (at around the same time), make one RPC on each channel, shut the channels down, and repeat this process over and over.
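
A minimal repro sketch along these lines, assuming an xDS bootstrap configuration is in place (GRPC_XDS_BOOTSTRAP) and using the health service as a stand-in for "one RPC per channel"; the target name and iteration counts are made up:

    package main

    import (
        "context"
        "sync"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        healthpb "google.golang.org/grpc/health/grpc_health_v1"
        _ "google.golang.org/grpc/xds" // Registers the "xds" resolver and balancer.
    )

    func main() {
        const target = "xds:///foo" // Hypothetical target; all channels share it.
        for i := 0; i < 100; i++ {  // Repeat the create/RPC/close cycle.
            var wg sync.WaitGroup
            for j := 0; j < 2; j++ { // A couple of channels created at around the same time.
                wg.Add(1)
                go func() {
                    defer wg.Done()
                    cc, err := grpc.NewClient(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
                    if err != nil {
                        return
                    }
                    defer cc.Close() // Shutdown triggers the xDS resource unsubscriptions.
                    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
                    defer cancel()
                    // One RPC per channel; when the race hits, the second channel's
                    // RPC hangs until the LDS resource-not-found timer fires.
                    healthpb.NewHealthClient(cc).Check(ctx, &healthpb.HealthCheckRequest{})
                }()
            }
            wg.Wait()
        }
    }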

The sequence of events that trigger the race is as follows:

  • Channel 1 is created to target foo. It goes through the whole xDS flow successfully and is then shut down.
  • At around the same time, Channel 2 is created to the same target foo.
  • As part of Channel 1's shutdown, it starts unsubscribing from the xDS resources.
  • The xDS client function to unsubscribe schedules a callback on the serializer and returns early; the actual work of sending the ADS request to effect the unsubscription happens asynchronously. See:
    func (a *authority) unwatchResource(rType xdsresource.Type, resourceName string, watcher xdsresource.ResourceWatcher) func() {
  • When the scheduled callback runs, two things happen: the resource is deleted from the xDS client cache, and the unsubscription request is queued to be sent on the ADS stream.
  • Channel 2 starts up and gets a reference to the existing xDS client (when it attempts to create a new one). It registers a watch for the same LDS resource as channel 1.
  • The watch function in the xDS client sees that there is no such LDS resource in the cache (because it was deleted as part of the unsubscription by channel 1). So, it updates the cache and creates a placeholder for this resource.
  • The actual ADS request for the subscription will be sent out asynchronously through the ADS stream implementation.
  • By the time the ADS stream implementation gets around to writing the LDS request on the wire (both for the unsubscription by channel 1 and the subscription by channel 2), it sees that there is a single LDS resource to be requested, and it also sees a nonce and a version for that resource type: the same nonce and version that were received when the resource was requested on behalf of channel 1. Such a request is sent out, but the management server does not send a response back, because it has already responded to a request for the same resource name with the same version and nonce on the same ADS stream.

At this point channel 2 is stuck until the LDS resource-not-found timer fires after 15s, and from that point on, RPCs on the channel start to fail.

The fundamental reason behind this race is that the resource is deleted from the xDS client cache even before the unsubscription request is sent out on the ADS stream. It is also important to note that some resource state is maintained in the ADS stream implementation while some is maintained in the xDS client, and the two can go out of sync.
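
To make the window concrete, here is a simplified, self-contained sketch of this deferred-callback pattern; it is not the actual grpc-go code, and all type and field names in it are illustrative:

    package main

    import (
        "fmt"
        "time"
    )

    // serializer executes scheduled callbacks one at a time, in FIFO order, on
    // a single background goroutine.
    type serializer struct{ callbacks chan func() }

    func newSerializer() *serializer {
        s := &serializer{callbacks: make(chan func(), 64)}
        go func() {
            for cb := range s.callbacks {
                cb()
            }
        }()
        return s
    }

    func (s *serializer) Schedule(cb func()) { s.callbacks <- cb }

    // authority models per-authority state: a set of subscribed resources and
    // a serializer on which state mutations run.
    type authority struct {
        ser   *serializer
        cache map[string]bool // Subscribed resource names.
    }

    // unwatchResource returns as soon as the callback is queued; the cache
    // deletion (and, later, the ADS request that drops the resource) happens
    // asynchronously. A watch registered in this window finds the cache empty
    // and creates a placeholder entry.
    func (a *authority) unwatchResource(name string) {
        a.ser.Schedule(func() {
            delete(a.cache, name)
            fmt.Printf("queued ADS request without %q\n", name)
        })
    }

    func main() {
        a := &authority{ser: newSerializer(), cache: map[string]bool{"foo": true}}
        a.unwatchResource("foo") // Returns immediately; the real work runs later.
        fmt.Println("unwatchResource returned; deletion may not have run yet")
        time.Sleep(100 * time.Millisecond) // Let the serializer drain in this toy example.
    }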

@easwars easwars self-assigned this Feb 25, 2025
@easwars easwars added the Area: xDS Includes everything xDS related, including LB policies used with xDS. label Feb 25, 2025
easwars (Contributor, Author) commented Feb 25, 2025

@purnesh42H

purnesh42H (Contributor)

Full debugging logs from the repro in purnesh42H#16: https://paste.googleplex.com/6320159020810240

purnesh42H (Contributor) commented May 22, 2025

This repro demonstrates an issue where closing the first client (cc1) in a separate goroutine, after a full xDS lifecycle and a successful RPC, can lead to problems for a subsequent client (cc2).

Here's the breakdown:

  1. cc1.Close() is called in a goroutine. This triggers xDS resource unsubscription callbacks, which clean up related caches and queue unsubscription requests. The callback then finishes.
  2. Because cc1.Close() happens asynchronously, cc2 might have already added its subscription callbacks for the same xDS resources.
  3. When cc2's subscription callbacks trigger, they find the caches empty (due to cc1's cleanup) and attempt to queue new subscription requests.
  4. Crucially, the unsubscription requests from cc1 haven't completed yet. This means cc2's new subscription requests pick up the same versionInfo and nonce that cc1 had previously received.
  5. As a result, the xDS management server sees no change in versionInfo and nonce from cc2 and doesn't send any updates.
  6. The Unsubscribe call from cc1 doesn't clear this type-level versioning information in StreamImpl by default if the stream itself remains active.
  7. This causes the xDS resolver to keep waiting for the xDS resource until it finally times out after the watch expiry period.

The key problem here is cc2's subscription request using the same version and nonce that cc1 received.
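
As a rough illustration of what goes out on the wire (field values are made up; the go-control-plane DiscoveryRequest type is used only to show which fields are involved), cc2's LDS request ends up carrying cc1's version and nonce, so the server treats it as an ACK of state it already sent:

    package main

    import (
        "fmt"

        discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
    )

    func main() {
        // cc2's LDS request, built while the type-level state still holds the
        // version and nonce from cc1's last ACKed response (values made up).
        req := &discoveryv3.DiscoveryRequest{
            TypeUrl:       "type.googleapis.com/envoy.config.listener.v3.Listener",
            ResourceNames: []string{"foo"},
            VersionInfo:   "1", // Version already ACKed on this stream for LDS.
            ResponseNonce: "A", // Nonce from the last LDS response on the stream.
        }
        // Seeing identical version/nonce and resource names, the management
        // server interprets this as an ACK rather than new interest, and stays
        // silent; cc2 then waits until the resource-not-found timer fires.
        fmt.Println(req)
    }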

purnesh42H (Contributor) commented May 22, 2025

One way to solve this problem is to modify StreamImpl.Unsubscribe so that, if this was the last resource of this type being watched on this stream, the type-level version is cleared. The next request for this type will then be fresh, forcing the server to send the resource(s).

	if len(state.subscribedResources) == 0 {
		state.version = ""
	}

While the xDS protocol spec says that the xDS client should send the last seen version in DiscoveryRequests after the initial request (https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#ack-nack-and-resource-type-instance-version), if the StreamImpl is no longer tracking any active subscriptions for a particular resource type, isn't it logical for it to "forget" the last known version for that type on that stream? At this point we have already finished cleaning up all the caches. When a new subscription for that type comes in, it's effectively a fresh start for that type from the stream's perspective.
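
A self-contained sketch of this idea, with illustrative type and field names rather than the actual StreamImpl internals:

    package main

    import "fmt"

    // resourceTypeState is a minimal model of the per-type state kept by the
    // ADS stream implementation; the names here are assumptions, not grpc-go's
    // actual internals.
    type resourceTypeState struct {
        version             string
        nonce               string
        subscribedResources map[string]bool
    }

    // unsubscribe removes a resource and, per the proposal above, clears the
    // type-level version once nothing of that type remains subscribed, so the
    // next subscription request for the type goes out without a stale version.
    func (s *resourceTypeState) unsubscribe(name string) {
        delete(s.subscribedResources, name)
        if len(s.subscribedResources) == 0 {
            s.version = ""
        }
    }

    func main() {
        st := &resourceTypeState{
            version:             "1",
            nonce:               "A",
            subscribedResources: map[string]bool{"foo": true},
        }
        st.unsubscribe("foo")
        fmt.Printf("version after last unsubscribe: %q\n", st.version) // Prints "".
    }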

purnesh42H (Contributor)

The race happens because, even though the unsubscription callback from cc1 executes after the re-subscription from cc2, there is no guarantee that the correct sequence of messages (unsubscription, then re-subscription) is sent. If the re-subscription is sent without waiting for the unsubscription request to complete, the server will not send the resource because of the same version info from cc1.

So, a better solution would be to retain the cached resource until the unsubscription request has been fully processed by the server. This approach ensures two outcomes:

  • If cc2 re-subscribes before the unsubscription for cc1 is fully processed by the server, cc2 can utilize the still-present cached resource.
  • If cc2 re-subscribes after the server has processed cc1's unsubscription, the server will recognize cc2's request as a new interest (despite potentially identical version_info for the resource type) and will resend the resource.

The gRPC C++ implementation addressed a similar issue in PR #38698, where the cleanup of unwatched resources (cache eviction) is performed just before sending the request message that effectively unsubscribes the resource from the xDS server.

However, I believe an even more reliable approach for grpc-go is to perform this cleanup after receiving a response from the management server that acknowledges the state change—specifically, at the end of processing in handleResourceUpdate (for an update that implicitly confirms the unsubscription) or handleResourceDoesNotExist (which explicitly confirms the resource is no longer being sent).
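
A rough sketch of that deferred-cleanup idea, again with made-up names and a deliberately simplified cache, could look like this:

    package main

    import "fmt"

    // cacheEntry models a cached xDS resource; pendingDeletion marks a
    // resource whose unsubscription has been requested but not yet confirmed
    // by the management server. All names here are illustrative.
    type cacheEntry struct {
        resource        string
        pendingDeletion bool
    }

    type cache map[string]*cacheEntry

    // unwatch no longer deletes the entry; it only flags it, so a
    // re-subscription arriving before the server processes the unsubscription
    // still finds the cached resource.
    func (c cache) unwatch(name string) {
        if e, ok := c[name]; ok {
            e.pendingDeletion = true
        }
    }

    // handleResourceDoesNotExist models the server confirming that the
    // resource is no longer being sent; only now is the entry evicted. An
    // update in handleResourceUpdate that omits the resource would evict it
    // the same way.
    func (c cache) handleResourceDoesNotExist(name string) {
        if e, ok := c[name]; ok && e.pendingDeletion {
            delete(c, name)
        }
    }

    func main() {
        c := cache{"foo": {resource: "listener-config"}}
        c.unwatch("foo")
        fmt.Println("still cached after unwatch:", c["foo"] != nil) // true
        c.handleResourceDoesNotExist("foo")
        fmt.Println("cached after server confirms:", c["foo"] != nil) // false
    }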

I've implemented this in the same repro PR (purnesh42H@3240f37#diff-d83abbbb9296e17020a2bced581392dc6ca31c8100f8987bee1b96eb2cbc2402). This change has successfully fixed the test case that was failing due to this race condition. It's worth noting that this modification will alter the expectations for some existing tests, as we are no longer immediately deleting resources from the cache or closing xDS channels upon triggering an unwatch operation.

@easwars @dfawley
