Identify and strip invalid Unicode characters #35038
Thanks for reaching out! You'll likely receive a message from Lily in your ongoing support case, but I wanted to address this here as well for anyone else following along. Instead of using the V1 (JSON) endpoint, we recommend that you use the V2 (protobuf) endpoint. Vector supports the V2 endpoint, as you can see in the code here. We recommend enabling the V2 metrics endpoint by setting …
OK. I think the V2 endpoint would likely still have the same issue, though, of accepting garbage in and passing garbage through; it just defers the problem to a later point in the stack, which seems unwise. While we can work with customer teams to ensure they are sending 100% valid Unicode/UTF-8 data in all instances, it would be easiest if we could strip out this data in the datadog-agent. Please allow an option to do this.
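To make the request concrete, here is a minimal sketch of the kind of sanitization I have in mind; `sanitizeUTF8` is a hypothetical helper, not existing agent code:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeUTF8 is a hypothetical helper: it replaces any invalid
// UTF-8 sequences in a metric name or tag value with the Unicode
// replacement character, which is what encoding/json.Marshal
// would do on output anyway.
func sanitizeUTF8(s string) string {
	if utf8.ValidString(s) {
		return s // fast path: most inputs are already valid
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	tag := "team:bad\xffvalue" // contains a lone 0xFF byte
	fmt.Printf("%q -> %q\n", tag, sanitizeUTF8(tag))
}
```

Applying something like this to names and tags at intake would make the agent's output well-formed regardless of which serializer sits behind it.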
The Agent's current approach is to send all received data to our backend, where further processing and validation occur. Modifying the Datadog Agent to strip out such data is not currently prioritized. However, we can log a feature request for future consideration. Have you tested the V2 endpoint, and if so, are you still encountering issues? If the issue persists, that information would help us better assess the priority of this request. If you can share the raw dogstatsd packet that was sent to the Agent and led to the failure, that would help us reproduce the issue. We attempted to reproduce it by sending tags with the …
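For reference, a repro along the lines described might look like the following sketch. The metric name and tag are invented for illustration, the port is the default dogstatsd port, and the invalid data is a lone 0xFF byte embedded in a tag value:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Send one dogstatsd gauge over UDP to a locally running agent.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// The tag value embeds a lone 0xFF byte, which is not valid UTF-8.
	packet := []byte("example.metric:1|g|#team:bad\xffvalue")
	if _, err := conn.Write(packet); err != nil {
		panic(err)
	}
	fmt.Printf("sent %q\n", packet)
}
```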
Our team maintains a metrics pipeline for our company. Teams send metric data to us and we process it using the datadog-agent.
Recently, we replaced another component of our metric pipeline with the datadog-agent, and shortly after this we suffered a metric data outage. It turned out that a customer team was sending metric data that included invalid Unicode code points. The agent forwarded this data to a Vector instance using the /api/v1/series endpoint. The Vector instance uses serde-json in strict mode to parse incoming data: https://github.com/vectordotdev/vector/blob/master/src/sources/datadog_agent/metrics.rs#L416-L421
This caused all of the metric data in the payload to be rejected; we would retry blindly and eventually drop metrics.
I was sort of curious how this could happen, since the standard library's encoding/json.Marshal replaces invalid Unicode code points with \ufffd, the Unicode replacement character.
It turns out that the datadog-agent uses github.com/json-iterator/go to encode JSON in the fast path. Despite a promise of "100% compatibility with standard lib," this library does not handle invalid Unicode code points the same way as the standard library. There is an open ticket from 2019 pointing out this behavior divergence: json-iterator/go#323. It's also easy to see by comparing the json-iterator method to the comparable standard library function. (I wrote some of the JSON escaping code in the Go standard library.)
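A small side-by-side sketch of the divergence (the exact bytes json-iterator emits are best checked against the linked issue; the point is that the two encoders disagree on invalid input):

```go
package main

import (
	"encoding/json"
	"fmt"

	jsoniter "github.com/json-iterator/go"
)

func main() {
	// A string containing a lone 0xFF byte, which is not valid UTF-8.
	s := "bad\xffvalue"

	std, _ := json.Marshal(s)
	// encoding/json coerces invalid bytes to \ufffd, per its docs.
	fmt.Printf("encoding/json: %q\n", std)

	fast, _ := jsoniter.ConfigCompatibleWithStandardLibrary.Marshal(s)
	// Per json-iterator/go#323, the invalid byte is not replaced.
	fmt.Printf("json-iterator: %q\n", fast)
}
```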
While we can work with customer teams to ensure they are sending 100% valid Unicode/UTF-8 data in all instances, it would be easiest if we could strip out this data in the datadog-agent. Please either allow an option to do this, or upgrade the JSON encoding library you use to one that is (1) maintained and (2) compatible with the standard library.