Identify and strip invalid Unicode characters #35038
Thanks for reaching out! You'll likely receive a message from Lily in your ongoing support case, but I wanted to address this here as well for anyone else following along. Instead of using the V1 (JSON) endpoint, we recommend that you use the V2 (protobuf) endpoint. Vector supports the V2 endpoint, as you can see in the code here. We recommend enabling the V2 metrics endpoint by setting …
OK. I think the V2 endpoint would likely still have the same issue, though, of accepting garbage in and passing garbage through; it just defers the problem to a later point in the stack, which seems unwise. While we can work with customer teams to ensure they are sending 100% valid Unicode/UTF-8 data in all instances, it would be easiest if we could strip out this data in the datadog-agent. Please allow an option to do this.
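To make the request concrete, here is a minimal sketch of the kind of sanitization I have in mind; `sanitizeUTF8` is a hypothetical helper, not existing agent code:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeUTF8 is a hypothetical helper: it replaces any invalid
// UTF-8 sequences in a metric name or tag value with the Unicode
// replacement character, which is what encoding/json.Marshal
// would do on output anyway.
func sanitizeUTF8(s string) string {
	if utf8.ValidString(s) {
		return s // fast path: most inputs are already valid
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	tag := "team:bad\xffvalue" // contains a lone 0xFF byte
	fmt.Printf("%q -> %q\n", tag, sanitizeUTF8(tag))
}
```

Applying something like this to names and tags at intake would make the agent's output well-formed regardless of which serializer sits behind it.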
The Agent's current approach is to send all received data to our backend, where further processing and validation occur. Modifying the Datadog Agent to strip out such data is not currently prioritized. However, we can log a feature request for future consideration. Have you tested the V2 endpoint, and if so, are you still encountering issues? If the issue persists, that information would help us better assess the priority of this request. If you can share the raw dogstatsd packet that was sent to the Agent and led to the failure, that would help us reproduce the issue. We attempted to reproduce it by sending tags with the …
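For reference, a repro along the lines described might look like the following sketch. The metric name and tag are invented for illustration, the port is the default dogstatsd port, and the invalid data is a lone 0xFF byte embedded in a tag value:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Send one dogstatsd gauge over UDP to a locally running agent.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// The tag value embeds a lone 0xFF byte, which is not valid UTF-8.
	packet := []byte("example.metric:1|g|#team:bad\xffvalue")
	if _, err := conn.Write(packet); err != nil {
		panic(err)
	}
	fmt.Printf("sent %q\n", packet)
}
```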
Our team maintains a metrics pipeline for our company. Teams send metric data to us and we process it using the datadog-agent.
Recently, we replaced another component of our metric pipeline with the datadog-agent, and shortly after this we suffered a metric data outage. It turned out that a customer team was sending metric data that included invalid Unicode code points. The agent forwarded this data to a Vector instance using the /api/v1/series endpoint. The Vector instance uses serde-json in strict mode to parse incoming data: https://github.com/vectordotdev/vector/blob/master/src/sources/datadog_agent/metrics.rs#L416-L421
This caused all of the metric data in the payload to be rejected; we would retry blindly and eventually drop metrics.
I was sort of curious how this could happen, since the standard library's encoding/json.Marshal replaces invalid Unicode code points with \ufffd, the Unicode replacement character.
It turns out that the datadog-agent uses github.com/json-iterator/go to encode JSON in the fast path. Despite a promise of "100% compatibility with standard lib," this library does not handle invalid Unicode code points the same way as the standard library. There is an open ticket from 2019 pointing out this behavior divergence: json-iterator/go#323. It's also easy to see by comparing the json-iterator method to the comparable standard library function. (I wrote some of the JSON escaping code in the Go standard library.)
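A small side-by-side sketch of the divergence (the exact bytes json-iterator emits are best checked against the linked issue; the point is that the two encoders disagree on invalid input):

```go
package main

import (
	"encoding/json"
	"fmt"

	jsoniter "github.com/json-iterator/go"
)

func main() {
	// A string containing a lone 0xFF byte, which is not valid UTF-8.
	s := "bad\xffvalue"

	std, _ := json.Marshal(s)
	// encoding/json coerces invalid bytes to \ufffd, per its docs.
	fmt.Printf("encoding/json: %q\n", std)

	fast, _ := jsoniter.ConfigCompatibleWithStandardLibrary.Marshal(s)
	// Per json-iterator/go#323, the invalid byte is not replaced.
	fmt.Printf("json-iterator: %q\n", fast)
}
```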
While we can work with customer teams to ensure they are sending 100% valid Unicode/UTF-8 data in all instances, it would be easiest if we could strip out this data in the datadog-agent. Please either allow an option to do this, or upgrade the JSON encoding library you use to one that is (1) maintained and (2) compatible with the standard library.