
feature request: distinct prometheus metrics for streamed vs non-streamed requests #8063

Open
MadDanWithABox opened this issue Mar 12, 2025 · 0 comments


Is your feature request related to a problem? Please describe.
At the moment, when hosting a model with tritonserver and the TensorRTLLM backend, exposed prometheus metrics can be used to calculate TTFT (time to first token) and TPS (tokens per second). However, there's no way to distinguish between streamed and non-streamed responses.

Streamed: The model processes the request in a continuous stream, producing tokens as it goes.

  • TTFT is low because the first token is emitted relatively quickly.
  • TPS is stable because tokens are produced at a relatively constant rate.
  • These requests have a minimal impact on overall average metrics because their behaviour is consistent.

Non-Streamed: The model processes the entire request before emitting any tokens.

  • TTFT is high because there's a significant delay before the first token.
  • TPS is zero during the processing phase and then spikes very high when the response is finally released.
  • These requests skew the average TTFT upwards, making it difficult to detect genuine performance issues. They also create bursty TPS, making it hard to set effective alert thresholds.
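The contrast between the two modes can be made concrete with a small timing sketch. The numbers are purely illustrative, not measurements from a real deployment:

```python
# Illustrative token-arrival timelines (seconds after request start).
# Streamed: tokens trickle out steadily; non-streamed: all tokens are
# released together once generation finishes.
streamed = [0.2, 0.4, 0.6, 0.8, 1.0]      # 5 tokens, steady pace
non_streamed = [2.0, 2.0, 2.0, 2.0, 2.0]  # 5 tokens, released at once

def ttft(arrivals):
    """Time to first token: timestamp of the first emitted token."""
    return arrivals[0]

def tps(arrivals):
    """Tokens per second over the whole response window."""
    return len(arrivals) / arrivals[-1]

print(ttft(streamed), tps(streamed))          # low TTFT, steady TPS
print(ttft(non_streamed), tps(non_streamed))  # high TTFT, bursty TPS
```

Averaged into one metric series, these two very different shapes become indistinguishable.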

This leads to a number of problems when configuring alerting rules.

  1. Non-streamed requests with their long initial processing times drastically inflate the average TTFT. This makes it difficult to set a single, meaningful threshold for alerts. If you set a threshold based on the combined average, you'll either:
  • Miss genuine performance issues with streamed requests (because the average is already high).
  • Get false positives from non-streamed requests (which are behaving as expected).
  2. Non-streamed requests produce zero TPS until the very end, then suddenly burst with a high TPS value. This makes it very difficult to use rate-based alerting.
  • A sudden spike in TPS may simply be the completion of a long non-streamed request, not a genuine performance issue.
  • Zero values also break rate calculations, as a jump from 0 to a large number produces a large spike in the computed rate.
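The skew in point 1 is easy to demonstrate with hypothetical TTFT samples (illustrative numbers only):

```python
# Hypothetical TTFT samples (seconds): streamed requests are fast and
# consistent, non-streamed ones pay the full generation time up front.
streamed_ttft = [0.2, 0.25, 0.3, 0.2, 0.25]
non_streamed_ttft = [4.0, 5.0, 6.0]

def mean(xs):
    return sum(xs) / len(xs)

print(mean(streamed_ttft))   # ~0.24 s: the healthy streamed baseline
combined = streamed_ttft + non_streamed_ttft
print(mean(combined))        # ~2.0 s: inflated by non-streamed requests
# An alert threshold tuned to the combined mean (~2 s) would never fire
# on a streamed-path regression from 0.24 s to, say, 1 s, even though
# that is a 4x slowdown for streaming clients.
```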

Describe the solution you'd like
The obvious answer is therefore to split how the metrics are collected, so that the nature of the request (streamed vs non-streamed) increments a different counter. The problem is that Tritonserver, whilst allowing custom metrics, doesn't provide an easy way to change the middleware to inspect the request as it passes through. If we want to do this, we'd need to write our own middleware implementation within the python backend, which in turn means maintaining our own version of tritonserver, with all the overhead associated with that.
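A minimal sketch of what the split could look like, using a hand-rolled labelled counter that emits Prometheus exposition format. The metric name and label are illustrative assumptions, and this is standalone code rather than Triton's custom-metrics API:

```python
from collections import defaultdict

class LabelledCounter:
    """Toy Prometheus-style counter keyed by a 'streamed' label."""
    def __init__(self, name, help_text):
        self.name, self.help_text = name, help_text
        self.values = defaultdict(float)

    def inc(self, streamed: bool, amount: float = 1.0):
        # One series per streaming mode: {streamed="true"} / {streamed="false"}.
        self.values[str(streamed).lower()] += amount

    def expose(self) -> str:
        # Render the counter in Prometheus text exposition format.
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        for label, value in sorted(self.values.items()):
            lines.append(f'{self.name}{{streamed="{label}"}} {value}')
        return "\n".join(lines)

# Each request path increments the series matching its mode, so alerting
# rules can target streamed and non-streamed traffic independently.
requests = LabelledCounter("nv_inference_request_total_by_mode",
                           "Inference requests, split by streaming mode.")
requests.inc(streamed=True)
requests.inc(streamed=True)
requests.inc(streamed=False)
print(requests.expose())
```

With the two series separated, a TTFT alert threshold can be set per mode instead of against a blended average.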

Describe alternatives you've considered
At the moment, we're working around this by sending streamed and non-streamed requests to different servers, with different alerting configured for each, but this is an undesirable overhead, as it means we are spending more on resource and not getting maximum utilisation out of our instances.
