Is your feature request related to a problem? Please describe.
At the moment, when hosting a model with tritonserver and the TensorRTLLM backend, exposed prometheus metrics can be used to calculate TTFT (time to first token) and TPS (tokens per second). However, there's no way to distinguish between streamed and non-streamed responses.
Streamed: The model processes the request in a continuous stream, producing tokens as it goes.
TTFT is low because the first token is emitted relatively quickly.
TPS is stable because tokens are produced at a relatively constant rate.
These requests have a minimal impact on overall average metrics because their behaviour is consistent.
Non-Streamed: The model processes the entire request before emitting any tokens.
TTFT is high because there's a significant delay before the first token.
TPS is zero during the processing phase and then spikes very high when the response is finally released.
These requests skew the average TTFT upwards, making it difficult to detect genuine performance issues. They also create bursty TPS, making it hard to set effective alert thresholds.
This leads to a number of problems when configuring alerting rules.
Non-streamed requests with their long initial processing times drastically inflate the average TTFT. This makes it difficult to set a single, meaningful threshold for alerts. If you set a threshold based on the combined average, you'll either:
Miss genuine performance issues with streamed requests (because the average is already high).
Get false positives from non-streamed requests (which are behaving as expected).
Non-streamed requests produce zero TPS until the very end, then suddenly burst with a high TPS value.
This makes it very difficult to use rate-based alerting.
A sudden spike in TPS may simply be the completion of a long non-streamed request, not a genuine performance issue.
Zero values will also cause issues with rate calculations, as a change from 0 to a large number will cause a large spike in the rate.
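The skew described above is easy to demonstrate with a few hypothetical TTFT samples (the numbers below are illustrative, not real measurements): a handful of slow non-streamed requests drags the combined average far above the streamed population, so a threshold set on the combined figure can never catch a streamed-only regression.

```python
# Hypothetical TTFT samples in seconds, for illustration only.
# Streamed requests emit a first token quickly; non-streamed requests
# only "emit" once the whole response has been generated.
streamed_ttft = [0.05, 0.06, 0.05, 0.07]   # healthy streaming latencies
non_streamed_ttft = [4.8, 5.2]             # full-response latencies

combined = streamed_ttft + non_streamed_ttft

avg_streamed = sum(streamed_ttft) / len(streamed_ttft)
avg_combined = sum(combined) / len(combined)

print(f"streamed-only avg TTFT: {avg_streamed:.4f}s")  # ~0.0575s
print(f"combined avg TTFT:      {avg_combined:.4f}s")  # ~1.705s
# An alert threshold based on the combined average (~1.7s) would never
# fire if streamed TTFT regressed from 0.05s to, say, 0.5s.
```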
Describe the solution you'd like
The obvious answer is therefore to split how the metrics are collected, so that the nature of the request (streamed vs non-streamed) increments a different counter. The problem is that Tritonserver, whilst allowing for custom metrics, doesn't let us easily change the middleware to inspect the request as it passes through. To do this ourselves we'd need to write our own middleware implementation within the python backend, which in turn means maintaining our own version of tritonserver, with all the overhead that entails.
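A minimal sketch of the labelling scheme we'd like the backend to support natively is below. The counter names, the `is_streamed` flag, and the plain-dict counters are all hypothetical stand-ins: a real implementation inside the Triton python backend would register proper metric families instead, but the point is simply that one counter pair per request mode makes the two populations separable in Prometheus (e.g. `ttft_sum{mode="streamed"}` vs `ttft_sum{mode="non_streamed"}`).

```python
from collections import defaultdict

# Hypothetical per-mode counter pairs; stand-ins for real Prometheus
# counters labelled by request mode.
ttft_sum = defaultdict(float)   # mode -> cumulative TTFT (seconds)
ttft_count = defaultdict(int)   # mode -> number of requests observed

def record_ttft(is_streamed: bool, first_token_latency: float) -> None:
    """Increment the counter pair for this request's mode, so that
    streamed and non-streamed TTFT can be averaged and alerted on
    separately."""
    mode = "streamed" if is_streamed else "non_streamed"
    ttft_sum[mode] += first_token_latency
    ttft_count[mode] += 1

# Recording a mixed workload keeps the two populations separate:
record_ttft(True, 0.05)
record_ttft(True, 0.07)
record_ttft(False, 5.0)

print(ttft_sum["streamed"] / ttft_count["streamed"])          # 0.06
print(ttft_sum["non_streamed"] / ttft_count["non_streamed"])  # 5.0
```

With this split, each mode gets its own alert threshold, and the non-streamed bursts no longer pollute the streamed averages.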
Describe alternatives you've considered
At the moment, we're working around this by sending streamed and non-streamed requests to different servers, with different alerting configured for each, but this is an undesirable overhead: it means we are spending more on resource and not getting maximum utilisation out of our instances.