
Add OpenAI Priority Load Balancer for Azure OpenAI #1626

Open · wants to merge 15 commits into main
Conversation

@simonkurtz-MSFT (Contributor) commented May 17, 2024

This PR introduces the openai-priority-loadbalancer as a native Python option to target one or more Azure OpenAI endpoints. Among the features of the load-balancer are:

  • Minimally necessary code and configuration to add abstracted load-balancing to the OpenAI Python API Library via a custom httpx client.
  • Priority-based load-balancing to address scenarios such as Provisioned Throughput Unit (PTU) over Consumption prioritization.
  • Respects Retry-After headers returned from Azure OpenAI to trigger a temporary open circuit for that endpoint.
  • Random distribution of Azure OpenAI requests across any available backends (non-429 && non-5xx status).
  • Automatic retries of failing requests across remaining available backends.
  • Returns a 429 status to the OpenAI Python API Library once all backends are exhausted. The returned Retry-After header carries the lowest (soonest) value across all backends, so the library's own retry is likely to succeed as soon as possible.
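The selection and circuit-breaking behavior in the list above can be sketched roughly as follows. This is an illustrative, self-contained sketch, not the package's actual implementation; all class, function, and host names here are made up:

```python
import random
import time
from dataclasses import dataclass


@dataclass
class Backend:
    """One Azure OpenAI endpoint with a priority (1 = highest, e.g. PTU)."""
    host: str
    priority: int
    circuit_open_until: float = 0.0  # set from a 429's Retry-After header


def pick_backend(backends, now=None):
    """Randomly pick among available backends of the best (lowest) priority."""
    now = time.time() if now is None else now
    available = [b for b in backends if now >= b.circuit_open_until]
    if not available:
        # All circuits open: surface a 429 with the soonest Retry-After value.
        soonest = min(b.circuit_open_until for b in backends) - now
        raise RuntimeError(f"429: retry after {soonest:.0f}s")
    best = min(b.priority for b in available)
    return random.choice([b for b in available if b.priority == best])


def mark_throttled(backend, retry_after_seconds, now=None):
    """Open a temporary circuit for a backend that returned 429 or 5xx."""
    now = time.time() if now is None else now
    backend.circuit_open_until = now + retry_after_seconds
```

With a PTU backend at priority 1 and consumption backends at priority 2, requests go to the PTU endpoint until it is throttled, then fall back to the consumption backends until its circuit closes again.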

Relevant links:


This PR can be merged after @pamelafox's approval.

@simonkurtz-MSFT (Contributor, Author)

Hi @pamelafox & @kristapratico,

This is how the OpenAI Priority Load Balancer integrates. Never mind the hard-coded backend and the location of the backends list in this PR; I don't intend to ask for a merge, but this was the best way to give you an idea of the setup.

If you have two AOAI instances with the same model, you can plug them both in and should see load-balancing.

@simonkurtz-MSFT (Contributor, Author)

I brought up two AOAI instances and related assets and configured both instances as backends in app.py. Then I started to have a conversation.

[screenshot]

[screenshot]

Both backends are responding. It's important to note that this is not a uniform distribution, because the set of available backends is randomized (randomization is necessary for multi-process workloads).

[screenshot]

At no point did the conversation break down or show any kind of error through the chat bot.
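The non-uniform distribution mentioned above can be seen in a quick simulation. This is an illustrative sketch, not the package's code, and the backend names are made up:

```python
import random
from collections import Counter


def simulate(backends, requests, seed=0):
    """Each request independently picks a random available backend, so the
    per-request pattern is random rather than round-robin; counts even out
    only in aggregate."""
    rng = random.Random(seed)
    return Counter(rng.choice(backends) for _ in range(requests))


counts = simulate(["aoai-eastus", "aoai-westus"], 1000)
# Both backends serve traffic, roughly evenly across many requests.
```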

@pamelafox (Collaborator)

Cool! I made a few changes to the PR to make it a little easier to test out, by actually making the additional backend deployment, mind if I push them to the branch?

I think we should mention this option in the Productionizing guide, and if there are multiple customers wanting to use this approach, we could consider integrating it into main as an option.

@pamelafox (Collaborator)

Here are what my usage graphs look like during a load test btw:

[screenshots: usage graphs, 2024-06-02]

@simonkurtz-MSFT (Contributor, Author)

> Cool! I made a few changes to the PR to make it a little easier to test out, by actually making the additional backend deployment, mind if I push them to the branch?
>
> I think we should mention this option in the Productionizing guide, and if there are multiple customers wanting to use this approach, we could consider integrating it into main as an option.

Hi Pamela, please do push! I very much welcome your expertise and improvements. If there are aspects of the 1.0.9 package itself that should/need to be improved, I'm all ears there, too, of course.

Thank you so much! I know this is extraordinary time spent.

@simonkurtz-MSFT (Contributor, Author)

> Here are what my usage graphs look like during a load test btw:

Help me understand your test results, please. Are you hitting different backends or just different models?

@simonkurtz-MSFT simonkurtz-MSFT marked this pull request as ready for review June 3, 2024 15:10
@simonkurtz-MSFT (Contributor, Author) left a comment

@pamelafox, LGTM

@pamelafox (Collaborator)

@simonkurtz-MSFT Those graphs were for two different OpenAI instances in the same region.

@pamelafox (Collaborator)

@simonkurtz-MSFT Could you send a separate PR adding a mention of this approach to https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/docs/productionizing.md#openai-capacity with a link to this PR? You could contrast when someone might opt for this over ACA/APIM (presumably cost/complexity).

@simonkurtz-MSFT mentioned this pull request Jun 3, 2024
@simonkurtz-MSFT (Contributor, Author)

Hi @pamelafox, could I trouble you for another review of this PR, please? Thank you very much for all your help!

scope: openAiResourceGroup
params: {
name: '${abbrs.cognitiveServicesAccounts}${resourceToken}-b2'
location: openAiResourceGroupLocation
Collaborator (review comment on the snippet above)

@simonkurtz-MSFT Do your customers typically deploy backends in multiple regions or same region? @mattgotteiner is wondering if the location should be a second location.

@simonkurtz-MSFT (Contributor, Author) commented Jun 12, 2024

@pamelafox & @mattgotteiner, that's a very important question. My customers almost exclusively deploy to multiple regions, so being able to define a second region would be helpful. If it's not defined, we could fall back to setting the second region to the value of the first.
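The fallback I'm describing could be as simple as the following. This is a sketch in Python for clarity rather than Bicep, and the parameter names are hypothetical:

```python
def resolve_openai_locations(primary_location, secondary_location=None):
    """Return (primary, secondary) regions for the two AOAI backends.
    If no second region is defined, fall back to the primary region."""
    return primary_location, secondary_location or primary_location
```

For example, `resolve_openai_locations("eastus")` yields a same-region pair, while `resolve_openai_locations("eastus", "westus")` spreads the backends across two regions.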
