Fine-tuned inference performance guaranteed!
You've fine-tuned your models to make your agents behave and speak how you'd like. You've scaled up your RAG application to meet customer demand. You've now got a good problem: users love the service but want it snappier and more responsive.
Azure OpenAI Service now offers provisioned deployments for fine-tuned models, giving your applications predictable performance with predictable costs!
What is Provisioned Throughput?
If you're unfamiliar with Provisioned Throughput, it lets Azure OpenAI Service customers purchase capacity in terms of performance needs instead of per-token. For fine-tuned deployments, it replaces both the hosting fee and the token-based billing of Standard and Global Standard (now in Public Preview) with a throughput-based capacity unit called the provisioned throughput unit (PTU).
Every PTU corresponds to a commitment of both latency and throughput in Tokens per Minute (TPM). This differs from Standard and Global Standard, which only provide availability guarantees and best-effort performance.
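To make the sizing concrete, here is a minimal sketch of the PTU math described above. The per-PTU throughput, minimum deployment size, and increment used below are illustrative assumptions, not published rates; check the Azure OpenAI capacity calculator for the real figures for your model.

```python
import math

def estimate_ptus(peak_tpm: int,
                  tpm_per_ptu: int = 2_500,
                  min_ptus: int = 25,
                  increment: int = 5) -> int:
    """Estimate the PTUs needed to cover a peak workload.

    tpm_per_ptu, min_ptus, and increment are ILLUSTRATIVE placeholders;
    actual per-PTU throughput and deployment minimums vary by model and
    are published in the Azure OpenAI capacity calculator.
    """
    raw = math.ceil(peak_tpm / tpm_per_ptu)          # PTUs before rounding
    stepped = math.ceil(raw / increment) * increment  # round up to increment
    return max(min_ptus, stepped)                     # enforce deployment minimum
```

With these assumed rates, a workload peaking at 100,000 TPM would round up to 40 PTUs, while a small 10,000 TPM workload would still be held to the assumed 25-PTU minimum.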
Is this the same PTU I'm already using?
You might already be using Provisioned Throughput Units with base models; with fine-tuned models they work exactly the same way. In fact, they're completely interchangeable!
Already have 800 PTUs of quota in North Central US and an annual Azure reservation? Because PTUs are interchangeable and model independent, you can start using them for fine-tuning immediately without any additional steps. Just select Provisioned Managed (Public Preview) from the model deployment dialog and set your PTU allotment.
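If you prefer scripting the deployment over the portal dialog, the control-plane request body looks roughly like the sketch below. The field names follow the Cognitive Services deployments API, but treat this as an assumption: the fine-tuned model name shown is hypothetical, and the exact SKU name and API version should be confirmed against the current Azure documentation.

```python
def provisioned_deployment_body(fine_tuned_model: str,
                                model_version: str,
                                ptus: int) -> dict:
    """Build an ARM-style request body for a provisioned fine-tuned deployment.

    The shape mirrors the Cognitive Services deployments API; verify the
    SKU name and schema against current Azure docs before using.
    """
    return {
        "sku": {
            "name": "ProvisionedManaged",  # provisioned SKU (vs. "Standard")
            "capacity": ptus,              # your PTU allotment
        },
        "properties": {
            "model": {
                "format": "OpenAI",
                "name": fine_tuned_model,  # e.g. a fine-tuned model id
                "version": model_version,
            },
        },
    }

# Hypothetical fine-tuned model identifier, for illustration only.
body = provisioned_deployment_body("gpt-4o-mini-2024-07-18.ft-abc123", "1", 50)
```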
What's available in Public Preview?
We're offering provisioned deployments for both gpt-4o (2024-08-06) and gpt-4o-mini (2024-07-18) in two regions to support Azure OpenAI Service customers:
- North Central US
- Switzerland West
If your workload requires regions other than the above, please submit a request so we can consider them for General Availability.
How do I get started?
If you don't already have PTU quota from base models, the easiest way to start shifting your fine-tuned deployments to provisioned is:
- Understand your workload needs. Is it spiky but with a baseline demand? Review some of our previous materials on right-sizing PTUs (or have Copilot summarize them for you).
- Estimate the PTUs you need for your workload by using the calculator.
- Increase your regional PTU quota, if required.
- Deploy your fine-tuned models to secure your Provisioned Throughput capacity.
- Make sure to purchase an Azure Reservation to cover your PTU usage to save big.
- Have a spiky workload? Combine PTU and Standard/Global Standard and configure your architecture for spillover.
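The spillover pattern in the last step can be sketched as a simple routing rule: send traffic to the provisioned deployment first, and fall back to a Standard/Global Standard deployment when provisioned capacity is saturated (HTTP 429). The callables below are hypothetical stand-ins for real Azure OpenAI client calls against two deployments.

```python
class RateLimited(Exception):
    """Stand-in for an HTTP 429 from a saturated provisioned deployment."""

def with_spillover(call_provisioned, call_standard, prompt: str):
    """Route a request to provisioned capacity first, spilling over to
    Standard/Global Standard only when the provisioned deployment is full.

    call_provisioned and call_standard are hypothetical stand-ins for
    client calls against two Azure OpenAI deployments.
    """
    try:
        return call_provisioned(prompt)
    except RateLimited:
        # Provisioned capacity exhausted: pay-as-you-go absorbs the spike.
        return call_standard(prompt)
```

In production you would typically also add retry-with-backoff before spilling over, so short bursts stay on the cheaper provisioned capacity.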
Have feedback as you continue on your PTU journey with Azure OpenAI Service? Let us know how we can make it better!
Updated Feb 26, 2025
Version 1.0
davevoutila
Microsoft
AI - Azure AI services Blog