AI - Azure AI services Blog

Azure APIM Cost Rate Limiting with Cosmos & Flex Functions

hieunhu (Microsoft)
Mar 05, 2025

Azure API Management (APIM) provides built-in rate limiting policies, but dollar-cost quota management for Azure OpenAI services requires a more tailored approach. This solution combines Azure Functions, Cosmos DB, and stored procedures to implement cost-based quota management with automatic renewal periods.

Architecture

Client → APIM (with RateLimitConfig) → Azure Function Proxy → Azure OpenAI
                                                ↓
                                    Cosmos DB (quota tracking)

Technical Implementation

1. Rate Limit Configuration in APIM

The rate limiting configuration is injected into the request body by APIM using a policy fragment. Here's an example for a basic $5 quota:

<set-variable name="rateLimitConfig" value="@{
    var productId = context.Product.Id;
    var config = new JObject();
    config["counterKey"] = productId;
    config["quota"] = 5; 
    return config.ToString();
}" />
<include-fragment fragment-id="RateLimitConfig" />
 

For more advanced scenarios, you can customize token costs. Here's an example for a $10 quota with custom token pricing:

 
<set-variable name="rateLimitConfig" value="@{
    var productId = context.Product.Id;
    var config = new JObject();
    config["counterKey"] = productId;
    config["startDate"] = "2025-03-02T00:00:00Z";
    config["renewal_period"] = 86400;
    config["explicitEndDate"] = null;
    config["quota"] = 10;
    config["input_cost_per_token"] = 0.00003;
    config["output_cost_per_token"] = 0.00006;
    return config.ToString();
}" />
<include-fragment fragment-id="RateLimitConfig" />

Flexible Counter Keys

The counterKey parameter is highly flexible and can be set to any unique identifier that makes sense for your rate limiting strategy:

  • Product ID: Limit all users of a specific APIM product (e.g., "starter", "professional")
  • User ID: Apply individual limits per user
  • Subscription ID: Track usage at the subscription level
  • Custom combinations: Combine identifiers for granular control (e.g., "product_starter_user_12345")

Rate Limit Configuration Parameters

Parameter | Description | Example Value | Required
counterKey | Unique identifier for tracking quota usage | "starter10" or "user_12345" | Yes
quota | Maximum cost allowed in the renewal period | 10 | Yes
startDate | When the quota period begins. If not provided, the system uses the time when the policy is first applied | "2025-03-02T00:00:00Z" | No
renewal_period | Seconds until quota resets (86400 = daily). If not provided, no automatic reset occurs | 86400 | No
endDate | Optional end date for the quota period | null or "2025-12-31T23:59:59Z" | No
input_cost_per_token | Custom cost per input token | 0.00003 | No
output_cost_per_token | Custom cost per output token | 0.00006 | No
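
As a rough illustration of how the proxy might read these parameters, here is a minimal sketch that parses the JSON configuration and checks the two required fields. The names `RateLimitConfig` and `parse_rate_limit_config` are hypothetical and not taken from the repository; the JSON keys follow the parameter table above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class RateLimitConfig:
    """Hypothetical container for the parameters listed in the table."""
    counter_key: str
    quota: float
    start_date: Optional[str] = None
    renewal_period: Optional[int] = None  # seconds
    end_date: Optional[str] = None
    input_cost_per_token: Optional[float] = None
    output_cost_per_token: Optional[float] = None


def parse_rate_limit_config(raw: str) -> RateLimitConfig:
    """Parse the JSON config string and reject it if a required field is missing."""
    data = json.loads(raw)
    for required in ("counterKey", "quota"):
        if data.get(required) is None:
            raise ValueError(f"missing required field: {required}")
    return RateLimitConfig(
        counter_key=str(data["counterKey"]),
        quota=float(data["quota"]),
        start_date=data.get("startDate"),
        renewal_period=data.get("renewal_period"),
        end_date=data.get("endDate"),
        input_cost_per_token=data.get("input_cost_per_token"),
        output_cost_per_token=data.get("output_cost_per_token"),
    )
```

Optional fields default to `None`, so downstream code can tell "not provided" apart from an explicit value.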

Scheduling and Time Windows

The time-based parameters work together to create flexible quota schedules:

  • If the current date falls outside the range defined by startDate and endDate, requests will be rejected with an error
  • The renewal window begins either on the specified startDate or when the policy is first applied
  • The renewal_period determines how frequently the accumulated cost resets to zero
  • Without a renewal_period, the quota accumulates indefinitely until the endDate is reached
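
The window rules above can be sketched in a few lines. This is an illustration under assumptions, not the repository's code: the function names are hypothetical, timestamps are assumed to be ISO 8601 strings in UTC, and renewal windows are assumed to be aligned to startDate.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is rewritten for older Pythons."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_active(now: datetime, start_date: Optional[str], end_date: Optional[str]) -> bool:
    """True when `now` falls inside the optional [startDate, endDate] range."""
    if start_date is not None and now < parse_utc(start_date):
        return False
    if end_date is not None and now > parse_utc(end_date):
        return False
    return True


def current_window_start(now: datetime, start: datetime,
                         renewal_period: Optional[int]) -> datetime:
    """Start of the renewal window containing `now`.

    Without a renewal period there is a single window beginning at `start`.
    Callers are expected to check is_active() first, so `now >= start`.
    """
    if renewal_period is None:
        return start
    elapsed = int((now - start).total_seconds())
    periods = elapsed // renewal_period
    return start + timedelta(seconds=periods * renewal_period)
```

With a startDate of 2025-03-02T00:00:00Z and a daily renewal_period, a request at noon on March 5 falls in the window that began at midnight on March 5.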

2. Quota Checking and Cost Tracking

The Azure Function performs two key operations:

  1. Pre-request quota check: Before processing each request, it checks whether the accumulated cost for the counter key has already exceeded the quota
  2. Post-request cost tracking: After a successful request, it calculates the cost and updates the accumulated usage
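
The two operations can be sketched as a single request handler. This is a simplified illustration, not the actual function: the in-memory `usage` dictionary stands in for Cosmos DB, `handle_request` and `call_backend` are hypothetical names, and rejecting when the accumulated cost has reached the quota (rather than strictly exceeded it) is an assumption.

```python
from typing import Callable, Dict, Tuple

# In-memory stand-in for the Cosmos DB container (hypothetical, for illustration).
usage: Dict[str, float] = {}


def handle_request(
    counter_key: str,
    quota: float,
    call_backend: Callable[[], Tuple[int, float]],
) -> int:
    """Return the HTTP status for one proxied request.

    `call_backend` forwards the request and returns (status, cost_in_dollars).
    """
    # 1. Pre-request check: reject once the accumulated cost has reached the quota.
    if usage.get(counter_key, 0.0) >= quota:
        return 429
    # 2. Forward the request, then record the cost only for successful calls.
    status, cost = call_backend()
    if 200 <= status < 300:
        usage[counter_key] = usage.get(counter_key, 0.0) + cost
    return status
```

Because the cost of a request is only known after the response arrives, the accumulated cost in this sketch can end slightly above the quota before the next request is rejected.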

Cost Calculation

For cost calculation, the system uses:

  • Custom pricing: If input_cost_per_token and output_cost_per_token are provided in the rate limit config
  • LiteLLM pricing: If custom pricing is not specified, the system falls back to LiteLLM's model prices for accurate cost estimation based on the model being used
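
As a sketch of the arithmetic, the cost of one request is the input token count times the input price plus the output token count times the output price. The function below is hypothetical; `lookup_default_prices` is an injected placeholder for whatever fallback the proxy uses (LiteLLM's model prices, per the text above), and treating the two custom prices as a pair is an assumption.

```python
from typing import Callable, Optional, Tuple

# (input_cost_per_token, output_cost_per_token), both in dollars.
Prices = Tuple[float, float]


def request_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    custom_prices: Optional[Prices],
    lookup_default_prices: Callable[[str], Prices],
) -> float:
    """Dollar cost of one request, preferring custom prices when provided."""
    if custom_prices is not None:
        input_price, output_price = custom_prices
    else:
        # Fallback: per-model default prices supplied by the caller.
        input_price, output_price = lookup_default_prices(model)
    return input_tokens * input_price + output_tokens * output_price
```

With the custom prices from the $10 example (0.00003 per input token and 0.00006 per output token), a request with 1,000 input tokens and 500 output tokens costs 0.03 + 0.03 = 0.06 dollars.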

The function returns appropriate HTTP status codes and headers:

  • HTTP 429 (Too Many Requests) when quota is exceeded
  • Response headers with usage information:
    x-counter-key: starter5
    x-accumulated-cost: 5.000915
    x-quota: 5
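
A minimal sketch of how such a response could be assembled follows. The header names come from the example above; the function names, the response body text, and the six-decimal cost formatting are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# (status code, headers, body)
Response = Tuple[int, Dict[str, str], str]


def quota_headers(counter_key: str, accumulated_cost: float, quota: float) -> Dict[str, str]:
    """Build the usage headers; six-decimal cost formatting is an assumption."""
    return {
        "x-counter-key": counter_key,
        "x-accumulated-cost": f"{accumulated_cost:.6f}",
        "x-quota": f"{quota:g}",
    }


def reject_if_exceeded(counter_key: str, accumulated_cost: float,
                       quota: float) -> Optional[Response]:
    """Return a 429 response when the quota is used up, otherwise None."""
    if accumulated_cost < quota:
        return None
    headers = quota_headers(counter_key, accumulated_cost, quota)
    return 429, headers, "Quota exceeded"  # body text is hypothetical
```

The same `quota_headers` helper could also decorate successful responses, so clients can watch their remaining budget.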

3. Cosmos DB for State Management

Cosmos DB stores the quota state in documents such as the following:

{
    "id": "starter5",
    "counterKey": "starter5",
    "accumulatedCost": 5.000915,
    "startDate": "2025-03-02T00:00:00.000Z",
    "renewalPeriod": 86400,
    "renewalStart": 1741132800000,
    "endDate": null,
    "quota": 5
}

 

A stored procedure handles atomic updates to ensure accurate tracking, including:

  • Adding costs to the accumulated total
  • Automatically resetting costs when the renewal period is reached
  • Updating quota values when configuration changes
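
Cosmos DB stored procedures are written in JavaScript and run as a single transaction within one partition. To keep the examples in one language, the update logic is sketched here in Python as a pure function over the document. It is an illustration only: `apply_usage` is a hypothetical name, and it assumes `renewalPeriod` is in seconds and `renewalStart` is in epoch milliseconds, as the sample document suggests.

```python
from typing import Any, Dict, Optional


def apply_usage(doc: Dict[str, Any], cost: float, now_ms: int,
                quota: Optional[float] = None) -> Dict[str, Any]:
    """Return an updated copy of the quota document after one request."""
    updated = dict(doc)
    # Pick up a changed quota from the latest configuration.
    if quota is not None:
        updated["quota"] = quota
    period_s = updated.get("renewalPeriod")
    if period_s:
        period_ms = period_s * 1000
        elapsed = now_ms - updated["renewalStart"]
        if elapsed >= period_ms:
            # Advance to the start of the window containing now_ms and reset.
            updated["renewalStart"] += (elapsed // period_ms) * period_ms
            updated["accumulatedCost"] = 0.0
    # Add the cost of the current request to the running total.
    updated["accumulatedCost"] = updated.get("accumulatedCost", 0.0) + cost
    return updated
```

Running the read, reset, and add steps in one place is what makes the update atomic in the stored procedure; in this sketch that guarantee is not reproduced and would have to come from the database.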

Benefits

  1. Fine-grained Cost Control: Track actual API usage costs rather than just request counts
  2. Flexible Quotas: Set daily, weekly, or monthly quotas with automatic renewal
  3. Transparent Usage: Response headers provide real-time quota usage information
  4. Product Differentiation: Different APIM products can have different quota levels
  5. Custom Pricing: Override default token costs for special pricing tiers
  6. Flexible Tracking: Use any identifier as the counter key for versatile quota management
  7. Time-based Scheduling: Define active periods and automatic reset windows for quota management

Getting Started

  1. Deploy the Azure Function with Cosmos DB integration
  2. Configure APIM policies to include rate limit configuration
  3. Set up different product policies for various quota levels

For a detailed implementation, visit our GitHub repository.

Demo Video: https://www.youtube.com/watch?v=vMX86_XpSAo

 

Tags: #AzureOpenAI #APIM #CosmosDB #RateLimiting #Serverless

Updated Mar 06, 2025
Version 5.0
  • MarcusLow77 (Copper Contributor)

    Thanks for sharing, Hieu. APIM already carries a substantial cost (though the objective is resiliency). Any chance we can skip the Azure Function to remove this layer?

    • hieunhu (Microsoft)

      Hi Marcus,

      Thanks for the feedback. You might consider the APIM Standard v2 tier, which offers VNET integration at a lower cost point. The Azure Function I’m using handles cost lookups, latency measurement, and real-time quota checks. While it’s theoretically possible to embed some of that logic within APIM policies, doing so makes debugging more challenging compared to using a full programming language. Moreover, streaming token usage can have delayed reporting in Application Insights, which can pose problems for enforcing quotas in real time at scale.

      I’m currently using the Flex Consumption tier for Azure Functions. It comes with a generous free grant (250,000 executions and 100,000 GB-s per month) and supports VNET integration, making it a sustainable choice overall.

      Hope that helps clarify things.
      Hieu