Since last November, the model benchmarks experience in Azure AI Foundry has helped organizations discover, compare, and select GenAI models based on key performance metrics. Model benchmarks provide a curated list of the best-performing models for a given inference task (e.g., chat completion, summarization, or forecasting), helping developers quickly identify top contenders for their application.
This month, we are thrilled to announce improvements to the model benchmarks experience for even easier navigation and more comprehensive metrics and model coverage to help organizations find the best model for their unique needs. These updates include: direct integration within the Azure AI model catalog, new performance and cost metrics, and the ability to evaluate and compare models using your own private data. Let’s dive in!
Model benchmarks are now integrated into the Azure AI model catalog
We are excited to announce a completely refreshed model benchmarks experience in the Azure AI Foundry portal. Model benchmarks are now fully integrated into the Azure AI model catalog, allowing users to find and use benchmarks with greater ease as they explore vast model options. Models that provide benchmarking results are denoted with a bar graph icon (Fig. 1).
Fig 1. Find models in the Azure AI model catalog that have benchmark data available

When a user selects a model with benchmarking data available, the model's details page opens with a new "benchmarks" tab at the top. Clicking this tab reveals benchmark results for the selected model and similar models for comparison (Fig. 2).
The benchmarks tab includes:
- Index scores: Quickly assess the model’s generation quality, cost, latency, and throughput with average scores.
- Comparative charts: Visualize the model's performance relative to other models.
- Metric comparison table: View detailed results for each individual metric.
By default, an average score, or index, across metrics and datasets provides a high-level performance summary. To see specific metrics and datasets used to calculate the index score, users can simply click the expand button on any chart for a detailed view.
Fig 2. View benchmark results for a selected model and comparable models from within the model card

For users still exploring model options, the Compare with more models button on the main model catalog page provides a quick way to browse models with benchmark data. Here, users can examine each model's metrics and evaluate trade-offs across key performance criteria, including generation quality, cost, and latency (Fig. 3).
Fig 3. Compare the trade-offs between two metrics for each selected model

Users can also click the List view tab in the top right corner to view the relevant datasets in a list. This list provides clickable links to dataset details and score details for the individual benchmark runs (Fig. 4).
Fig 4. View benchmark details, including the model version and dataset used to derive a score

These streamlined views make it even easier for organizations to identify and select the models that best align with their specific requirements.
Introducing new performance and cost benchmarks
We are excited to announce an expansion of our supported benchmarks with the release of new performance and cost benchmarks. With the addition of metrics such as latency, throughput, and estimated cost, users can better assess whether a model is right for their scenario across diverse variables, and our comparison tool makes it easy to visualize these trade-offs. Note that performance and cost benchmarks are updated on a regular basis, so they are subject to change.
Explore new performance benchmarks
Our performance benchmarks focus on measuring a model's latency and throughput. Latency is the time it takes for a model to process a request, measured in seconds. Throughput is the number of tokens a model can process per second. Together, these measures give users an overall picture of model efficiency.
The performance metrics we support are:
- Latency mean
- Latency (P50, P90, P95, P99)
- Latency TTFT (Time to first token)
- Throughput GTPS (Generated tokens per second)
- Throughput TTPS (Total tokens per second)
- Time between tokens
Find more detail about these metrics and how they are calculated in our documentation.
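To make these definitions concrete, here is a minimal Python sketch that computes the same family of statistics from per-request timing logs. The record fields and the helper below are illustrative assumptions rather than the schema or formulas Azure AI uses internally; the documentation linked above remains the authoritative definition of each metric.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RequestTiming:
    # Hypothetical per-request log fields; Azure's internal schema may differ.
    request_start: float     # seconds (wall clock) when the request was sent
    first_token_time: float  # seconds when the first generated token arrived
    end_time: float          # seconds when the final token arrived
    generated_tokens: int    # tokens produced by the model
    total_tokens: int        # prompt tokens + generated tokens


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile of a list of floats (p in 0-100)."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]


def performance_metrics(runs: list[RequestTiming]) -> dict:
    latencies = [r.end_time - r.request_start for r in runs]      # end-to-end seconds
    ttfts = [r.first_token_time - r.request_start for r in runs]  # time to first token
    return {
        "latency_mean": statistics.mean(latencies),
        "latency_p50": percentile(latencies, 50),
        "latency_p90": percentile(latencies, 90),
        "latency_p95": percentile(latencies, 95),
        "latency_p99": percentile(latencies, 99),
        "latency_ttft_mean": statistics.mean(ttfts),
        # Throughput: generated tokens per second (GTPS) and total tokens per second (TTPS).
        "throughput_gtps": statistics.mean(
            r.generated_tokens / (r.end_time - r.request_start) for r in runs
        ),
        "throughput_ttps": statistics.mean(
            r.total_tokens / (r.end_time - r.request_start) for r in runs
        ),
        # Average time between consecutive generated tokens after the first one.
        "time_between_tokens": statistics.mean(
            (r.end_time - r.first_token_time) / max(r.generated_tokens - 1, 1) for r in runs
        ),
    }
```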
Explore new cost benchmarks
Our cost benchmarks can help users understand the approximate monetary cost of deploying and using a model without having to actually deploy it. Today, we support cost estimates for Models as a Service (MaaS) and Azure OpenAI Service models. These estimates represent model endpoints hosted by Azure AI. The calculations we support are:
- Cost per input tokens (per 1M tokens)
- Cost per output tokens (per 1M tokens)
- Estimated cost
Find more detail about these metrics and how they are calculated in our documentation.
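As a rough illustration of how per-1M-token pricing translates into an estimated cost, here is a small Python sketch. The prices and token counts are placeholders to be replaced with the values published for a given model, and the estimated-cost benchmark may weight workloads differently, so treat this as an assumption rather than the exact formula.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    price_per_1m_input: float,   # currency units per 1M input tokens (from the model's pricing)
    price_per_1m_output: float,  # currency units per 1M output tokens
) -> float:
    """Rough workload cost estimate, assuming simple per-token pricing."""
    return (
        input_tokens / 1_000_000 * price_per_1m_input
        + output_tokens / 1_000_000 * price_per_1m_output
    )


# Example: 20M input tokens and 5M output tokens with placeholder prices.
print(estimate_cost(20_000_000, 5_000_000, price_per_1m_input=0.50, price_per_1m_output=1.50))
```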
Indexes
With the addition of these granular performance and cost benchmarks on top of pre-existing benchmarks for quality, some customers may prefer to start with a top-level view that helps summarize each model’s performance within each metric category before diving into details. That’s why we provide a summary score, or index, for each benchmark (Fig. 5). The goal of these indexes is to act as a general guide to how a model performs in that category of measurement.
| Index category | Interpretation |
| --- | --- |
| Quality | Calculated by scaling GPTSimilarity to a 0–1 range, then averaging it with the accuracy metrics. The higher, the better. |
| Latency | Mean latency, measured as time to first token (TTFT). The lower, the better. |
| Throughput | Mean throughput, measured in generated tokens per second (GTPS). The higher, the better. |
| Cost | Estimated cost. The lower, the better. |
Find more detail about these metrics and how they are calculated in our documentation.
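As an illustration of the quality index described in the table, the sketch below rescales GPTSimilarity to 0–1 and averages it with accuracy metrics. It assumes GPTSimilarity is reported on a 1–5 scale and that all metrics are weighted equally; both are assumptions, since the exact scale and weighting are defined in the documentation.

```python
def quality_index(gpt_similarity: float, accuracy_metrics: list[float]) -> float:
    """Compute a quality index: rescale GPTSimilarity to 0-1, then average
    it together with the accuracy metrics (assumed to already be 0-1 scores).

    Assumption: GPTSimilarity is reported on a 1-5 scale; adjust the
    rescaling below if the published scale differs.
    """
    similarity_scaled = (gpt_similarity - 1.0) / 4.0  # map 1-5 onto 0-1
    scores = [similarity_scaled, *accuracy_metrics]   # equal weighting assumed
    return sum(scores) / len(scores)


# Example with placeholder scores: GPTSimilarity of 4.2 plus two accuracy metrics.
print(round(quality_index(4.2, [0.78, 0.83]), 3))
```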
Fig 5. Understand model performance at a glance with index scores

Evaluate and compare base models using your own test data
While benchmarks on public data can help organizations identify models that meet their general criteria, most organizations also want to evaluate and compare models on their own data to ensure the models are a good fit for their specific use case. We are excited to announce new tools that make it easier for users to evaluate base models and fine-tuned models using their own test data. After exploring benchmark results, users can now click Try with your own data to generate the same set of metrics using their own data (Fig. 6).
Learn more about new evaluation tools in Azure AI Foundry on TechCommunity.
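For readers who want a feel for what evaluating on your own data means outside the portal, here is a minimal, hypothetical sketch that scores candidate models on a user-supplied JSONL test set. The call_model helper and the prompt/expected field names are placeholders, not a real API, and the portal's "Try with your own data" flow computes the full set of quality metrics rather than the simple exact-match accuracy shown here.

```python
import json


def call_model(model_name: str, prompt: str) -> str:
    """Hypothetical helper: invoke your deployed endpoint for `model_name`.

    Replace this with your actual inference call; it is a placeholder,
    not a real API.
    """
    raise NotImplementedError


def compare_models_on_own_data(test_file: str, model_names: list[str]) -> dict[str, float]:
    """Score each candidate model on a JSONL test set with 'prompt' and 'expected' fields."""
    with open(test_file) as f:
        examples = [json.loads(line) for line in f]
    results = {}
    for name in model_names:
        correct = sum(
            call_model(name, ex["prompt"]).strip() == ex["expected"].strip()
            for ex in examples
        )
        results[name] = correct / len(examples)  # exact-match accuracy, 0-1
    return results
```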
Fig 6. Select "Try with your own data" to calculate the same metrics using test data specific to your use case

Customize, scale and manage GenAI apps with Azure AI Foundry
Don't miss these other exciting announcements from Microsoft Ignite to help you get started with the right model for your use case:
Explore new models in the Azure AI model catalog
Explore new evaluation tools to assess GenAI models and apps for quality and safety
Whether you’re joining in person or online, we can’t wait to see you at Microsoft Ignite 2024. We’ll share the latest innovations from Azure AI and go deeper into best practices for model evaluation and selection with sessions like these:
Azure AI platform unlocking the AI revolution
Azure AI: Effortless model selection - explore, swap, and scale faster