Unlocking Function Calling with vLLM and Azure Machine Learning

alisoliman
Jan 31, 2025

Introduction

In this post, we’ll explain how to deploy LLMs on vLLM using Azure Machine Learning’s Managed Online Endpoints for efficient, scalable, and secure real-time inference. Next, we’ll look at function calling and how vLLM’s engine supports it. To get started, let’s briefly look at what vLLM and Managed Online Endpoints are.

You can find the full code examples on vllm-on-azure-machine-learning.

vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine designed for large language models (LLMs). It optimizes the serving and execution of LLMs by utilizing advanced memory management techniques, such as PagedAttention, which efficiently manages attention key and value memory. This allows for continuous batching of incoming requests and fast model execution, making vLLM a powerful tool for deploying and serving LLMs at scale.

vLLM supports seamless integration with popular Hugging Face models and offers various decoding algorithms, including parallel sampling and beam search. It also supports tensor parallelism and pipeline parallelism for distributed inference, making it a flexible and easy-to-use solution for LLM inference (see full docs).

Managed Online Endpoints in Azure Machine Learning

Managed Online Endpoints in Azure Machine Learning provide a streamlined and scalable way to deploy machine learning models for real-time inference. These endpoints handle the complexities of serving, scaling, securing, and monitoring models, allowing you to focus on building and improving your models without worrying about infrastructure management.

HuggingFace Model Deployment

Let’s go through deploying a HuggingFace model on Azure Machine Learning’s Managed Online Endpoints. For this, we’ll use a custom Dockerfile and configuration files to set up the deployment. As a model, we’ll be using meta-llama/Llama-3.1-8B-Instruct on a single Standard_NC24ads_A100_v4 instance.

Step 1: Create a custom Environment for vLLM on AzureML

First, we create a Dockerfile to define the environment for our model. For this, we’ll be using vLLM’s base container image, which has all the dependencies and drivers included:

# Start from vLLM's base image, which ships the OpenAI-compatible API server and GPU dependencies
FROM vllm/vllm-openai:latest

# Default model; overridden at deployment time via an environment variable
ENV MODEL_NAME facebook/opt-125m

# Launch the OpenAI-compatible server with the configured model and any extra vLLM arguments
ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server --model $MODEL_NAME $VLLM_ARGS

The idea here is that we can pass a model name via an environment variable, so that we can easily define which model to deploy at deployment time.

Next, we log into our Azure Machine Learning workspace:

az account set --subscription <subscription ID>
az configure --defaults workspace=<Azure Machine Learning workspace name> group=<resource group>

Now, we create an environment.yml file to specify the environment settings:

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: vllm
build:
  path: .
  dockerfile_path: Dockerfile

Then let’s build the environment:

az ml environment create -f environment.yml

Step 2: Deploy the AzureML Managed Online Endpoint

Time for deployment, so let’s first create an endpoint.yml file to define the Managed Online Endpoint:

$schema: https://azuremlsdk2.blob.core.windows.net/latest/managedOnlineEndpoint.schema.json
name: vllm-hf
auth_mode: key

Let’s create it:

az ml online-endpoint create -f endpoint.yml

For the next step, we’ll need the address of the Docker image we created. We can quickly get it from AzureML Studio -> Environments -> vllm:

Finally, we create a deployment.yml file to configure the deployment settings and deploy our desired model from HuggingFace via vLLM:

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: current
endpoint_name: vllm-hf
environment_variables:
  MODEL_NAME: meta-llama/Llama-3.1-8B-Instruct # define the model name using the identifier from Hugging Face
  VLLM_ARGS: "--enable-auto-tool-choice --tool-call-parser llama3_json"
  HUGGING_FACE_HUB_TOKEN: xxxxxxxxxxxxxx # Add your HF API key here
environment:
  image: xxxxxxxxx.azurecr.io/azureml/azureml_xxxxxxxxxxx # Replace with your own image
  inference_config:
    liveness_route:
      port: 8000
      path: /ping 
    readiness_route:
      port: 8000
      path: /health
    scoring_route:
      port: 8000
      path: /
instance_type: Standard_NC24ads_A100_v4
instance_count: 1
request_settings:
  request_timeout_ms: 60000
  max_concurrent_requests_per_instance: 16
liveness_probe:
  initial_delay: 10
  period: 10
  timeout: 2
  success_threshold: 1
  failure_threshold: 30
readiness_probe:
  initial_delay: 120
  period: 10
  timeout: 2
  success_threshold: 1
  failure_threshold: 30

Since vLLM does not support separate probes for readiness and liveness, we need to make sure that the model has fully loaded before the first probe fires. This is why we increased readiness_probe.initial_delay to 120s. For larger models, we should also follow vLLM’s documentation for tensor parallel inference (model on a single node but spanning multiple GPUs) by adding --tensor-parallel-size <NUM_OF_GPUs> to VLLM_ARGS. Since we’re using a single A100 GPU in our example (Standard_NC24ads_A100_v4), this is not required here.

The request_settings depend a bit on our instance type/size and might require some manual tuning to get the model running properly and efficiently. The goal is to find a good tradeoff between concurrency (max_concurrent_requests_per_instance) and queue time, in order to avoid either hitting request_timeout_ms on the endpoint side or any HTTP timeouts on the client side. Both scenarios result in HTTP 429, and the client would need to implement exponential backoff (e.g., via the tenacity library), as sketched below.
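
For illustration, here is a minimal client-side retry sketch using tenacity; the endpoint URL, key, and payload below are placeholders, not values produced by the deployment above:

import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

ENDPOINT_URL = "https://<your-endpoint>.inference.ml.azure.com/v1/completions"  # placeholder
API_KEY = "<your-deployment-key>"  # placeholder

class RateLimitedError(Exception):
    """Raised when the endpoint responds with HTTP 429."""

@retry(
    retry=retry_if_exception_type(RateLimitedError),
    wait=wait_exponential(multiplier=1, min=1, max=30),  # back off exponentially, capped at 30s
    stop=stop_after_attempt(5),
)
def query_endpoint(payload):
    # Send the request and convert throttling responses into a retryable exception
    response = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        json=payload,
        timeout=60,
    )
    if response.status_code == 429:
        raise RateLimitedError("Endpoint is throttling requests")
    response.raise_for_status()
    return response.json()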

Lastly, we can deploy the model:

az ml online-deployment create -f deployment.yml --all-traffic

By following these steps, we have deployed a HuggingFace model on Azure Machine Learning’s Managed Online Endpoints, ensuring efficient and scalable real-time inference. Time to test it!

Step 3: Testing the deployment

First, let’s get the endpoint’s scoring URI and API keys:

az ml online-endpoint show -n vllm-hf
az ml online-endpoint get-credentials -n vllm-hf
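
Alternatively, the same information can be retrieved from Python with the azure-ai-ml SDK; a minimal sketch, assuming you are authenticated via DefaultAzureCredential and that the subscription, resource group, and workspace values below are replaced with your own:

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Placeholders: use your own subscription, resource group, and workspace names
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

endpoint = ml_client.online_endpoints.get(name="vllm-hf")
keys = ml_client.online_endpoints.get_keys(name="vllm-hf")

print("Scoring URI:", endpoint.scoring_uri)
print("Primary key:", keys.primary_key)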

For completion models, we can then call the endpoint using this Python code snippet:

import requests

url = "https://vllm-hf.polandcentral.inference.ml.azure.com/v1/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer xxxxxxxxxxxx"
}
data = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 200,
    "temperature": 0.7
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Response:

{
    "id": "cmpl-98d658cf-6310-4c87-a24f-723dda6db176",
    "object": "text_completion",
    "created": 1738267352,
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "text": " top tourist destination known for its iconic Golden Gate Bridge, steep hills, vibrant neighborhoods, and cultural attractions. The city is a haven for foodies, with a diverse range of cuisines available, from seafood to Mexican to Chinese and more.\nOne of the best ways to experience San Francisco is by taking a ride on a historic cable car, which offers stunning views of the city and its surroundings. Explore the historic Fisherman's Wharf, a bustling waterfront district filled with seafood restaurants, street performers, and souvenir shops.\nVisit the vibrant neighborhoods of Haight-Ashbury and the Mission District, known for their colorful street art, independent shops, and lively music scenes. Take a stroll through Golden Gate Park, a sprawling urban park that features gardens, lakes, and walking and biking trails.\n\nThe city has a thriving arts and culture scene, with numerous museums, galleries, and performance venues. The San Francisco Museum of Modern Art (SFMOMA) is one of the largest modern art museums in",
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 5,
        "total_tokens": 205,
        "completion_tokens": 200,
        "prompt_tokens_details": null
    }
}

Works!

Function Calling

Function calling in the context of large language models (LLMs) refers to the model's ability to dynamically generate and call structured functions based on context, user input, or specific task requirements. It enables seamless interaction with APIs, databases, or external tools while leveraging the model's reasoning capabilities.

vLLM provides an OpenAI-compatible server that supports the Completions, Chat Completions, and Embeddings APIs. Developers can use the official OpenAI Python client or any HTTP client to interact with vLLM, making it straightforward to integrate into existing workflows.

Before running the code, ensure you have the OpenAI library installed by executing:

pip install openai
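
As a quick sanity check before adding tools, you can point the official OpenAI client at the deployed endpoint and issue a plain chat completion. A minimal sketch, where the base URL and key are placeholders for your own deployment:

from openai import OpenAI

# Placeholders: use your endpoint's scoring URI (with the /v1/ suffix) and key
client = OpenAI(
    api_key="<your-deployment-key>",
    base_url="https://vllm-hf.<region>.inference.ml.azure.com/v1/",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=50,
)
print(response.choices[0].message.content)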

The following code demonstrates the function-calling capabilities of vLLM using an example where the assistant retrieves information about historical events based on a provided date.

Let’s go through it step by step:

1. Defining a Custom Function: A query_historical_event function is defined, containing a dictionary of fictional historical events. This function serves as a callable endpoint for vLLM to retrieve information based on a user-specified date.

def query_historical_event(date):
    fictional_historical_events = {
        "1805-03-21": "On March 21, 1805, the Treaty of Varis signed by several European powers established the first coordinated effort to protect migratory bird species.",
        "1898-07-10": "On July 10, 1898, the Great Illumination Act was passed in London, mandating the installation of electric streetlights across all major cities in the United Kingdom.",
        "1923-09-05": "On September 5, 1923, the International Academy of Innovation was founded in Zurich, Switzerland, promoting global collaboration in scientific research.",
        "1940-02-14": "On February 14, 1940, the first underwater train tunnel connecting two countries was completed between France and the United Kingdom.",
        "1954-11-08": "On November 8, 1954, the Global Weather Watch Program was launched, pioneering the use of satellites for monitoring Earth's climate systems.",
        "1977-06-30": "On June 30, 1977, the first fully solar-powered town, Solaria, was inaugurated in Arizona, setting a benchmark for renewable energy communities.",
        "1983-12-12": "On December 12, 1983, the Universal Language Project introduced a simplified global auxiliary language intended to foster cross-cultural communication.",
        "1994-04-23": "On April 23, 1994, the Oceanic Research Pact was signed, marking a commitment by 40 nations to share oceanographic research and preserve marine ecosystems.",
        "2009-08-15": "On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours.",
        "2020-01-10": "On January 10, 2020, the World Clean Air Initiative achieved its milestone goal of reducing urban air pollution levels in 50 major cities globally."
    }
    return fictional_historical_events.get(date, f"No historical event information available for {date}.")

2. Tool Integration: The function is wrapped in a tools definition, which includes metadata such as the function’s name, description, and expected parameters (e.g., the date in YYYY-MM-DD format).

tools = [
    {
        "function": {
            "name": "query_historical_event",
            "description": "Provides information about a historical event that occurred on a specified date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {
                        "type": "string",
                        "description": "The date of the event in YYYY-MM-DD format."
                    },
                },
                "required": ["date"]
            }
        }
    }
]

3. Conversation Workflow:

  • The conversation starts with a system message setting the assistant's role and a user query about a specific date.
  • The assistant evaluates the query and decides if the custom function is needed.

messages = [
    {"role": "system", "content": "You are a knowledgeable assistant that can retrieve information about historical events."},
    {"role": "user", "content": "Can you tell me what happened on August 15, 2009?"},
]

4. Function Call Handling: If the assistant determines that the function is required, it:

  • Parses the function call and extracts the necessary parameters (e.g., date).
  • Executes the query_historical_event function with the provided arguments and returns the result to the user.

chat_response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    temperature=0.7,
    max_tokens=1024,
    top_p=0.9,
    frequency_penalty=0.5,
    presence_penalty=0.6,
    tools=tools,
    tool_choice='auto'
)

if chat_response.choices[0].message.tool_calls:
    date_argument = json.loads(
        chat_response.choices[0].message.tool_calls[0].function.arguments)
    date = date_argument.get("date", None)

    response = query_historical_event(date)
    print("Assistant response:", response)
else:
    print("Assistant response:", chat_response.choices[0].message.content)

Example Workflow

  • User Query: "Can you tell me what happened on August 15, 2009?"
  • Assistant Function Call: The assistant identifies the query’s intent and calls query_historical_event with the argument date="2009-08-15".
  • Response: The function retrieves the event: "On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours."

Full Code

from openai import OpenAI
import json

# Set up API client with the vLLM server settings
openai_api_key = "<your-deployment-key>"  # Add your endpoint key here
openai_api_base = "https://vllm-hf.eastus2.inference.ml.azure.com/v1/"
client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)

def query_historical_event(date):
    fictional_historical_events = {
        "1805-03-21": "On March 21, 1805, the Treaty of Varis signed by several European powers established the first coordinated effort to protect migratory bird species.",
        "1898-07-10": "On July 10, 1898, the Great Illumination Act was passed in London, mandating the installation of electric streetlights across all major cities in the United Kingdom.",
        "1923-09-05": "On September 5, 1923, the International Academy of Innovation was founded in Zurich, Switzerland, promoting global collaboration in scientific research.",
        "1940-02-14": "On February 14, 1940, the first underwater train tunnel connecting two countries was completed between France and the United Kingdom.",
        "1954-11-08": "On November 8, 1954, the Global Weather Watch Program was launched, pioneering the use of satellites for monitoring Earth's climate systems.",
        "1977-06-30": "On June 30, 1977, the first fully solar-powered town, Solaria, was inaugurated in Arizona, setting a benchmark for renewable energy communities.",
        "1983-12-12": "On December 12, 1983, the Universal Language Project introduced a simplified global auxiliary language intended to foster cross-cultural communication.",
        "1994-04-23": "On April 23, 1994, the Oceanic Research Pact was signed, marking a commitment by 40 nations to share oceanographic research and preserve marine ecosystems.",
        "2009-08-15": "On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours.",
        "2020-01-10": "On January 10, 2020, the World Clean Air Initiative achieved its milestone goal of reducing urban air pollution levels in 50 major cities globally."
    }
    return fictional_historical_events.get(date, f"No historical event information available for {date}.")

tools = [
    {
        "function": {
            "name": "query_historical_event",
            "description": "Provides information about a historical event that occurred on a specified date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {
                        "type": "string",
                        "description": "The date of the event in YYYY-MM-DD format."
                    },
                },
                "required": ["date"]
            }
        }
    }
]

messages = [
    {"role": "system", "content": "You are a knowledgeable assistant that can retrieve information about historical events."},
    {"role": "user", "content": "Can you tell me what happened on August 15, 2009?"},
]

chat_response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    temperature=0.7,
    max_tokens=1024,
    top_p=0.9,
    frequency_penalty=0.5,
    presence_penalty=0.6,
    tools=tools,
    tool_choice='auto'
)

if chat_response.choices[0].message.tool_calls:
    date_argument = json.loads(chat_response.choices[0].message.tool_calls[0].function.arguments)
    date = date_argument.get("date", None)
    print("Tool has been called with date:", date)

    response = query_historical_event(date)
    print("Assistant response:", response)
else:
    print("Assistant response:", chat_response.choices[0].message.content)

Response:

Tool has been called with date: 2009-08-15
Assistant response: On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours

You've successfully implemented function calling using your deployed Llama-3.1-8B model.
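
A common follow-up, which the example above stops short of, is to feed the tool result back to the model so it can phrase the final answer itself. A minimal sketch, reusing client, messages, tools, date, and chat_response from the full code above (the exact behavior depends on the model's chat template, so treat this as an outline rather than the tested flow from this post):

tool_call = chat_response.choices[0].message.tool_calls[0]

# Append the assistant's tool call and the tool's result to the conversation
messages.append(chat_response.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": query_historical_event(date),
})

# Ask the model for the final, user-facing answer
final_response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    tools=tools,
)
print("Final answer:", final_response.choices[0].message.content)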

Conclusion

To wrap up, deploying large language models on vLLM with Azure Machine Learning Managed Online Endpoints is a simple and effective way to enable real-time AI-powered applications. By following the steps shared—from setting up the environment to testing the deployment—you can quickly integrate advanced models like Llama-3.1-8B-Instruct into your workflows. With vLLM's optimized performance and support for function calling, your applications can handle complex tasks and interact with other systems seamlessly. This setup helps you build smarter, faster, and more scalable AI solutions.

Updated Jan 31, 2025
Version 3.0