How to Export an Agent Developed in Copilot Studio
If you've developed an agent in Copilot Studio and need to export it, follow these steps:
1. Open Copilot Studio: Navigate to https://copilotstudio.microsoft.com
2. Access Solutions: On the left-hand side menu, click on the three dots (...) and then select Solutions.
3. Create a New Solution: In the top menu bar, click on + New solution. Fill in the details for the new solution, including the display name, name, selecting or creating a new publisher, and defining the version, and then click Create.
4. Navigate to the Solution List: Return to the list of solutions. Find and click on the name of the solution that contains your agent, which is usually located within the default solutions.
5. Select the Agent: Click on Agents and select the agent you wish to export. In the top menu, or by clicking on the three dots (...), select Advanced and then + Add to solution.
6. Add the Agent to the Solution: Choose the solution you created earlier and click Save. Return to the list of solutions.
7. Export the Solution: Click on the solution to which you added the agent and click on Export solution. Click on Next. You have the option to enable the solution checker during the export process. Once you've made your selection, click on Export. When the export process is complete, the Download button becomes available. Click on Download to save the .zip file containing your agent.

By following these steps, you can successfully export your agent from Copilot Studio.

How to Import an Agent Developed in Copilot Studio
If you need to import an agent into Copilot Studio, follow these steps:
1. Open Copilot Studio: Navigate to https://copilotstudio.microsoft.com
2. Access Solutions: On the left-hand side menu, click on the three dots (...) and then select Solutions.
3. Import the Solution: In the top menu bar, click on Import solution. Click on the Browse button to open the file selection dialog. Navigate to the location where the solution file containing your agent is stored, select the file, and then click Open. After selecting the file, click on Next to proceed with the import process. Carefully review all the information to ensure it is correct. Once you have double-checked the details, click on Next to proceed, then click on Import to finish.
4. Verify the Import: Make sure that the solution was imported successfully. After the import is complete, go back to the Copilot Studio home page and check for your agent.

In this post, you learned how to export and import an agent in Copilot Studio. If you have any questions or run into issues, feel free to reach out for support. Happy exporting and importing!

120 Days Study Plan to Become an AI-Focused Full-Stack Software Engineer
Hello there, my name is Oumaima, and I am a Microsoft Learn Student Ambassador (MLSA) from Morocco, studying at the University of the People. Welcome to the first step in my exciting, unpredictable journey, one I've chosen to embark on with you!

For the past three years, I've watched the AI industry evolve dramatically. Generative AI has shifted from a fascinating experiment to an integral part of our everyday lives, whether at school, work, or even in our personal routines. In fact, my ChatGPT app is now my go-to therapist, lawyer, and all-around advisor! As a software engineering student for over three years, I've seen the growth of generative AI up close. But this shift didn't just inspire me; it made me realize that I don't want to remain only a consumer of this technology. I want to contribute to it! Seeing AI's ability to mimic human thought, draw connections from vast amounts of information, and deliver impressive results sparked something in me. It showed me that the best way to break into AI might just be to use AI itself as my guide. That's when the idea came to ask ChatGPT o1-preview for a personalized study plan, crafted uniquely for me. It takes into account my available time, coding background, learning preferences, mental health, and energy. Here's how my journey began with a simple prompt:

I want to become an AI-focused full-stack software engineer and have 120 days to dedicate to this goal. Please create a detailed 120-day study plan tailored for me, dedicating 3-4 hours daily. The study plan should:
- Cover all essential topics including programming foundations, data structures and algorithms (DS&A), mathematics for AI, machine learning fundamentals, deep learning, advanced AI topics, integrating AI into applications, web development basics for AI integration, advanced web development, full-stack project development, scripting, DevOps, and career development.
- Include weekly breakdowns and daily tasks.
- Provide recommended resources for each topic (e.g., online courses, tutorials, documentation).
- Suggest hands-on projects or exercises to apply the concepts learned.
- Incorporate tips for success, such as active engagement, seeking feedback, balancing depth and breadth, and maintaining well-being.
- Emphasize developing all the skills that will make me an irreplaceable software developer, including scripting and DevOps skills.
- Conclude with a summary and final advice.
Please ensure the plan is structured, comprehensive, and practical for someone balancing work and study.

It then generated the following plan, which I tried to follow using Microsoft Learn learning paths that offer in-depth training on each topic:
- Days 1–25: Programming Foundations & Data Structures and Algorithms (DS&A). Microsoft Learn path suggestion: Python for beginners
- Days 26–50: Mathematics for AI & Machine Learning Fundamentals. Microsoft Learn path suggestion: Introduction to machine learning
- Days 51–80: Deep Learning & Advanced AI Topics. Microsoft Learn path suggestion: Train and evaluate deep learning models
- Days 81–100: Integrating AI into Applications. Microsoft Learn path suggestion: Microsoft Azure AI Fundamentals: Generative AI
- Days 101–115: Advanced Web Development & Full-Stack Project Development. Microsoft Learn path suggestion: Build an AI web app by using Python and Flask
- Days 116–120: Portfolio Projects and Industry Trends.

Not going to lie, the roadmap turned out to be even more exciting than I'd expected!
When I asked for it, I specified that it should guide me through developing problem-solving skills directly tied to full-stack development. I wanted a path that not only sharpens my abilities but also allows me to build interesting, hands-on applications where I can see the results of what I'm learning. And now, my friends, the journey has officially begun! I'll be following the roadmap closely, documenting my weekly progress to learn AI, noting the challenges, and celebrating the accomplishments. The goal is to see if artificial intelligence can really help create a customized study plan that aligns with my personal goals, circumstances, and unique learning rhythm. So, stay tuned — this is only the beginning! See you in my first step with DSA!

The Future of AI: Harnessing AI for E-commerce - personalized shopping agents
Explore the development of personalized shopping agents that enhance user experience by providing tailored product recommendations based on uploaded images. Leveraging Azure AI Foundry, these agents analyze images for apparel recognition and generate intelligent product recommendations, creating a seamless and intuitive shopping experience for retail customers.

Fine-Tuning DeepSeek-R1-Distill-Llama-8B with PyTorch FSDP, QLoRA on Azure Machine Learning
Large Language Models (LLMs) have demonstrated remarkable capabilities across various industries, revolutionizing how we approach tasks like legal document summarization, creative content generation, and customer sentiment analysis. However, these general-purpose models often need to be adapted to excel in specific domains. This is where fine-tuning comes in, allowing us to tailor LLMs to meet unique requirements and improve their performance on targeted tasks. In this blog post, we'll explore the process of fine-tuning the DeepSeek-R1-Distill-Llama-8B model, highlighting the advantages of using PyTorch Fully Sharded Data Parallel (FSDP) and Quantization-Aware Low-Rank Adaptation (QLoRA) techniques in conjunction with the Azure Machine Learning platform.

Why Fine-Tuning Matters
In some cases, LLMs may not perform well on specific domains, tasks, or datasets, or may produce inaccurate or misleading outputs. In such cases, fine-tuning the model can be a useful technique to adapt it to the desired goal and improve its quality and reliability.
- Hallucinations: Hallucinations are untrue statements output by the model. They can harm the credibility and trustworthiness of your application. One possible mitigation is fine-tuning the model with data that contains accurate and consistent information.
- Accuracy and quality problems: Pre-trained models may not achieve the desired level of accuracy or quality for a specific task or domain. This shortfall can be due to a mismatch between the pre-training data and the target data, the diversity and complexity of the target data, and/or incorrect evaluation metrics and criteria.

DeepSeek-R1 is an open-source language model excelling in text-based tasks, including creative writing, question answering, editing, and summarization. It's particularly strong in reasoning-intensive tasks like coding, math, and explaining scientific concepts. DeepSeek-R1 stands out due to its mixture of experts (MoE) architecture and use of reinforcement learning, achieving high performance with greater efficiency and lower costs compared to other models. It has 671 billion parameters across multiple expert networks, but only 37 billion are required for a single forward pass. DeepSeek-R1 uses reinforcement learning (RL) to generate a chain-of-thought (CoT) before delivering its final answer. To make these capabilities more accessible, DeepSeek has distilled its R1 outputs into several smaller models based on the Qwen and Llama architectures:
- Qwen-based distilled models: 1.5B, 7B, 14B, and 32B
- Llama-based distilled models: 8B and 70B

DeepSeek-R1-Distill-Llama-8B is a distilled large language model (LLM) based on the Llama architecture, created using outputs from the larger DeepSeek-R1 model. Through knowledge distillation, the reasoning patterns of the 671-billion-parameter DeepSeek-R1 model are transferred into a smaller, more efficient model. DeepSeek-R1-Distill-Llama-8B has only 8 billion parameters, making it computationally efficient while retaining a significant portion of the original model's performance. It is fine-tuned from models like Llama-3.1-8B-Instruct, achieving high performance across multiple benchmarks. This distilled model offers a balance of performance and resource requirements, improving inference speed and reducing computational costs, making it cost-effective for production deployments.
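Before committing to a full fine-tuning run, it can be useful to smoke-test the base checkpoint locally. The sketch below is an illustrative addition (not part of the original walkthrough): it pulls DeepSeek-R1-Distill-Llama-8B from Hugging Face with the transformers library and generates a short answer, assuming torch, transformers, and accelerate are installed and a GPU with roughly 16 GB of memory is available.

```python
# Minimal sketch: sanity-check the DeepSeek-R1-Distill-Llama-8B checkpoint before fine-tuning.
# Assumes: pip install torch transformers accelerate, and a GPU with ~16 GB memory (bf16 weights).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half-precision weights to fit on a single GPU
    device_map="auto",            # let accelerate place the model on the available device(s)
)

# The distilled model ships with a chat template, so a simple prompt can be formatted with it.
messages = [{"role": "user", "content": "Explain in two sentences why distillation makes an 8B model cheaper to serve."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=256, temperature=0.6, do_sample=True)

# Print only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

If the model loads and produces a coherent chain-of-thought answer, the environment and checkpoint are in good shape for the fine-tuning steps that follow.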
PyTorch FSDP: Scaling Fine-Tuning with Data Parallelism PyTorch Fully Sharded Data Parallel (FSDP) is a distributed training framework that addresses the challenges of fine-tuning large models by sharding model parameters, optimizer states, and gradients across multiple GPUs. This technique enables you to train models with billions of parameters on systems with limited GPU memory. QLoRA: Efficient Fine-Tuning with Quantization and Low-Rank Adaptation Quantization-Aware Low-Rank Adaptation (QLoRA) is a parameter-efficient fine-tuning technique that reduces memory usage and accelerates training by quantizing the model weights and fine-tuning only a small subset of parameters. QLoRA leverages Low-Rank Adaptation (LoRA) to fine-tune only a small subset of the model’s parameters, making training faster and memory efficient. Azure Machine Learning: Your Platform for Scalable Fine-Tuning Azure Machine Learning provides a robust platform for fine-tuning LLMs, offering a comprehensive suite of tools and services to streamline the process. Scalable Compute: Azure Machine Learning Compute provides virtual machines (VMs) that run parts of the distributed deep learning job, auto-scaling as necessary. Azure Machine Learning compute clusters can schedule tasks, collect results, adjust resources to actual loads, and manage errors[5]. VMs that participate in the cluster can be GPU-enabled to accelerate deep learning calculations. Data Storage: Azure offers standard and premium blob storage options for storing training data and execution logs. Premium blob storage is used to store training data and enable high-performance access during model training, which is needed for distributed training. Experiment Tracking: Azure Machine Learning provides tools for tracking and managing your fine-tuning experiments, allowing you to monitor performance metrics and reproduce your results. Hands-on lab Now let’s start finetune and deploy the same on AML. Lets sets up an Azure Machine Learning (ML) client using the DefaultAzureCredential for authentication. It imports necessary libraries and handles exceptions during the ML client initialization. # import required libraries """ This script sets up an Azure Machine Learning (ML) client using the DefaultAzureCredential for authentication. It imports necessary libraries and handles exceptions during the ML client initialization. Modules imported: - time: Provides various time-related functions. - azure.identity: Provides authentication capabilities with DefaultAzureCredential and InteractiveBrowserCredential. - azure.ai.ml: Contains classes and functions for interacting with Azure ML services, including MLClient, Input, pipeline, load_component, command, Data, Environment, BuildContext, Model, Input, Output, and AssetTypes. - azure.core.exceptions: Contains exceptions for handling resource-related errors. - os: Provides a way to interact with the operating system. Variables: - credential: An instance of DefaultAzureCredential used for authenticating with Azure services. - ml_client: An instance of MLClient initialized using the provided credentials. If the initialization fails, an exception is caught and printed. 
""" import time from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential from azure.ai.ml import MLClient, Input from azure.ai.ml.dsl import pipeline from azure.ai.ml import load_component from azure.ai.ml import command from azure.ai.ml.entities import Data, Environment, BuildContext from azure.ai.ml.entities import Model from azure.ai.ml import Input from azure.ai.ml import Output from azure.ai.ml.constants import AssetTypes from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError import os credential = DefaultAzureCredential() ml_client = None try: ml_client = MLClient.from_config(credential) except Exception as ex: print(ex) Now lets install some libraries required to download the dataset and run the openai client. %conda run -n azureml_py310_sdkv2 pip install datasets==3.2.0 openai Lets create our training environment. os.makedirs("environment_train", exist_ok=True) Lets build our docker environment. %%writefile environment_train/Dockerfile FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch22x:biweekly.202501.3 USER root # support Deepspeed launcher requirement of passwordless ssh login RUN apt-get update && apt-get -y upgrade RUN pip install --upgrade pip RUN apt-get install -y openssh-server openssh-client # Install pip dependencies COPY requirements.txt . RUN pip install -r requirements.txt --no-cache-dir RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation Let’s also specify our requirements.txt %%writefile environment_train/requirements.txt transformers==4.48.2 peft==0.14.0 accelerate==1.3.0 bitsandbytes==0.45.1 datasets==3.2.0 evaluate==0.4.3 huggingface_hub[hf_transfer] safetensors>=0.5.2 sentencepiece==0.2.0 scikit-learn==1.6.1 tokenizers>=0.21.0 py7zr Once we specify both lets create the AML custom training environment. env_name = "deepseek-training" env_docker_image = Environment( build=BuildContext(path = "environment_train", dockerfile_path="Dockerfile"), name=env_name, description="Environment created for llm fine-tuning.", version="1" ) env_asset_train = ml_client.environments.create_or_update(env_docker_image) While the training environment is ready let’s start with the dataset preparation. from datasets import load_dataset import pandas as pd dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en") df = pd.DataFrame(dataset['train']) df = df.iloc[0:2000] df.head() Here is quick snapshot of what the dataset looks like Noe lets split the dataset into train and test for validation. from sklearn.model_selection import train_test_split train, test = train_test_split(df, test_size=0.1, random_state=42) print("Number of train elements: ", len(train)) print("Number of test elements: ", len(test)) Let’s create the prompt template to run the finetuning process. In this case we have used COT prompt template. # custom instruct prompt start prompt_template = f""" <|begin▁of▁sentence|> You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response. 
<|User|> {{question}} <|Assistant|> <think> {{complex_cot}} </think> {{answer}} <|end▁of▁sentence|> """ # template dataset to add prompt to each sample def template_dataset(sample): sample["text"] = prompt_template.format(question=sample["Question"], complex_cot=sample["Complex_CoT"], answer=sample["Response"]) return sample Let’s run the mapping of this prompt through the whole dataset and create train and test jsonl files.. from datasets import Dataset, DatasetDict from random import randint train_dataset = Dataset.from_pandas(train) test_dataset = Dataset.from_pandas(test) dataset = DatasetDict({"train": train_dataset, "test": test_dataset}) train_dataset = dataset["train"].map(template_dataset, remove_columns=list(dataset["train"].features)) print(train_dataset[randint(0, len(dataset))]["text"]) test_dataset = dataset["test"].map(template_dataset, remove_columns=list(dataset["test"].features)) train_dataset.to_json(f"data/train.jsonl") test_dataset.to_json(f"data/eval.jsonl") Now let’s start creating our training script. os.makedirs("src_train", exist_ok=True) write the train.py which uses both Qlora and PyTorch FSDP. %%writefile src_train/train.py import os import argparse import sys import logging from accelerate import Accelerator import datetime from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training import torch from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, set_seed import transformers import traceback from huggingface_hub import snapshot_download from datasets import load_dataset def download_model(model_name): print("Downloading model ", model_name) os.makedirs("/tmp/tmp_folder", exist_ok=True) snapshot_download(repo_id=model_name, local_dir="/tmp/tmp_folder") print(f"Model {model_name} downloaded under /tmp/tmp_folder") def init_distributed(): # Initialize the process group torch.distributed.init_process_group( backend="nccl", # Use "gloo" backend for CPU timeout=datetime.timedelta(seconds=5400) ) local_rank = int(os.environ["LOCAL_RANK"]) torch.cuda.set_device(local_rank) return local_rank def main(args): model_name = args.model_name_or_path train_ds = load_dataset('json', data_files=args.train_file, split='train') test_ds = load_dataset('json', data_files=args.eval_file, split='train') per_device_train_batch_size=args.train_batch_size per_device_eval_batch_size=args.eval_batch_size gradient_accumulation_steps=args.grad_accum_steps learning_rate=args.learning_rate num_train_epochs=args.epochs lora_r=8 lora_alpha=16 lora_dropout=0.1 fsdp="full_shard auto_wrap offload" fsdp_config={ 'backward_prefetch': 'backward_pre', 'cpu_ram_efficient_loading': True, 'offload_params': True, 'forward_prefetch': False, 'use_orig_params': False } gradient_checkpointing=False merge_weights=True seed=42 token=None model_dir = args.model_dir if torch.cuda.is_available() and (torch.cuda.device_count() > 1 or int(os.environ.get("SM_HOST_COUNT", 1)) > 1): # Call this function at the beginning of your script local_rank = init_distributed() # Now you can use distributed functionalities torch.distributed.barrier(device_ids=[local_rank]) os.environ.update({"HF_HUB_ENABLE_HF_TRANSFER": "1"}) set_seed(seed) accelerator = Accelerator() if token is not None: os.environ.update({"HF_TOKEN": token}) accelerator.wait_for_everyone() if int(os.environ.get("SM_HOST_COUNT", 1)) == 1: if accelerator.is_main_process: download_model(model_name) else: download_model(model_name) accelerator.wait_for_everyone() model_name = "/tmp/tmp_folder" 
tokenizer = AutoTokenizer.from_pretrained(model_name) # Set Tokenizer pad Token tokenizer.pad_token = tokenizer.eos_token with accelerator.main_process_first(): # tokenize and chunk dataset lm_train_dataset = train_ds.map( lambda sample: tokenizer(sample["text"]), remove_columns=list(train_ds.features) ) print(f"Total number of train samples: {len(lm_train_dataset)}") if test_ds is not None: lm_test_dataset = test_ds.map( lambda sample: tokenizer(sample["text"]), remove_columns=list(test_ds.features) ) print(f"Total number of test samples: {len(lm_test_dataset)}") else: lm_test_dataset = None torch_dtype = torch.bfloat16 # Defining additional configs for FSDP if fsdp != "" and fsdp_config is not None: bnb_config_params = { "bnb_4bit_quant_storage": torch_dtype } model_configs = { "torch_dtype": torch_dtype } fsdp_configurations = { "fsdp": fsdp, "fsdp_config": fsdp_config, "gradient_checkpointing_kwargs": { "use_reentrant": False }, "tf32": True } else: bnb_config_params = dict() model_configs = dict() fsdp_configurations = dict() bnb_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch_dtype, **bnb_config_params ) model = AutoModelForCausalLM.from_pretrained( model_name, trust_remote_code=True, quantization_config=bnb_config, attn_implementation="flash_attention_2", use_cache=not gradient_checkpointing, cache_dir="/tmp/.cache", **model_configs ) if fsdp == "" and fsdp_config is None: model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=gradient_checkpointing) if gradient_checkpointing: model.gradient_checkpointing_enable() config = LoraConfig( r=lora_r, lora_alpha=lora_alpha, target_modules="all-linear", lora_dropout=lora_dropout, bias="none", task_type="CAUSAL_LM" ) model = get_peft_model(model, config) trainer = transformers.Trainer( model=model, train_dataset=lm_train_dataset, eval_dataset=lm_test_dataset if lm_test_dataset is not None else None, args=transformers.TrainingArguments( per_device_train_batch_size=per_device_train_batch_size, per_device_eval_batch_size=per_device_eval_batch_size, gradient_accumulation_steps=gradient_accumulation_steps, gradient_checkpointing=gradient_checkpointing, logging_strategy="steps", logging_steps=1, log_on_each_node=False, num_train_epochs=num_train_epochs, learning_rate=learning_rate, bf16=True, ddp_find_unused_parameters=False, save_strategy="no", output_dir="outputs", **fsdp_configurations ), data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False), ) if trainer.accelerator.is_main_process: trainer.model.print_trainable_parameters() trainer.train() if trainer.is_fsdp_enabled: trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT") if merge_weights: output_dir = "/tmp/model" # merge adapter weights with base model and save # save int 4 model trainer.model.save_pretrained(output_dir, safe_serialization=False) if accelerator.is_main_process: # clear memory del model del trainer torch.cuda.empty_cache() # load PEFT model model = AutoPeftModelForCausalLM.from_pretrained( output_dir, torch_dtype=torch.float16, low_cpu_mem_usage=True, trust_remote_code=True, ) # Merge LoRA and base model and save model = model.merge_and_unload() model.save_pretrained( model_dir, safe_serialization=True, max_shard_size="2GB" ) else: trainer.model.save_pretrained( model_dir, safe_serialization=True ) if accelerator.is_main_process: tokenizer.save_pretrained(model_dir) accelerator.wait_for_everyone() def parse_args(): # setup 
argparse parser = argparse.ArgumentParser() # curr_time = datetime.now().strftime("%Y-%m-%d_%H:%M:%S") # hyperparameters parser.add_argument("--model_name_or_path", default="deepseek-ai/DeepSeek-R1-Distill-Llama-8B", type=str, help="Input directory for training") parser.add_argument("--train_file", type=str, help="Input data for training") parser.add_argument("--eval_file", type=str, help="Input data for eval") parser.add_argument("--epochs", default=1, type=int, help="number of epochs") parser.add_argument("--train_batch_size", default=2, type=int, help="training - mini batch size for each gpu/process") parser.add_argument("--eval_batch_size", default=4, type=int, help="evaluation - mini batch size for each gpu/process") parser.add_argument("--grad_accum_steps", default=4, type=int, help="gradient accumulation steps") parser.add_argument("--learning_rate", default=2e-4, type=float, help="learning rate") parser.add_argument("--save_merged_model", type=bool, default=False) parser.add_argument("--model_dir", type=str, default="./", help="output directory for model") # parse args args = parser.parse_args() # return args return args if __name__ == "__main__": #sys.argv = [''] args = parse_args() main(args) Next step is to create a compute cluster on which the training will run. azure_compute_cluster_name = "a100-compute" azure_compute_cluster_size = "Standard_NC24ads_A100_v4" USE_LOWPRIORITY_VM = True from azure.ai.ml.entities import AmlCompute ### Create the compute cluster try: compute = ml_client.compute.get(azure_compute_cluster_name) except Exception as ex: try: tier = "LowPriority" if USE_LOWPRIORITY_VM else "Dedicated" compute = AmlCompute( name=azure_compute_cluster_name, size=azure_compute_cluster_size, tier=tier, max_instances=1, # For multi node training set this to an integer value more than 1 ) ml_client.compute.begin_create_or_update(compute).wait() except Exception as e: print(e) Once the compute is ready, lets run the training job. from azure.ai.ml import command from azure.ai.ml import Input from azure.ai.ml.entities import ResourceConfiguration str_command = "" str_command += "python train.py --train_file ${{inputs.train_file}} --eval_file ${{inputs.eval_file}} \ --epochs ${{inputs.epoch}} --train_batch_size ${{inputs.train_batch_size}} \ --eval_batch_size ${{inputs.eval_batch_size}} --model_name_or_path ${{inputs.model_name_or_path}} \ --model_dir ${{inputs.model_dir}} --save_merged_model ${{inputs.save_merged_model}}" job = command( inputs=dict( train_file=Input( type="uri_file", path="data/train.jsonl", ), eval_file=Input( type="uri_file", path="data/eval.jsonl", ), epoch=1, train_batch_size=2, eval_batch_size=1, model_name_or_path="deepseek-ai/DeepSeek-R1-Distill-Llama-8B", model_dir="./outputs", save_merged_model = True ), code="./src_train", # local path where the code is stored compute=azure_compute_cluster_name, command=str_command, environment=env_asset_train, distribution={ "type": "PyTorch", "process_count_per_instance": 1, # For multi-gpu training set this to an integer value more than 1 }, ) returned_job = ml_client.jobs.create_or_update(job) ml_client.jobs.stream(returned_job.name) Once the training is completed, lets register the model as a custom model type. 
from azure.ai.ml.entities import Model from azure.ai.ml.constants import AssetTypes run_model = Model( path=f"azureml://jobs/{returned_job.name}/outputs/artifacts/paths/outputs/", name="deepseekr1-dist-llama8bft", description="Model created from run.", type=AssetTypes.CUSTOM_MODEL, ) model = ml_client.models.create_or_update(run_model) Once the model is registered the next step is to deploy the same as Online Managed Endpoint. from azure.ai.ml.entities import ( ManagedOnlineEndpoint, IdentityConfiguration, ManagedIdentityConfiguration, ) endpoint_name = "deepseekr1-dist-llama8bft-ep" # Check if the endpoint already exists in the workspace try: endpoint = ml_client.online_endpoints.get(endpoint_name) print("---Endpoint already exists---") except: # Create an online endpoint if it doesn't exist # Define the endpoint endpoint = ManagedOnlineEndpoint( name=endpoint_name, description=f"Test endpoint for {model.name}" ) # Trigger the endpoint creation try: ml_client.begin_create_or_update(endpoint).wait() print("\n---Endpoint created successfully---\n") except Exception as err: raise RuntimeError( f"Endpoint creation failed. Detailed Response:\n{err}" ) from err Let’s define the deployment name , SKU type of the VM and Request timeout parameter. # Initialize deployment parameters deployment_name = "deepseekr1-dist-llama8bftd-eploy" sku_name = "Standard_NC24ads_A100_v4" REQUEST_TIMEOUT_MS = 90000 os.makedirs("environment_inf", exist_ok=True) Lets create the environment for our inference . %%writefile environment_inf/Dockerfile FROM vllm/vllm-openai:latest ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server --model $MODEL_NAME $VLLM_ARGS Let’s build the environment with the docker file created above. from azure.ai.ml.entities import Environment, BuildContext env_docker_image = Environment( build=BuildContext(path="environment_inf", dockerfile_path= "Dockerfile"), name="vllm-custom", description="Environment created from a Docker context.", inference_config={ "liveness_route": { "port": 8000, "path": "/health", }, "readiness_route": { "port": 8000, "path": "/health", }, "scoring_route": { "port": 8000, "path": "/", }, }, ) env_asset_inf = ml_client.environments.create_or_update(env_docker_image) Once our environment for inference server is ready let’s do the deployment. Lets define some environment variables model_path = f"/var/azureml-app/azureml-models/{model.name}/{model.version}/outputs" env_vars = { "MODEL_NAME": model_path, "VLLM_ARGS": "--max-model-len 16000 --enforce-eager", } deployment_env_vars = {**env_vars} Lets do the deployment now. import time from azure.ai.ml.entities import ( OnlineRequestSettings, CodeConfiguration, ManagedOnlineDeployment, ProbeSettings, Environment ) t0 = time.time() deployment = ManagedOnlineDeployment( name= deployment_name, endpoint_name=endpoint_name, model=model, instance_type=sku_name, instance_count=1, environment_variables=deployment_env_vars, environment=env_asset_inf, request_settings=OnlineRequestSettings( max_concurrent_requests_per_instance=2, request_timeout_ms=50000, max_queue_wait_ms=60000 ), liveness_probe=ProbeSettings( failure_threshold=5, success_threshold=1, timeout=10, period=30, initial_delay=120 ), readiness_probe=ProbeSettings( failure_threshold=30, success_threshold=1, timeout=2, period=10, initial_delay=120, ), ) # Trigger the deployment creation try: ml_client.begin_create_or_update(deployment).wait() except Exception as err: raise RuntimeError( f"Deployment creation failed. 
Detailed Response:\n{err}" ) from err endpoint.traffic = {deployment_name: 100} endpoint_poller = ml_client.online_endpoints.begin_create_or_update(endpoint)

Wow!! Our endpoint is now deployed. Let's start testing it.

```python
endpoint_results = endpoint_poller.result()
endpoint_name = endpoint_results.name
keys = ml_client.online_endpoints.get_keys(name=endpoint_name)
primary_key = keys.primary_key
url = os.path.join(endpoint_results.scoring_uri, "v1")
endpoint_name = (
    endpoint_results.name if endpoint_name is None else endpoint_name
)
keys = ml_client.online_endpoints.get_keys(name=endpoint_name)
```

Once we have the API keys, we can use the OpenAI client to stream the tokens.

```python
from openai import OpenAI

vllm_client = OpenAI(base_url=url, api_key=primary_key)

# Create your prompt
system_message = """You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response."""

user_message = f"""A 3-week-old child has been diagnosed with late onset perinatal meningitis, and the CSF culture shows gram-positive bacilli. What characteristic of this bacterium can specifically differentiate it from other bacterial agents?"""

response = vllm_client.chat.completions.create(
    model=model_path,
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ],
    temperature=0.7,
    max_tokens=4000,
    stream=True,  # Stream the response
)

print("Streaming response:")
for chunk in response:
    delta = chunk.choices[0].delta
    if hasattr(delta, "content"):
        print(delta.content, end="", flush=True)
```

Conclusion
Fine-tuning the DeepSeek-R1-Distill-Llama-8B model with PyTorch FSDP and QLoRA on Azure Machine Learning offers a powerful approach to customising LLMs for specific tasks. By leveraging the scalability and efficiency of these techniques, you can unlock the full potential of LLMs and drive innovation in your respective domain. Hope you liked the blog. Do like the blog and follow me for more such content.
Thanks,
Manoranjan Rajguru
AI Global Black Belt

Scalable and Efficient Fine-Tuning of LLM on Azure ML
https://github.com/james-tn/llm-fine-tuning/tree/main/opensource_llm/single_step
Co-Author: Mohamad AL jazaery

Why Scalable and Efficient Fine-Tuning Matters
- Faster Iterations, Shorter Time-to-Value: In today's competitive AI landscape, time is of the essence. The faster you can fine-tune a model, the quicker you can validate ideas, test hypotheses, and bring solutions to market.
- High-performance GPU machines are costly: High-performance GPUs and compute clusters don't come cheap, and their availability is often limited. Efficient fine-tuning techniques, such as model sharding and distributed training, maximize the utilization of these precious resources—ensuring that you get the most out of your infrastructure investment.

Choosing the Right Azure ML GPU Compute for the Job: NC or ND?
Not all GPU computes are created equal, and choosing the right SKU can make or break your training efficiency.
- ND Series: Ideal for distributed training across multiple nodes, thanks to its Infiniband (IB) connectivity that ensures high-speed communication between nodes, such as pretraining an LLM or fine-tuning a very large model (~70B parameters).
- NC Series: Suited for small and medium workloads where no heavy interaction between nodes is needed, such as LLM inferencing or mid-size LLM fine-tuning.

Azure GPU Machine Options by Scenario:

| Scenario | Common model size | Training Approach | Recommended Azure Compute |
| --- | --- | --- | --- |
| Small-scale fine-tuning | < 3B parameters | Parameter-efficient tuning | NCas_T4_v3 (Tesla T4, 16 GB) |
| Medium-scale fine-tuning | 1–5B parameters | Full or parameter-efficient | NCs_v3 (Tesla V100, 16 GB) |
| Distributed training for medium models | 5–10B parameters | Full fine-tuning | ND_v2 (Tesla V100 NVLINK, 32 GB, InfiniBand) |
| Large-scale fine-tuning (single machine) | 10–30B parameters | Full or parameter-efficient | NC_A100_v4 (A100, 40 GB) |
| Distributed training for very large models | 20–70B parameters | Full fine-tuning | NDasrA100_v4 (A100, 80 GB, HDR InfiniBand) |
| Very large models training (single machine) | up to 70B parameters | Full or parameter-efficient | NCads_H100_v5 (H100 NVL, 94 GB) |
| Massive-scale distributed training | > 70B parameters | Full fine-tuning | ND-H100-v5 (H100, 80 GB, scale-out InfiniBand) |

Distributed Efficient Training: A Quick Guide
When scaling fine-tuning tasks, choosing the right distributed training method is key:
- DDP (Data Parallelism): Works well when the entire model fits on a single GPU. It replicates the model across multiple GPUs and splits the data for parallel processing. Check experiment 1 in the following section.
- Model Parallelism: A game-changer for massive models that don't fit on a single GPU. It shards not only the data but also the model parameters and optimizer states across multiple GPUs, enabling efficient training of models like LLaMA-70B on GPUs with limited memory. Both FSDP and DeepSpeed excel at implementing advanced forms of model parallelism and memory optimization.

Memory Optimization Techniques
- Gradient Checkpointing: Reduces memory by recomputing activations during the backward pass, trading memory for additional computation.
- Mixed Precision Training: Reduces memory usage by using FP16 or BF16 instead of FP32, accelerating training while maintaining numerical stability. Supported by both frameworks.
- Quantization (DeepSpeed Exclusive): Uses INT8 precision for weights and activations, dramatically reducing memory and compute requirements.
- Offloading (DeepSpeed Exclusive): Offloads optimizer states and model parameters to CPU or NVMe, freeing up GPU memory for computation.
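As a rough illustration of how several of these memory-saving levers are typically combined in a single Hugging Face training setup, here is a minimal sketch. It is not taken from the repository linked above; the model identifier is an illustrative choice, and it assumes transformers, peft, bitsandbytes, and accelerate are installed.

```python
# Minimal sketch (not from the linked repo): combining 4-bit quantization, LoRA adapters,
# gradient checkpointing, and BF16 mixed precision in one fine-tuning configuration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative choice; any causal LM works

# Quantization: load the frozen base weights in 4-bit NF4 to cut GPU memory significantly.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

# Parameter-efficient tuning: only small LoRA adapters are trained on top of the quantized base.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, target_modules="all-linear", task_type="CAUSAL_LM"))

# Gradient checkpointing and mixed precision are switched on through TrainingArguments.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # simulate a larger batch without extra memory
    gradient_checkpointing=True,     # recompute activations in the backward pass
    bf16=True,                       # BF16 mixed precision
    learning_rate=2e-4,
    num_train_epochs=1,
)
# `model` and `args` would then be handed to transformers.Trainer together with a tokenized dataset.
```

The experiments below use the same ideas, but launched through accelerate on Azure ML compute clusters.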
Our Experiments: Pushing the Limits of Scalability

Experiment 1: Distributed Training on Multiple Nodes using DDP
We conducted an experiment to fine-tune the Llama-3.1-8B model using LoRA (Low-Rank Adaptation) on Azure ML NDv2-V100 nodes. The goal was to evaluate the efficiency of fine-tuning across different numbers of nodes (1, 2, and 3) and observe the impact on training time and throughput.

Azure ML Job YAML Definition
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./  # Path to your training script and related files
inputs:
  model_dir:
    path: azureml://registries/azureml/models/mistralai-Mistral-7B-v01/versions/19
command: >
  accelerate launch
  --num_processes 16  # gpu per machine * num of machines
  --num_machines 2
  --machine_rank $NODE_RANK
  --main_process_ip $MASTER_ADDR
  --main_process_port $MASTER_PORT
compute: azureml:ndv2-cluster
resources:
  instance_count: 2  # Number of nodes for distributed training
distribution:
  type: pytorch
  process_count_per_instance: 1  # Number of processes per node
```

Results: As we increased the number of nodes from one to three, the throughput increased proportionally. This indicates that the system scaled efficiently with the addition of more nodes, maintaining a close-to-linear improvement in throughput.

Experiment 2: Model Parallelism using FSDP
Fine-tuning a 70B-parameter model on GPUs with only 16 GB of memory might sound impossible, but we made it happen using Fully Sharded Data Parallel (FSDP) on Azure ML with a cluster of multiple NDv2-V100 nodes. By distributing not only the data but also the model parameters and optimizer states across multiple nodes, we unlocked the power of full sharding.

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./  # Path to your training script and related files
inputs:
  model_dir:
    path: azureml://registries/azureml-meta/models/Llama-3.3-70B-Instruct/versions/4
command: >
  accelerate launch
  --config_file "configs/fsdp_config.yaml"
  --num_processes 32
  --num_machines 4
  --machine_rank $NODE_RANK
  --main_process_ip $MASTER_ADDR
  --main_process_port $MASTER_PORT
  train.py
compute: azureml:ndv2-cluster
resources:
  instance_count: 4  # Number of nodes for distributed training
distribution:
  type: pytorch
  process_count_per_instance: 1  # Number of processes per node
```

Key Takeaways:
- Memory Efficiency: Full sharding enabled us to fine-tune the LLaMA-70B model on V100 GPUs despite their limited memory.
- Connectivity Matters: The Infiniband (IB) connectivity of ND nodes played a critical role in ensuring smooth communication across GPUs, making this feat possible.

Conclusion
Scalable and efficient fine-tuning is the key to unlocking the true potential of Large Language Models. By leveraging distributed training techniques, such as FSDP and DDP, and optimizing compute resources on Azure ML, researchers and practitioners can overcome the challenges of training massive models—reducing costs, accelerating time-to-value, and driving AI innovation. Access the code and start experimenting here!

Future work: The second part will focus on real-world pipeline setups, including end-to-end model training, hyperparameter optimization, and testing. The third part will dive into deploying trained models for practical use. Future posts may explore best practices for specific fine-tuning scenarios and techniques.

Unlocking Function Calling with vLLM and Azure Machine Learning
Introduction In this post, we’ll explain how to deploy LLMs on vLLM using Azure Machine Learning’s Managed Online Endpoints for efficient, scalable, and secure real-time inference. Next, we will look at function calling, and how vLLM's engine can support you to achieve that. To get started, let’s briefly look into what vLLM and Managed Online Endpoints are. You can find the full code examples on vllm-on-azure-machine-learning. vLLM vLLM is a high-throughput and memory-efficient inference and serving engine designed for large language models (LLMs). It optimizes the serving and execution of LLMs by utilizing advanced memory management techniques, such as PagedAttention, which efficiently manages attention key and value memory. This allows for continuous batching of incoming requests and fast model execution, making vLLM a powerful tool for deploying and serving LLMs at scale. vLLM supports seamless integration with popular Hugging Face models and offers various decoding algorithms, including parallel sampling and beam search. It also supports tensor parallelism and pipeline parallelism for distributed inference, making it a flexible and easy-to-use solution for LLM inference (see full docs). Managed Online Endpoints in Azure Machine Learning Managed Online Endpoints in Azure Machine Learning provide a streamlined and scalable way to deploy machine learning models for real-time inference. These endpoints handle the complexities of serving, scaling, securing, and monitoring models, allowing us to focus on building and improving your models without worrying about infrastructure management. HuggingFace Model Deployment Let’s go through deploying a HuggingFace model on Azure Machine Learning’s Managed Online Endpoints. For this, we’ll use a custom Dockerfile and configuration files to set up the deployment. As a model, we’ll be using meta-llama/Llama-3.1-8B-Instruct on a single Standard_NC24ads_A100_v4 instance. Step 1: Create a custom Environment for vLLM on AzureML First, we create a Dockerfile to define the environment for our model. For this, we’ll be using vllm’s base container that has all the dependencies and drivers included: FROM vllm/vllm-openai:latest ENV MODEL_NAME facebook/opt-125m ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server --model $MODEL_NAME $VLLM_ARGS The idea here is that we can pass a model name via an ENV variable, so that we can easily define which model we want to deploy during deployment time. Next, we log into our Azure Machine Learning workspace: az account set --subscription <subscription ID> az configure --defaults workspace=<Azure Machine Learning workspace name> group=<resource group> Now, we create an environment.yml file to specify the environment settings: $schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json name: vllm build: path: . dockerfile_path: Dockerfile Then let’s build the environment: az ml environment create -f environment.yml Step 2: Deploy the AzureML Managed Online Endpoint Time for deployment, so let’s first create an endpoint.yml file to define the Managed Online Endpoint: $schema: https://azuremlsdk2.blob.core.windows.net/latest/managedOnlineEndpoint.schema.json name: vllm-hf auth_mode: key Let’s create it: az ml online-endpoint create -f endpoint.yml For the next step, we’ll need the address of the Docker image address we created. 
We can quickly get it from AzureML Studio -> Environments -> vllm: Finally, we create a `deployment.yml file to configure the deployment settings and deploy our desired model from HuggingFace via vLLM: $schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json name: current endpoint_name: vllm-hf environment_variables: MODEL_NAME: meta-llama/Llama-3.1-8B-Instruct # define the model name using the identifier from HG VLLM_ARGS: "--enable-auto-tool-choice --tool-call-parser llama3_json" HUGGING_FACE_HUB_TOKEN: xxxxxxxxxxxxxx # Add your HF API key here environment: image: xxxxxxxxx.azurecr.io/azureml/azureml_xxxxxxxxxxx # Replace with your own image inference_config: liveness_route: port: 8000 path: /ping readiness_route: port: 8000 path: /health scoring_route: port: 8000 path: / instance_type: Standard_NC24ads_A100_v4 instance_count: 1 request_settings: request_timeout_ms: 60000 max_concurrent_requests_per_instance: 16 liveness_probe: initial_delay: 10 period: 10 timeout: 2 success_threshold: 1 failure_threshold: 30 readiness_probe: initial_delay: 120 period: 10 timeout: 2 success_threshold: 1 failure_threshold: 30 Since vLLM does not support separate probes for readiness and liveness, we’ll need to make sure that the model has fully loaded before the fire the first probe. This is why we increased readiness_probe.initial_delay to 120s. For larger models, we should also follow vLLM’s documentation for using tensor parallel inference (model on single node but spanning multiple GPUs) by adding --tensor-parallel-size <NUM_OF_GPUs> to VLLM_ARGS. Since we’re using a single A100 GPU in our example (Standard_NC24ads_A100_v4), this is not required though. The request_settings depend a bit on our instance type/size and might require some manual tuning to get the model run properly and efficiently. Goal is to find a good tradeoff between concurrency (max_concurrent_requests_per_instance) and queue time in order to avoid either hitting request_timeout_ms from the endpoint side, or any HTTP-timeouts on the client side. Both these scenarios result in HTTP 429, and the client would need to implement exponential backoff (e.g. via tenacity library). Lastly, we can deploy the model: az ml online-deployment create -f deployment.yml --all-traffic By following these steps, we have deployed a HuggingFace model on Azure Machine Learning’s Managed Online Endpoints, ensuring efficient and scalable real-time inference. Time to test it! Step 3: Testing the deployment# First, let’s get the endpoint’s scoring uri and the api keys: az ml online-endpoint show -n vllm-hf az ml online-endpoint get-credentials -n vllm-hf For completion models, we can then call the endpoint using this Python code snippet: import requests url = "https://vllm-hf.polandcentral.inference.ml.azure.com/v1/completions" headers = { "Content-Type": "application/json", "Authorization": "Bearer xxxxxxxxxxxx" } data = { "model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "San Francisco is a", "max_tokens": 200, "temperature": 0.7 } response = requests.post(url, headers=headers, json=data) print(response.json()) Response: { "id": "cmpl-98d658cf-6310-4c87-a24f-723dda6db176", "object": "text_completion", "created": 1738267352, "model": "meta-llama/Llama-3.1-8B-Instruct", "choices": [ { "index": 0, "text": " top tourist destination known for its iconic Golden Gate Bridge, steep hills, vibrant neighborhoods, and cultural attractions. 
The city is a haven for foodies, with a diverse range of cuisines available, from seafood to Mexican to Chinese and more.\nOne of the best ways to experience San Francisco is by taking a ride on a historic cable car, which offers stunning views of the city and its surroundings. Explore the historic Fisherman's Wharf, a bustling waterfront district filled with seafood restaurants, street performers, and souvenir shops.\nVisit the vibrant neighborhoods of Haight-Ashbury and the Mission District, known for their colorful street art, independent shops, and lively music scenes. Take a stroll through Golden Gate Park, a sprawling urban park that features gardens, lakes, and walking and biking trails.\n\nThe city has a thriving arts and culture scene, with numerous museums, galleries, and performance venues. The San Francisco Museum of Modern Art (SFMOMA) is one of the largest modern art museums in", "logprobs": null, "finish_reason": "length", "stop_reason": null, "prompt_logprobs": null } ], "usage": { "prompt_tokens": 5, "total_tokens": 205, "completion_tokens": 200, "prompt_tokens_details": null } } Works! Function Calling Function calling in the context of large language models (LLMs) refers to the model's ability to dynamically generate and call structured functions based on context, user input, or specific task requirements. It enables seamless interaction with APIs, databases, or external tools while leveraging the model's reasoning capabilities. vLLM provides an OpenAI-compatible server that supports the Completions, Chat Completions, and Embeddings APIs. For instance, it enables developers to seamlessly integrate models into existing workflows. Developers can use the official OpenAI Python client or any HTTP client to interact with vLLM, making it straightforward to integrate into existing workflows. Before running the code, ensure you have the OpenAI library installed by executing: pip install openai The following code demonstrates the function-calling capabilities of vLLM using an example where the assistant retrieves information about historical events based on a provided date: Lets go through it step by step 1. Defining a Custom Function: A query_historical_event function is defined, containing a dictionary of fictional historical events. This function serves as a callable endpoint for vLLM to retrieve information based on a user-specified date. 
def query_historical_event(date): fictional_historical_events = { "1805-03-21": "On March 21, 1805, the Treaty of Varis signed by several European powers established the first coordinated effort to protect migratory bird species.", "1898-07-10": "On July 10, 1898, the Great Illumination Act was passed in London, mandating the installation of electric streetlights across all major cities in the United Kingdom.", "1923-09-05": "On September 5, 1923, the International Academy of Innovation was founded in Zurich, Switzerland, promoting global collaboration in scientific research.", "1940-02-14": "On February 14, 1940, the first underwater train tunnel connecting two countries was completed between France and the United Kingdom.", "1954-11-08": "On November 8, 1954, the Global Weather Watch Program was launched, pioneering the use of satellites for monitoring Earth's climate systems.", "1977-06-30": "On June 30, 1977, the first fully solar-powered town, Solaria, was inaugurated in Arizona, setting a benchmark for renewable energy communities.", "1983-12-12": "On December 12, 1983, the Universal Language Project introduced a simplified global auxiliary language intended to foster cross-cultural communication.", "1994-04-23": "On April 23, 1994, the Oceanic Research Pact was signed, marking a commitment by 40 nations to share oceanographic research and preserve marine ecosystems.", "2009-08-15": "On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours.", "2020-01-10": "On January 10, 2020, the World Clean Air Initiative achieved its milestone goal of reducing urban air pollution levels in 50 major cities globally." } return fictional_historical_events.get(date, f"No historical event information available for {date}.") 2. Tool Integration: The function is wrapped in a tools definition, which includes metadata such as the function’s name, description, and expected parameters (e.g., the date in YYYY-MM-DD format). tools = [ { "function": { "name": "query_historical_event", "description": "Provides information about a historical event that occurred on a specified date.", "parameters": { "type": "object", "properties": { "date": { "type": "string", "description": "The date of the event in YYYY-MM-DD format." }, }, "required": ["date"] } } } ] 3. Conversation Workflow: The conversation starts with a system message setting the assistant's role and a user query about a specific date. The assistant evaluates the query and decides if the custom function is needed. messages = [ {"role": "system", "content": "You are a knowledgeable assistant that can retrieve information about historical events."}, {"role": "user", "content": "Can you tell me what happened on August 15, 2009?"}, ] 4. Function Call Handling: If the assistant determines that the function is required, it: Parses the function call and extracts the necessary parameters (e.g., date). Executes the query_historical_event function with the provided arguments and returns the result to the user. 
chat_response = client.chat.completions.create( model="meta-llama/Llama-3.1-8B-Instruct", messages=messages, temperature=0.7, max_tokens=1024, top_p=0.9, frequency_penalty=0.5, presence_penalty=0.6, tools=tools, tool_choice='auto' ) if chat_response.choices[0].message.tool_calls: date_argument = json.loads( chat_response.choices[0].message.tool_calls[0].function.arguments) date = date_argument.get("date", None) response = query_historical_event(date) print("Assistant response:", response) else: print("Assistant response:", chat_response.choices[0].message.content) Example Workflow User Query: "Can you tell me what happened on August 15, 2009?" Assistant Function Call: The assistant identifies the query’s intent and calls query_historical_event with the argument date="2009-08-15". Response: The function retrieves the event: "On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours." Full Code from openai import OpenAI import json # Set up API client with the vLLM server settings openai_api_key = <your-deployment-key> openai_api_base = "https://vllm-hf.eastus2.inference.ml.azure.com/v1/" client = OpenAI(api_key=openai_api_key, base_url=openai_api_base) def query_historical_event(date): fictional_historical_events = { "1805-03-21": "On March 21, 1805, the Treaty of Varis signed by several European powers established the first coordinated effort to protect migratory bird species.", "1898-07-10": "On July 10, 1898, the Great Illumination Act was passed in London, mandating the installation of electric streetlights across all major cities in the United Kingdom.", "1923-09-05": "On September 5, 1923, the International Academy of Innovation was founded in Zurich, Switzerland, promoting global collaboration in scientific research.", "1940-02-14": "On February 14, 1940, the first underwater train tunnel connecting two countries was completed between France and the United Kingdom.", "1954-11-08": "On November 8, 1954, the Global Weather Watch Program was launched, pioneering the use of satellites for monitoring Earth's climate systems.", "1977-06-30": "On June 30, 1977, the first fully solar-powered town, Solaria, was inaugurated in Arizona, setting a benchmark for renewable energy communities.", "1983-12-12": "On December 12, 1983, the Universal Language Project introduced a simplified global auxiliary language intended to foster cross-cultural communication.", "1994-04-23": "On April 23, 1994, the Oceanic Research Pact was signed, marking a commitment by 40 nations to share oceanographic research and preserve marine ecosystems.", "2009-08-15": "On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours.", "2020-01-10": "On January 10, 2020, the World Clean Air Initiative achieved its milestone goal of reducing urban air pollution levels in 50 major cities globally." } return fictional_historical_events.get(date, f"No historical event information available for {date}.") tools = [ { "function": { "name": "query_historical_event", "description": "Provides information about a historical event that occurred on a specified date.", "parameters": { "type": "object", "properties": { "date": { "type": "string", "description": "The date of the event in YYYY-MM-DD format." 
}, }, "required": ["date"] } } } ] messages = [ {"role": "system", "content": "You are a knowledgeable assistant that can retrieve information about historical events."}, {"role": "user", "content": "Can you tell me what happened on August 15, 2009?"}, ] chat_response = client.chat.completions.create( model="meta-llama/Llama-3.1-8B-Instruct", messages=messages, temperature=0.7, max_tokens=1024, top_p=0.9, frequency_penalty=0.5, presence_penalty=0.6, tools=tools, tool_choice='auto' ) if chat_response.choices[0].message.tool_calls: date_argument = json.loads(chat_response.choices[0].message.tool_calls[0].function.arguments) date = date_argument.get("date", None) response = query_historical_event(date) print("Assistant response:", response) else: print("Assistant response:", chat_response.choices[0].message.content)

Response:
Tool has been called with date: 2009-08-15
Assistant response: On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours

You've successfully implemented function calling using your deployed Llama-3.1-8B model.

Conclusion
To wrap up, deploying large language models on vLLM with Azure Machine Learning Managed Online Endpoints is a simple and effective way to enable real-time AI-powered applications. By following the steps shared—from setting up the environment to testing the deployment—you can quickly integrate advanced models like Llama-3.1-8B-Instruct into your workflows. With vLLM's optimized performance and support for function calling, your applications can handle complex tasks and interact with other systems seamlessly. This setup helps you build smarter, faster, and more scalable AI solutions.