Fine-Tuning DeepSeek-R1-Distill-Llama-8B with PyTorch FSDP, QLoRA on Azure Machine Learning
Large Language Models (LLMs) have demonstrated remarkable capabilities across various industries, revolutionizing how we approach tasks like legal document summarization, creative content generation, and customer sentiment analysis. However, adapting these general-purpose models to excel in specific domains often requires fine-tuning, which lets us tailor LLMs to meet unique requirements and improve their performance on targeted tasks. In this blog post, we'll explore the process of fine-tuning the DeepSeek-R1-Distill-Llama-8B model, highlighting the advantages of using PyTorch Fully Sharded Data Parallel (FSDP) and Quantization-Aware Low-Rank Adaptation (QLoRA) in conjunction with the Azure Machine Learning platform.

Why Fine-Tuning Matters

In some cases, LLMs may not perform well on specific domains, tasks, or datasets, or may produce inaccurate or misleading outputs. In such cases, fine-tuning the model can be a useful technique to adapt it to the desired goal and improve its quality and reliability.

- Hallucinations: Hallucinations are untrue statements output by the model. They can harm the credibility and trustworthiness of your application. One possible mitigation is fine-tuning the model with data that contains accurate and consistent information.
- Accuracy and quality problems: Pre-trained models may not achieve the desired level of accuracy or quality for a specific task or domain. This shortfall can be due to a mismatch between the pre-training data and the target data, the diversity and complexity of the target data, and/or incorrect evaluation metrics and criteria.

DeepSeek-R1 is an open-source language model excelling in text-based tasks, including creative writing, question answering, editing, and summarization. It's particularly strong in reasoning-intensive tasks like coding, math, and explaining scientific concepts. DeepSeek-R1 stands out due to its mixture-of-experts (MoE) architecture and use of reinforcement learning, achieving high performance with greater efficiency and lower cost compared to other models. It has 671 billion parameters across multiple expert networks, but only 37 billion are activated for a single forward pass. DeepSeek-R1 uses reinforcement learning (RL) to generate a chain of thought (CoT) before delivering its final answer. To make these capabilities more accessible, DeepSeek has distilled its R1 outputs into several smaller models based on the Qwen and Llama architectures:

- Qwen-based distilled models: 1.5B, 7B, 14B and 32B
- Llama-based distilled models: 8B and 70B

DeepSeek-R1-Distill-Llama-8B is a distilled large language model (LLM) based on the Llama architecture, created using outputs from the larger DeepSeek-R1 model. Through knowledge distillation, the reasoning patterns of the larger 671-billion-parameter DeepSeek-R1 model are transferred into a smaller, more efficient model. With only 8 billion parameters, DeepSeek-R1-Distill-Llama-8B is computationally efficient while retaining a significant portion of the original model's performance. It is fine-tuned from Llama-3.1-8B-Instruct and achieves high performance across multiple benchmarks. This distilled model offers a balance of performance and resource requirements, improving inference speed and reducing computational costs, making it cost-effective for production deployments.
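Before walking through the fine-tuning pipeline, it can help to sanity-check the base distilled model on a few domain prompts. The snippet below is a minimal, illustrative sketch (not part of the original walkthrough) that loads deepseek-ai/DeepSeek-R1-Distill-Llama-8B with Hugging Face transformers in 4-bit precision and generates a response; it assumes a CUDA GPU and that the transformers, accelerate, and bitsandbytes packages are installed.

```python
# Minimal baseline check (illustrative sketch, not the blog's training code).
# Assumes: CUDA GPU, transformers + accelerate + bitsandbytes installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Ask a reasoning question and inspect the chain-of-thought quality before fine-tuning.
messages = [{"role": "user", "content": "A patient presents with fever and a stiff neck. Reason step by step about the most likely diagnosis."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

If the base model already struggles with this style of clinical reasoning, that is a good indication that fine-tuning on a domain dataset, as done below, will pay off.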
PyTorch FSDP: Scaling Fine-Tuning with Data Parallelism

PyTorch Fully Sharded Data Parallel (FSDP) is a distributed training framework that addresses the challenges of fine-tuning large models by sharding model parameters, optimizer states, and gradients across multiple GPUs. This enables you to train models with billions of parameters on systems with limited GPU memory.

QLoRA: Efficient Fine-Tuning with Quantization and Low-Rank Adaptation

Quantization-Aware Low-Rank Adaptation (QLoRA) is a parameter-efficient fine-tuning technique that reduces memory usage and accelerates training by quantizing the base model weights and using Low-Rank Adaptation (LoRA) to train only a small subset of the model's parameters, making training faster and more memory efficient.

Azure Machine Learning: Your Platform for Scalable Fine-Tuning

Azure Machine Learning provides a robust platform for fine-tuning LLMs, offering a comprehensive suite of tools and services to streamline the process.

- Scalable compute: Azure Machine Learning Compute provides virtual machines (VMs) that run parts of the distributed deep learning job, auto-scaling as necessary. Azure Machine Learning compute clusters can schedule tasks, collect results, adjust resources to actual loads, and manage errors. VMs that participate in the cluster can be GPU-enabled to accelerate deep learning calculations.
- Data storage: Azure offers standard and premium blob storage options for storing training data and execution logs. Premium blob storage enables high-performance access to training data during model training, which is needed for distributed training.
- Experiment tracking: Azure Machine Learning provides tools for tracking and managing your fine-tuning experiments, allowing you to monitor performance metrics and reproduce your results.

Hands-on lab

Now let's fine-tune the model and deploy it on Azure Machine Learning. First, we set up an Azure Machine Learning (ML) client using DefaultAzureCredential for authentication; the script imports the necessary libraries and handles exceptions during ML client initialization.

# import required libraries """ This script sets up an Azure Machine Learning (ML) client using the DefaultAzureCredential for authentication. It imports necessary libraries and handles exceptions during the ML client initialization. Modules imported: - time: Provides various time-related functions. - azure.identity: Provides authentication capabilities with DefaultAzureCredential and InteractiveBrowserCredential. - azure.ai.ml: Contains classes and functions for interacting with Azure ML services, including MLClient, Input, pipeline, load_component, command, Data, Environment, BuildContext, Model, Output, and AssetTypes. - azure.core.exceptions: Contains exceptions for handling resource-related errors. - os: Provides a way to interact with the operating system. Variables: - credential: An instance of DefaultAzureCredential used for authenticating with Azure services. - ml_client: An instance of MLClient initialized using the provided credentials. If the initialization fails, an exception is caught and printed.
""" import time from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential from azure.ai.ml import MLClient, Input from azure.ai.ml.dsl import pipeline from azure.ai.ml import load_component from azure.ai.ml import command from azure.ai.ml.entities import Data, Environment, BuildContext from azure.ai.ml.entities import Model from azure.ai.ml import Input from azure.ai.ml import Output from azure.ai.ml.constants import AssetTypes from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError import os credential = DefaultAzureCredential() ml_client = None try: ml_client = MLClient.from_config(credential) except Exception as ex: print(ex) Now lets install some libraries required to download the dataset and run the openai client. %conda run -n azureml_py310_sdkv2 pip install datasets==3.2.0 openai Lets create our training environment. os.makedirs("environment_train", exist_ok=True) Lets build our docker environment. %%writefile environment_train/Dockerfile FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch22x:biweekly.202501.3 USER root # support Deepspeed launcher requirement of passwordless ssh login RUN apt-get update && apt-get -y upgrade RUN pip install --upgrade pip RUN apt-get install -y openssh-server openssh-client # Install pip dependencies COPY requirements.txt . RUN pip install -r requirements.txt --no-cache-dir RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation Let’s also specify our requirements.txt %%writefile environment_train/requirements.txt transformers==4.48.2 peft==0.14.0 accelerate==1.3.0 bitsandbytes==0.45.1 datasets==3.2.0 evaluate==0.4.3 huggingface_hub[hf_transfer] safetensors>=0.5.2 sentencepiece==0.2.0 scikit-learn==1.6.1 tokenizers>=0.21.0 py7zr Once we specify both lets create the AML custom training environment. env_name = "deepseek-training" env_docker_image = Environment( build=BuildContext(path = "environment_train", dockerfile_path="Dockerfile"), name=env_name, description="Environment created for llm fine-tuning.", version="1" ) env_asset_train = ml_client.environments.create_or_update(env_docker_image) While the training environment is ready let’s start with the dataset preparation. from datasets import load_dataset import pandas as pd dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en") df = pd.DataFrame(dataset['train']) df = df.iloc[0:2000] df.head() Here is quick snapshot of what the dataset looks like Noe lets split the dataset into train and test for validation. from sklearn.model_selection import train_test_split train, test = train_test_split(df, test_size=0.1, random_state=42) print("Number of train elements: ", len(train)) print("Number of test elements: ", len(test)) Let’s create the prompt template to run the finetuning process. In this case we have used COT prompt template. # custom instruct prompt start prompt_template = f""" <|begin▁of▁sentence|> You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response. 
<|User|> {{question}} <|Assistant|> <think> {{complex_cot}} </think> {{answer}} <|end▁of▁sentence|> """ # template dataset to add prompt to each sample def template_dataset(sample): sample["text"] = prompt_template.format(question=sample["Question"], complex_cot=sample["Complex_CoT"], answer=sample["Response"]) return sample Let’s run the mapping of this prompt through the whole dataset and create train and test jsonl files.. from datasets import Dataset, DatasetDict from random import randint train_dataset = Dataset.from_pandas(train) test_dataset = Dataset.from_pandas(test) dataset = DatasetDict({"train": train_dataset, "test": test_dataset}) train_dataset = dataset["train"].map(template_dataset, remove_columns=list(dataset["train"].features)) print(train_dataset[randint(0, len(dataset))]["text"]) test_dataset = dataset["test"].map(template_dataset, remove_columns=list(dataset["test"].features)) train_dataset.to_json(f"data/train.jsonl") test_dataset.to_json(f"data/eval.jsonl") Now let’s start creating our training script. os.makedirs("src_train", exist_ok=True) write the train.py which uses both Qlora and PyTorch FSDP. %%writefile src_train/train.py import os import argparse import sys import logging from accelerate import Accelerator import datetime from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training import torch from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, set_seed import transformers import traceback from huggingface_hub import snapshot_download from datasets import load_dataset def download_model(model_name): print("Downloading model ", model_name) os.makedirs("/tmp/tmp_folder", exist_ok=True) snapshot_download(repo_id=model_name, local_dir="/tmp/tmp_folder") print(f"Model {model_name} downloaded under /tmp/tmp_folder") def init_distributed(): # Initialize the process group torch.distributed.init_process_group( backend="nccl", # Use "gloo" backend for CPU timeout=datetime.timedelta(seconds=5400) ) local_rank = int(os.environ["LOCAL_RANK"]) torch.cuda.set_device(local_rank) return local_rank def main(args): model_name = args.model_name_or_path train_ds = load_dataset('json', data_files=args.train_file, split='train') test_ds = load_dataset('json', data_files=args.eval_file, split='train') per_device_train_batch_size=args.train_batch_size per_device_eval_batch_size=args.eval_batch_size gradient_accumulation_steps=args.grad_accum_steps learning_rate=args.learning_rate num_train_epochs=args.epochs lora_r=8 lora_alpha=16 lora_dropout=0.1 fsdp="full_shard auto_wrap offload" fsdp_config={ 'backward_prefetch': 'backward_pre', 'cpu_ram_efficient_loading': True, 'offload_params': True, 'forward_prefetch': False, 'use_orig_params': False } gradient_checkpointing=False merge_weights=True seed=42 token=None model_dir = args.model_dir if torch.cuda.is_available() and (torch.cuda.device_count() > 1 or int(os.environ.get("SM_HOST_COUNT", 1)) > 1): # Call this function at the beginning of your script local_rank = init_distributed() # Now you can use distributed functionalities torch.distributed.barrier(device_ids=[local_rank]) os.environ.update({"HF_HUB_ENABLE_HF_TRANSFER": "1"}) set_seed(seed) accelerator = Accelerator() if token is not None: os.environ.update({"HF_TOKEN": token}) accelerator.wait_for_everyone() if int(os.environ.get("SM_HOST_COUNT", 1)) == 1: if accelerator.is_main_process: download_model(model_name) else: download_model(model_name) accelerator.wait_for_everyone() model_name = "/tmp/tmp_folder" 
tokenizer = AutoTokenizer.from_pretrained(model_name) # Set Tokenizer pad Token tokenizer.pad_token = tokenizer.eos_token with accelerator.main_process_first(): # tokenize and chunk dataset lm_train_dataset = train_ds.map( lambda sample: tokenizer(sample["text"]), remove_columns=list(train_ds.features) ) print(f"Total number of train samples: {len(lm_train_dataset)}") if test_ds is not None: lm_test_dataset = test_ds.map( lambda sample: tokenizer(sample["text"]), remove_columns=list(test_ds.features) ) print(f"Total number of test samples: {len(lm_test_dataset)}") else: lm_test_dataset = None torch_dtype = torch.bfloat16 # Defining additional configs for FSDP if fsdp != "" and fsdp_config is not None: bnb_config_params = { "bnb_4bit_quant_storage": torch_dtype } model_configs = { "torch_dtype": torch_dtype } fsdp_configurations = { "fsdp": fsdp, "fsdp_config": fsdp_config, "gradient_checkpointing_kwargs": { "use_reentrant": False }, "tf32": True } else: bnb_config_params = dict() model_configs = dict() fsdp_configurations = dict() bnb_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch_dtype, **bnb_config_params ) model = AutoModelForCausalLM.from_pretrained( model_name, trust_remote_code=True, quantization_config=bnb_config, attn_implementation="flash_attention_2", use_cache=not gradient_checkpointing, cache_dir="/tmp/.cache", **model_configs ) if fsdp == "" and fsdp_config is None: model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=gradient_checkpointing) if gradient_checkpointing: model.gradient_checkpointing_enable() config = LoraConfig( r=lora_r, lora_alpha=lora_alpha, target_modules="all-linear", lora_dropout=lora_dropout, bias="none", task_type="CAUSAL_LM" ) model = get_peft_model(model, config) trainer = transformers.Trainer( model=model, train_dataset=lm_train_dataset, eval_dataset=lm_test_dataset if lm_test_dataset is not None else None, args=transformers.TrainingArguments( per_device_train_batch_size=per_device_train_batch_size, per_device_eval_batch_size=per_device_eval_batch_size, gradient_accumulation_steps=gradient_accumulation_steps, gradient_checkpointing=gradient_checkpointing, logging_strategy="steps", logging_steps=1, log_on_each_node=False, num_train_epochs=num_train_epochs, learning_rate=learning_rate, bf16=True, ddp_find_unused_parameters=False, save_strategy="no", output_dir="outputs", **fsdp_configurations ), data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False), ) if trainer.accelerator.is_main_process: trainer.model.print_trainable_parameters() trainer.train() if trainer.is_fsdp_enabled: trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT") if merge_weights: output_dir = "/tmp/model" # merge adapter weights with base model and save # save int 4 model trainer.model.save_pretrained(output_dir, safe_serialization=False) if accelerator.is_main_process: # clear memory del model del trainer torch.cuda.empty_cache() # load PEFT model model = AutoPeftModelForCausalLM.from_pretrained( output_dir, torch_dtype=torch.float16, low_cpu_mem_usage=True, trust_remote_code=True, ) # Merge LoRA and base model and save model = model.merge_and_unload() model.save_pretrained( model_dir, safe_serialization=True, max_shard_size="2GB" ) else: trainer.model.save_pretrained( model_dir, safe_serialization=True ) if accelerator.is_main_process: tokenizer.save_pretrained(model_dir) accelerator.wait_for_everyone() def parse_args(): # setup 
argparse parser = argparse.ArgumentParser() # curr_time = datetime.now().strftime("%Y-%m-%d_%H:%M:%S") # hyperparameters parser.add_argument("--model_name_or_path", default="deepseek-ai/DeepSeek-R1-Distill-Llama-8B", type=str, help="Input directory for training") parser.add_argument("--train_file", type=str, help="Input data for training") parser.add_argument("--eval_file", type=str, help="Input data for eval") parser.add_argument("--epochs", default=1, type=int, help="number of epochs") parser.add_argument("--train_batch_size", default=2, type=int, help="training - mini batch size for each gpu/process") parser.add_argument("--eval_batch_size", default=4, type=int, help="evaluation - mini batch size for each gpu/process") parser.add_argument("--grad_accum_steps", default=4, type=int, help="gradient accumulation steps") parser.add_argument("--learning_rate", default=2e-4, type=float, help="learning rate") parser.add_argument("--save_merged_model", type=bool, default=False) parser.add_argument("--model_dir", type=str, default="./", help="output directory for model") # parse args args = parser.parse_args() # return args return args if __name__ == "__main__": #sys.argv = [''] args = parse_args() main(args) Next step is to create a compute cluster on which the training will run. azure_compute_cluster_name = "a100-compute" azure_compute_cluster_size = "Standard_NC24ads_A100_v4" USE_LOWPRIORITY_VM = True from azure.ai.ml.entities import AmlCompute ### Create the compute cluster try: compute = ml_client.compute.get(azure_compute_cluster_name) except Exception as ex: try: tier = "LowPriority" if USE_LOWPRIORITY_VM else "Dedicated" compute = AmlCompute( name=azure_compute_cluster_name, size=azure_compute_cluster_size, tier=tier, max_instances=1, # For multi node training set this to an integer value more than 1 ) ml_client.compute.begin_create_or_update(compute).wait() except Exception as e: print(e) Once the compute is ready, lets run the training job. from azure.ai.ml import command from azure.ai.ml import Input from azure.ai.ml.entities import ResourceConfiguration str_command = "" str_command += "python train.py --train_file ${{inputs.train_file}} --eval_file ${{inputs.eval_file}} \ --epochs ${{inputs.epoch}} --train_batch_size ${{inputs.train_batch_size}} \ --eval_batch_size ${{inputs.eval_batch_size}} --model_name_or_path ${{inputs.model_name_or_path}} \ --model_dir ${{inputs.model_dir}} --save_merged_model ${{inputs.save_merged_model}}" job = command( inputs=dict( train_file=Input( type="uri_file", path="data/train.jsonl", ), eval_file=Input( type="uri_file", path="data/eval.jsonl", ), epoch=1, train_batch_size=2, eval_batch_size=1, model_name_or_path="deepseek-ai/DeepSeek-R1-Distill-Llama-8B", model_dir="./outputs", save_merged_model = True ), code="./src_train", # local path where the code is stored compute=azure_compute_cluster_name, command=str_command, environment=env_asset_train, distribution={ "type": "PyTorch", "process_count_per_instance": 1, # For multi-gpu training set this to an integer value more than 1 }, ) returned_job = ml_client.jobs.create_or_update(job) ml_client.jobs.stream(returned_job.name) Once the training is completed, lets register the model as a custom model type. 
from azure.ai.ml.entities import Model from azure.ai.ml.constants import AssetTypes run_model = Model( path=f"azureml://jobs/{returned_job.name}/outputs/artifacts/paths/outputs/", name="deepseekr1-dist-llama8bft", description="Model created from run.", type=AssetTypes.CUSTOM_MODEL, ) model = ml_client.models.create_or_update(run_model) Once the model is registered the next step is to deploy the same as Online Managed Endpoint. from azure.ai.ml.entities import ( ManagedOnlineEndpoint, IdentityConfiguration, ManagedIdentityConfiguration, ) endpoint_name = "deepseekr1-dist-llama8bft-ep" # Check if the endpoint already exists in the workspace try: endpoint = ml_client.online_endpoints.get(endpoint_name) print("---Endpoint already exists---") except: # Create an online endpoint if it doesn't exist # Define the endpoint endpoint = ManagedOnlineEndpoint( name=endpoint_name, description=f"Test endpoint for {model.name}" ) # Trigger the endpoint creation try: ml_client.begin_create_or_update(endpoint).wait() print("\n---Endpoint created successfully---\n") except Exception as err: raise RuntimeError( f"Endpoint creation failed. Detailed Response:\n{err}" ) from err Let’s define the deployment name , SKU type of the VM and Request timeout parameter. # Initialize deployment parameters deployment_name = "deepseekr1-dist-llama8bftd-eploy" sku_name = "Standard_NC24ads_A100_v4" REQUEST_TIMEOUT_MS = 90000 os.makedirs("environment_inf", exist_ok=True) Lets create the environment for our inference . %%writefile environment_inf/Dockerfile FROM vllm/vllm-openai:latest ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server --model $MODEL_NAME $VLLM_ARGS Let’s build the environment with the docker file created above. from azure.ai.ml.entities import Environment, BuildContext env_docker_image = Environment( build=BuildContext(path="environment_inf", dockerfile_path= "Dockerfile"), name="vllm-custom", description="Environment created from a Docker context.", inference_config={ "liveness_route": { "port": 8000, "path": "/health", }, "readiness_route": { "port": 8000, "path": "/health", }, "scoring_route": { "port": 8000, "path": "/", }, }, ) env_asset_inf = ml_client.environments.create_or_update(env_docker_image) Once our environment for inference server is ready let’s do the deployment. Lets define some environment variables model_path = f"/var/azureml-app/azureml-models/{model.name}/{model.version}/outputs" env_vars = { "MODEL_NAME": model_path, "VLLM_ARGS": "--max-model-len 16000 --enforce-eager", } deployment_env_vars = {**env_vars} Lets do the deployment now. import time from azure.ai.ml.entities import ( OnlineRequestSettings, CodeConfiguration, ManagedOnlineDeployment, ProbeSettings, Environment ) t0 = time.time() deployment = ManagedOnlineDeployment( name= deployment_name, endpoint_name=endpoint_name, model=model, instance_type=sku_name, instance_count=1, environment_variables=deployment_env_vars, environment=env_asset_inf, request_settings=OnlineRequestSettings( max_concurrent_requests_per_instance=2, request_timeout_ms=50000, max_queue_wait_ms=60000 ), liveness_probe=ProbeSettings( failure_threshold=5, success_threshold=1, timeout=10, period=30, initial_delay=120 ), readiness_probe=ProbeSettings( failure_threshold=30, success_threshold=1, timeout=2, period=10, initial_delay=120, ), ) # Trigger the deployment creation try: ml_client.begin_create_or_update(deployment).wait() except Exception as err: raise RuntimeError( f"Deployment creation failed. 
Detailed Response:\n{err}" ) from err endpoint.traffic = {deployment_name: 100} endpoint_poller = ml_client.online_endpoints.begin_create_or_update(endpoint)

Our endpoint is now deployed. Let's start testing it.

endpoint_results = endpoint_poller.result() endpoint_name = endpoint_results.name keys = ml_client.online_endpoints.get_keys(name=endpoint_name) primary_key = keys.primary_key url = os.path.join(endpoint_results.scoring_uri, "v1") endpoint_name = ( endpoint_results.name if endpoint_name is None else endpoint_name ) keys = ml_client.online_endpoints.get_keys(name=endpoint_name)

Once we have the API key, we can use the OpenAI client to stream tokens.

from openai import OpenAI vllm_client = OpenAI(base_url=url, api_key=primary_key) # Create your prompt system_message = """You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.""" user_message = f"""A 3-week-old child has been diagnosed with late onset perinatal meningitis, and the CSF culture shows gram-positive bacilli. What characteristic of this bacterium can specifically differentiate it from other bacterial agents?""" response = vllm_client.chat.completions.create( model=model_path, messages=[ {"role": "system", "content": system_message}, {"role": "user", "content": user_message}, ], temperature=0.7, max_tokens=4000, stream=True, # Stream the response ) print("Streaming response:") for chunk in response: delta = chunk.choices[0].delta if hasattr(delta, "content"): print(delta.content, end="", flush=True)

Conclusion

Fine-tuning the DeepSeek-R1-Distill-Llama-8B model with PyTorch FSDP and QLoRA on Azure Machine Learning offers a powerful approach to customising LLMs for specific tasks. By leveraging the scalability and efficiency of these techniques, you can unlock the full potential of LLMs and drive innovation in your domain. Hope you liked the blog. Do like and follow for more such content. Thanks, Manoranjan Rajguru, AI Global Black Belt.

Scalable and Efficient Fine-Tuning of LLM on Azure ML
https://github.com/james-tn/llm-fine-tuning/tree/main/opensource_llm/single_step

Co-Author: Mohamad AL jazaery

Why Scalable and Efficient Fine-Tuning Matters

- Faster iterations, shorter time-to-value: In today's competitive AI landscape, time is of the essence. The faster you can fine-tune a model, the quicker you can validate ideas, test hypotheses, and bring solutions to market.
- High-end GPU machines are costly: High-performance GPUs and compute clusters don't come cheap, and their availability is often limited. Efficient fine-tuning techniques, such as model sharding and distributed training, maximize the utilization of these precious resources, ensuring that you get the most out of your infrastructure investment.

Choosing the Right Azure ML GPU Compute for the Job: NC or ND?

Not all GPU computes are created equal, and choosing the right SKU can make or break your training efficiency.

- ND Series: Ideal for distributed training across multiple nodes, thanks to its InfiniBand (IB) connectivity that ensures high-speed communication between nodes, for example pretraining an LLM or fine-tuning a very large model (~70B parameters).
- NC Series: Best for small and medium workloads that do not need heavy communication between nodes, such as LLM inferencing or mid-size LLM fine-tuning.

Azure GPU Machine Options by Scenario:

| Scenario | Common model size | Training Approach | Recommended Azure Compute |
| --- | --- | --- | --- |
| Small-scale fine-tuning | < 3B parameters | Parameter-efficient tuning | NCas_T4_v3 (Tesla T4, 16 GB) |
| Medium-scale fine-tuning | 1–5B parameters | Full or parameter-efficient | NCs_v3 (Tesla V100, 16 GB) |
| Distributed training for medium models | 5–10B parameters | Full fine-tuning | ND_v2 (Tesla V100 NVLINK, 32 GB, InfiniBand) |
| Large-scale fine-tuning (single machine) | 10–30B parameters | Full or parameter-efficient | NC_A100_v4 (A100, 40 GB) |
| Distributed training for very large models | 20–70B parameters | Full fine-tuning | NDasrA100_v4 (A100, 80 GB, HDR InfiniBand) |
| Very large model training (single machine) | up to 70B parameters | Full or parameter-efficient | NCads_H100_v5 (H100 NVL, 94 GB) |
| Massive-scale distributed training | > 70B parameters | Full fine-tuning | ND-H100-v5 (H100, 80 GB, scale-out InfiniBand) |

Distributed Efficient Training: A Quick Guide

When scaling fine-tuning tasks, choosing the right distributed training method is key:

- DDP (Data Parallelism): Works well when the entire model fits on a single GPU. It replicates the model across multiple GPUs and splits the data for parallel processing. See experiment 1 in the following section.
- Model Parallelism: A game-changer for massive models that don't fit on a single GPU. It shards not only the data but also the model parameters and optimizer states across multiple GPUs, enabling efficient training of models like LLaMA-70B on GPUs with limited memory. Both FSDP and DeepSpeed excel at implementing advanced forms of model parallelism and memory optimization.

Memory Optimization Techniques

- Gradient Checkpointing: Reduces memory by recomputing activations during the backward pass, trading memory for additional computation.
- Mixed Precision Training: Reduces memory usage by using FP16 or BF16 instead of FP32, accelerating training while maintaining numerical stability. Supported by both frameworks.
- Quantization (DeepSpeed Exclusive): Uses INT8 precision for weights and activations, dramatically reducing memory and compute requirements.
- Offloading (DeepSpeed Exclusive): Offloads optimizer states and model parameters to CPU or NVMe, freeing up GPU memory for computation.
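To make the techniques above concrete, here is a minimal sketch, our illustration rather than the authors' exact configuration, of how gradient checkpointing, BF16 mixed precision, and FSDP sharding with CPU offload can be combined in Hugging Face TrainingArguments; the option strings follow the transformers FSDP integration, and the values are examples only.

```python
# Illustrative sketch: combining gradient checkpointing, mixed precision, and FSDP
# sharding/offload via Hugging Face TrainingArguments. Values are examples only.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,          # trade more steps for lower peak memory
    gradient_checkpointing=True,            # recompute activations in the backward pass
    bf16=True,                              # BF16 mixed precision (use fp16=True on older GPUs)
    fsdp="full_shard auto_wrap offload",    # shard params/grads/optimizer states, offload to CPU
    fsdp_config={
        "backward_prefetch": "backward_pre",
        "cpu_ram_efficient_loading": True,
    },
)
```

A DeepSpeed-based setup would typically express the same offloading and quantization choices through a ZeRO configuration file instead.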
Our Experiments: Pushing the Limits of Scalability

Experiment 1: Distributed Training on Multiple Nodes using DDP

We conducted an experiment to fine-tune the Llama-3.1-8B model using LoRA (Low-Rank Adaptation) on Azure ML NDv2-V100 nodes. The goal was to evaluate the efficiency of fine-tuning across different numbers of nodes (1, 2, and 3) and observe the impact on training time and throughput.

Azure ML Job YAML Definition

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json type: command code: ./ # Path to your training script and related files inputs: model_dir: path: azureml://registries/azureml/models/mistralai-Mistral-7B-v01/versions/19 command: > accelerate launch --num_processes 16 # gpu per machine * num of machines --num_machines 2 --machine_rank $NODE_RANK --main_process_ip $MASTER_ADDR --main_process_port $MASTER_PORT compute: azureml:ndv2-cluster resources: instance_count: 2 # Number of nodes for distributed training distribution: type: pytorch process_count_per_instance: 1 # Number of processes per node

Results: As we increased the number of nodes from one to three, throughput increased proportionally. This indicates that the system scaled efficiently with the addition of more nodes, maintaining a close-to-linear improvement in throughput.

Experiment 2: Model Parallelism using FSDP

Fine-tuning a 70B-parameter model on GPUs with only 32 GB of memory might sound impossible, but we made it happen using FSDP (Fully Sharded Data Parallelism) on Azure ML with a cluster of multiple NDv2-V100 nodes. By distributing not only the data but also the model parameters and optimizer states across multiple nodes, we unlocked the power of full sharding.

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json type: command code: ./ # Path to your training script and related files inputs: model_dir: path: azureml://registries/azureml-meta/models/Llama-3.3-70B-Instruct/versions/4 command: > accelerate launch --config_file "configs/fsdp_config.yaml" --num_processes 32 --num_machines 4 --machine_rank $NODE_RANK --main_process_ip $MASTER_ADDR --main_process_port $MASTER_PORT train.py compute: azureml:ndv2-cluster resources: instance_count: 4 # Number of nodes for distributed training distribution: type: pytorch process_count_per_instance: 1 # Number of processes per node

Key Takeaways:
- Memory Efficiency: Full sharding enabled us to fine-tune the LLaMA-70B model on V100 GPUs despite their limited memory.
- Connectivity Matters: The InfiniBand (IB) connectivity of ND nodes played a critical role in ensuring smooth communication across GPUs, making this feat possible.

Conclusion

Scalable and efficient fine-tuning is the key to unlocking the true potential of Large Language Models. By leveraging distributed training techniques such as FSDP and DDP, and optimizing compute resources on Azure ML, researchers and practitioners can overcome the challenges of training massive models, reducing costs, accelerating time-to-value, and driving AI innovation. Access the code and start experimenting here!

Future work: The second part will focus on real-world pipeline setups, including end-to-end model training, hyperparameter optimization, and testing. The third part will dive into deploying trained models for practical use. Future posts may explore best practices for specific fine-tuning scenarios and techniques.

Unlocking Function Calling with vLLM and Azure Machine Learning
Introduction In this post, we’ll explain how to deploy LLMs on vLLM using Azure Machine Learning’s Managed Online Endpoints for efficient, scalable, and secure real-time inference. Next, we will look at function calling, and how vLLM's engine can support you to achieve that. To get started, let’s briefly look into what vLLM and Managed Online Endpoints are. You can find the full code examples on vllm-on-azure-machine-learning. vLLM vLLM is a high-throughput and memory-efficient inference and serving engine designed for large language models (LLMs). It optimizes the serving and execution of LLMs by utilizing advanced memory management techniques, such as PagedAttention, which efficiently manages attention key and value memory. This allows for continuous batching of incoming requests and fast model execution, making vLLM a powerful tool for deploying and serving LLMs at scale. vLLM supports seamless integration with popular Hugging Face models and offers various decoding algorithms, including parallel sampling and beam search. It also supports tensor parallelism and pipeline parallelism for distributed inference, making it a flexible and easy-to-use solution for LLM inference (see full docs). Managed Online Endpoints in Azure Machine Learning Managed Online Endpoints in Azure Machine Learning provide a streamlined and scalable way to deploy machine learning models for real-time inference. These endpoints handle the complexities of serving, scaling, securing, and monitoring models, allowing us to focus on building and improving your models without worrying about infrastructure management. HuggingFace Model Deployment Let’s go through deploying a HuggingFace model on Azure Machine Learning’s Managed Online Endpoints. For this, we’ll use a custom Dockerfile and configuration files to set up the deployment. As a model, we’ll be using meta-llama/Llama-3.1-8B-Instruct on a single Standard_NC24ads_A100_v4 instance. Step 1: Create a custom Environment for vLLM on AzureML First, we create a Dockerfile to define the environment for our model. For this, we’ll be using vllm’s base container that has all the dependencies and drivers included: FROM vllm/vllm-openai:latest ENV MODEL_NAME facebook/opt-125m ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server --model $MODEL_NAME $VLLM_ARGS The idea here is that we can pass a model name via an ENV variable, so that we can easily define which model we want to deploy during deployment time. Next, we log into our Azure Machine Learning workspace: az account set --subscription <subscription ID> az configure --defaults workspace=<Azure Machine Learning workspace name> group=<resource group> Now, we create an environment.yml file to specify the environment settings: $schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json name: vllm build: path: . dockerfile_path: Dockerfile Then let’s build the environment: az ml environment create -f environment.yml Step 2: Deploy the AzureML Managed Online Endpoint Time for deployment, so let’s first create an endpoint.yml file to define the Managed Online Endpoint: $schema: https://azuremlsdk2.blob.core.windows.net/latest/managedOnlineEndpoint.schema.json name: vllm-hf auth_mode: key Let’s create it: az ml online-endpoint create -f endpoint.yml For the next step, we’ll need the address of the Docker image address we created. 
We can quickly get it from AzureML Studio -> Environments -> vllm: Finally, we create a `deployment.yml file to configure the deployment settings and deploy our desired model from HuggingFace via vLLM: $schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json name: current endpoint_name: vllm-hf environment_variables: MODEL_NAME: meta-llama/Llama-3.1-8B-Instruct # define the model name using the identifier from HG VLLM_ARGS: "--enable-auto-tool-choice --tool-call-parser llama3_json" HUGGING_FACE_HUB_TOKEN: xxxxxxxxxxxxxx # Add your HF API key here environment: image: xxxxxxxxx.azurecr.io/azureml/azureml_xxxxxxxxxxx # Replace with your own image inference_config: liveness_route: port: 8000 path: /ping readiness_route: port: 8000 path: /health scoring_route: port: 8000 path: / instance_type: Standard_NC24ads_A100_v4 instance_count: 1 request_settings: request_timeout_ms: 60000 max_concurrent_requests_per_instance: 16 liveness_probe: initial_delay: 10 period: 10 timeout: 2 success_threshold: 1 failure_threshold: 30 readiness_probe: initial_delay: 120 period: 10 timeout: 2 success_threshold: 1 failure_threshold: 30 Since vLLM does not support separate probes for readiness and liveness, we’ll need to make sure that the model has fully loaded before the fire the first probe. This is why we increased readiness_probe.initial_delay to 120s. For larger models, we should also follow vLLM’s documentation for using tensor parallel inference (model on single node but spanning multiple GPUs) by adding --tensor-parallel-size <NUM_OF_GPUs> to VLLM_ARGS. Since we’re using a single A100 GPU in our example (Standard_NC24ads_A100_v4), this is not required though. The request_settings depend a bit on our instance type/size and might require some manual tuning to get the model run properly and efficiently. Goal is to find a good tradeoff between concurrency (max_concurrent_requests_per_instance) and queue time in order to avoid either hitting request_timeout_ms from the endpoint side, or any HTTP-timeouts on the client side. Both these scenarios result in HTTP 429, and the client would need to implement exponential backoff (e.g. via tenacity library). Lastly, we can deploy the model: az ml online-deployment create -f deployment.yml --all-traffic By following these steps, we have deployed a HuggingFace model on Azure Machine Learning’s Managed Online Endpoints, ensuring efficient and scalable real-time inference. Time to test it! Step 3: Testing the deployment# First, let’s get the endpoint’s scoring uri and the api keys: az ml online-endpoint show -n vllm-hf az ml online-endpoint get-credentials -n vllm-hf For completion models, we can then call the endpoint using this Python code snippet: import requests url = "https://vllm-hf.polandcentral.inference.ml.azure.com/v1/completions" headers = { "Content-Type": "application/json", "Authorization": "Bearer xxxxxxxxxxxx" } data = { "model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "San Francisco is a", "max_tokens": 200, "temperature": 0.7 } response = requests.post(url, headers=headers, json=data) print(response.json()) Response: { "id": "cmpl-98d658cf-6310-4c87-a24f-723dda6db176", "object": "text_completion", "created": 1738267352, "model": "meta-llama/Llama-3.1-8B-Instruct", "choices": [ { "index": 0, "text": " top tourist destination known for its iconic Golden Gate Bridge, steep hills, vibrant neighborhoods, and cultural attractions. 
The city is a haven for foodies, with a diverse range of cuisines available, from seafood to Mexican to Chinese and more.\nOne of the best ways to experience San Francisco is by taking a ride on a historic cable car, which offers stunning views of the city and its surroundings. Explore the historic Fisherman's Wharf, a bustling waterfront district filled with seafood restaurants, street performers, and souvenir shops.\nVisit the vibrant neighborhoods of Haight-Ashbury and the Mission District, known for their colorful street art, independent shops, and lively music scenes. Take a stroll through Golden Gate Park, a sprawling urban park that features gardens, lakes, and walking and biking trails.\n\nThe city has a thriving arts and culture scene, with numerous museums, galleries, and performance venues. The San Francisco Museum of Modern Art (SFMOMA) is one of the largest modern art museums in", "logprobs": null, "finish_reason": "length", "stop_reason": null, "prompt_logprobs": null } ], "usage": { "prompt_tokens": 5, "total_tokens": 205, "completion_tokens": 200, "prompt_tokens_details": null } } Works! Function Calling Function calling in the context of large language models (LLMs) refers to the model's ability to dynamically generate and call structured functions based on context, user input, or specific task requirements. It enables seamless interaction with APIs, databases, or external tools while leveraging the model's reasoning capabilities. vLLM provides an OpenAI-compatible server that supports the Completions, Chat Completions, and Embeddings APIs. For instance, it enables developers to seamlessly integrate models into existing workflows. Developers can use the official OpenAI Python client or any HTTP client to interact with vLLM, making it straightforward to integrate into existing workflows. Before running the code, ensure you have the OpenAI library installed by executing: pip install openai The following code demonstrates the function-calling capabilities of vLLM using an example where the assistant retrieves information about historical events based on a provided date: Lets go through it step by step 1. Defining a Custom Function: A query_historical_event function is defined, containing a dictionary of fictional historical events. This function serves as a callable endpoint for vLLM to retrieve information based on a user-specified date. 
def query_historical_event(date): fictional_historical_events = { "1805-03-21": "On March 21, 1805, the Treaty of Varis signed by several European powers established the first coordinated effort to protect migratory bird species.", "1898-07-10": "On July 10, 1898, the Great Illumination Act was passed in London, mandating the installation of electric streetlights across all major cities in the United Kingdom.", "1923-09-05": "On September 5, 1923, the International Academy of Innovation was founded in Zurich, Switzerland, promoting global collaboration in scientific research.", "1940-02-14": "On February 14, 1940, the first underwater train tunnel connecting two countries was completed between France and the United Kingdom.", "1954-11-08": "On November 8, 1954, the Global Weather Watch Program was launched, pioneering the use of satellites for monitoring Earth's climate systems.", "1977-06-30": "On June 30, 1977, the first fully solar-powered town, Solaria, was inaugurated in Arizona, setting a benchmark for renewable energy communities.", "1983-12-12": "On December 12, 1983, the Universal Language Project introduced a simplified global auxiliary language intended to foster cross-cultural communication.", "1994-04-23": "On April 23, 1994, the Oceanic Research Pact was signed, marking a commitment by 40 nations to share oceanographic research and preserve marine ecosystems.", "2009-08-15": "On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours.", "2020-01-10": "On January 10, 2020, the World Clean Air Initiative achieved its milestone goal of reducing urban air pollution levels in 50 major cities globally." } return fictional_historical_events.get(date, f"No historical event information available for {date}.") 2. Tool Integration: The function is wrapped in a tools definition, which includes metadata such as the function’s name, description, and expected parameters (e.g., the date in YYYY-MM-DD format). tools = [ { "function": { "name": "query_historical_event", "description": "Provides information about a historical event that occurred on a specified date.", "parameters": { "type": "object", "properties": { "date": { "type": "string", "description": "The date of the event in YYYY-MM-DD format." }, }, "required": ["date"] } } } ] 3. Conversation Workflow: The conversation starts with a system message setting the assistant's role and a user query about a specific date. The assistant evaluates the query and decides if the custom function is needed. messages = [ {"role": "system", "content": "You are a knowledgeable assistant that can retrieve information about historical events."}, {"role": "user", "content": "Can you tell me what happened on August 15, 2009?"}, ] 4. Function Call Handling: If the assistant determines that the function is required, it: Parses the function call and extracts the necessary parameters (e.g., date). Executes the query_historical_event function with the provided arguments and returns the result to the user. 
chat_response = client.chat.completions.create( model="meta-llama/Llama-3.1-8B-Instruct", messages=messages, temperature=0.7, max_tokens=1024, top_p=0.9, frequency_penalty=0.5, presence_penalty=0.6, tools=tools, tool_choice='auto' ) if chat_response.choices[0].message.tool_calls: date_argument = json.loads( chat_response.choices[0].message.tool_calls[0].function.arguments) date = date_argument.get("date", None) response = query_historical_event(date) print("Assistant response:", response) else: print("Assistant response:", chat_response.choices[0].message.content) Example Workflow User Query: "Can you tell me what happened on August 15, 2009?" Assistant Function Call: The assistant identifies the query’s intent and calls query_historical_event with the argument date="2009-08-15". Response: The function retrieves the event: "On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours." Full Code from openai import OpenAI import json # Set up API client with the vLLM server settings openai_api_key = <your-deployment-key> openai_api_base = "https://vllm-hf.eastus2.inference.ml.azure.com/v1/" client = OpenAI(api_key=openai_api_key, base_url=openai_api_base) def query_historical_event(date): fictional_historical_events = { "1805-03-21": "On March 21, 1805, the Treaty of Varis signed by several European powers established the first coordinated effort to protect migratory bird species.", "1898-07-10": "On July 10, 1898, the Great Illumination Act was passed in London, mandating the installation of electric streetlights across all major cities in the United Kingdom.", "1923-09-05": "On September 5, 1923, the International Academy of Innovation was founded in Zurich, Switzerland, promoting global collaboration in scientific research.", "1940-02-14": "On February 14, 1940, the first underwater train tunnel connecting two countries was completed between France and the United Kingdom.", "1954-11-08": "On November 8, 1954, the Global Weather Watch Program was launched, pioneering the use of satellites for monitoring Earth's climate systems.", "1977-06-30": "On June 30, 1977, the first fully solar-powered town, Solaria, was inaugurated in Arizona, setting a benchmark for renewable energy communities.", "1983-12-12": "On December 12, 1983, the Universal Language Project introduced a simplified global auxiliary language intended to foster cross-cultural communication.", "1994-04-23": "On April 23, 1994, the Oceanic Research Pact was signed, marking a commitment by 40 nations to share oceanographic research and preserve marine ecosystems.", "2009-08-15": "On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours.", "2020-01-10": "On January 10, 2020, the World Clean Air Initiative achieved its milestone goal of reducing urban air pollution levels in 50 major cities globally." } return fictional_historical_events.get(date, f"No historical event information available for {date}.") tools = [ { "function": { "name": "query_historical_event", "description": "Provides information about a historical event that occurred on a specified date.", "parameters": { "type": "object", "properties": { "date": { "type": "string", "description": "The date of the event in YYYY-MM-DD format." 
}, }, "required": ["date"] } } } ] messages = [ {"role": "system", "content": "You are a knowledgeable assistant that can retrieve information about historical events."}, {"role": "user", "content": "Can you tell me what happened on August 15, 2009?"}, ] chat_response = client.chat.completions.create( model="meta-llama/Llama-3.1-8B-Instruct", messages=messages, temperature=0.7, max_tokens=1024, top_p=0.9, frequency_penalty=0.5, presence_penalty=0.6, tools=tools, tool_choice='auto' ) if chat_response.choices[0].message.tool_calls: date_argument = json.loads(chat_response.choices[0].message.tool_calls[0].function.arguments) date = date_argument.get("date", None) response = query_historical_event(date) print("Assistant response:", response) else: print("Assistant response:", chat_response.choices[0].message.content) Response: Tool has been called with date: 2009-08-15 Assistant response: On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours You've successfully implemented function calling using your deployed Llama-3.1-8B model. Conclusion To wrap up, deploying large language models on vLLM with Azure Machine Learning Managed Online Endpoints is a simple and effective way to enable real-time AI-powered applications. By following the steps shared—from setting up the environment to testing the deployment—you can quickly integrate advanced models like Llama-3.1-8B-Instruct into your workflows. With vLLM's optimized performance and support for function calling, your applications can handle complex tasks and interact with other systems seamlessly. This setup helps you build smarter, faster, and more scalable AI solutions.874Views0likes0CommentsUnlocking the Power of Synthetic Data for Fine-Tuning and Evaluation
In the rapidly evolving field of large language models (LLMs) and small language models (SLMs), fine-tuning and evaluation often present unique challenges. Whether the objective is to optimize models for function-calling use cases or to validate multi-agent workflows, one thing remains constant: the need for high-quality, diverse, and contextually relevant data. But what happens when real-world data is either unavailable, incomplete, or too sensitive to use? Enter synthetic data—a powerful tool for accelerating the journey from experimentation to deployment. In this blog, we’ll explore how synthetic data can address critical challenges, why it’s indispensable for certain scenarios, and how Azure AI’s Evaluator Simulator Package enables seamless generation of synthetic interaction data to simulate user personas and scenarios. The Growing Need for Synthetic Data in LLM Development Fine-tuning or evaluating an LLM/SLM for specific use cases often requires vast amounts of labeled data tailored to the task at hand. However, sourcing such data comes with hurdles: Data Scarcity: Real-world interaction data for niche use cases may not exist in sufficient quantity. Privacy Concerns: User interactions may contain sensitive information, making direct use of this data problematic. Scenario Testing: Real-world data rarely accounts for edge cases or extreme scenarios that models must handle gracefully. Synthetic data solves these problems by creating controlled, customizable datasets that reflect real-world conditions—without the privacy risks or availability constraints. Synthetic Data for Function-Calling Use Cases Function-calling in LLMs involves executing API calls based on natural language inputs. For example, users might ask a travel app to “find flights to Paris under $500.” Fine-tuning models for such use cases requires training them on structured, intent-rich inputs paired with corresponding API call structures. Synthetic data can: Simulate diverse intents: Generate variations of user queries across languages, styles, and preferences. Provide structured outputs: Automatically align these queries with the required API call schema for training or evaluation. Include edge cases: Test how models respond to ambiguous or incomplete queries. Model evaluation post fine-tuning presents another set of challenges where we need trusted data to evaluate the performance. Hence, having synthetic data generated by a superior model followed by human screening filtering out noise can provide a rich and diverse data to compare the performance of fine-tuned vs base models. Synthetic Data in Multi-Agent Workflow Evaluation Multi-agent workflows involve multiple models (or agents) collaborating to achieve a shared goal. A restaurant recommendation system, for example, may feature one agent parsing user preferences, another querying a knowledge graph, and a third crafting human-like responses. Synthetic data can: Simulate complex user personas: From foodies to budget-conscious travelers, generating interactions that test the robustness of multi-agent collaboration. Recreate realistic workflows: Model intricate agent-to-agent interactions, complete with asynchronous communication and fallback mechanisms. Stress-test failure scenarios: Ensure agents recover gracefully from errors, misunderstandings, or timeouts. Multi-agent workflows often rely on hybrid architectures that combine SLMs, LLMs, domain-specific models, and fine-tuned systems to balance cost, latency, and accuracy. 
Synthetic data generated by a superior model can serve as a baseline for evaluating nuances like agent orchestration and error recovery. Azure AI Evaluator Simulator: A Game-Changer Azure AI's Evaluator Simulator Package offers a robust framework for generating synthetic interaction data tailored to your application needs. By simulating diverse user personas and scenarios, it provides: Realistic Simulations: Emulate a wide range of user behaviors, preferences, and intents, making it ideal for creating datasets for function-calling and multi-agent workflows. Customizability: Tailor simulations to reflect domain-specific nuances, ensuring data relevance. Efficiency: Automate data generation at scale, saving time and resources compared to manual annotation. How It Works The Azure AI Evaluation SDK’s Simulator class is designed to generate synthetic conversations and simulate task-based interactions. The module allows you to configure different personas—such as tech-savvy users, college grads, enterprise professionals, customers, supply chain managers, procurement manager, finance admin etc each interacting with your application in unique ways. You can also define the tasks that each of these users are trying to accomplish like shopping for a family event, manging inventory, preparing financial reports etc. Here’s how it operates: Model Configuration: Initialize the simulator with your model’s parameters (e.g., temperature, top_p, presence_penalty). Input Preparation: Provide input data (e.g., text blobs) for context, such as extracting text from a Wikipedia page. Prompt Optimization: Use the query_response_generating_prompty_override to customize how query-response pairs are generated. User Prompt Specification: Define user behavior using the user_simulating_prompty_override to align simulations with specific personas. Target Callback Specification: Implement a callback function that connects the simulator with your application. Simulation Execution: Run the simulator to generate synthetic conversations based on your configurations. By following these steps, developers can create robust test datasets, enabling thorough evaluation and fine-tuning of their AI applications. Example: Synthetic Data for an E-Commerce Assistant Bot Let’s walk through an example of generating synthetic data for an e-commerce assistant bot. This bot can perform tasks such as acting as a shopping assistant, managing inventory, and creating promo codes. Before we get started, make sure to install azure-ai-evaluation package to follow along Step 1: Define Functions and APIs Start by defining the core functions the bot can invoke, such as search_products, fetch_product_details, and add_to_cart. These functions simulate real-world operations. Please refer functions and function_list to access the complete list of functions and function definitions. Step 2: Configure the Simulator model_config = { "azure_endpoint": azure_endpoint, "azure_api_key": azure_api_key, "azure_deployment": azure_deployment, } from azure.ai.evaluation.simulator import Simulator simulator = Simulator(model_config=model_config) Next connect the simulator to the application. 
For this, establish the client and implement a callback function that invokes the application and facilitates the interaction between the simulator and the app:

```python
import json  # needed for json.loads/json.dumps below
from typing import List, Dict, Any, Optional

from functions import *
from function_list import function_list
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider


def call_to_ai_application(query: str) -> str:
    # logic to call your application
    # use a try/except block to catch any errors
    system_message = (
        "Assume the role of e-commerce assistant designed for multiple roles. "
        "You can help with creating promo codes, tracking their usage, checking stock levels, "
        "helping customers make shopping decisions and more. You have access to a bunch of tools "
        "that you can use to help you with your tasks. You can also ask the user for more "
        "information if needed."
    )
    completion = client.chat.completions.create(
        model=azure_deployment,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": query},
        ],
        max_tokens=800,
        temperature=0.1,
        top_p=0.2,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
        stream=False,
        tools=function_list,
        tool_choice="auto",
    )
    message = completion.choices[0].message
    # change this to return the response from your application
    return message


async def callback(
    messages: List[Dict],
    stream: bool = False,
    session_state: Any = None,  # noqa: ANN401
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # get the last message
    latest_message = messages_list[-1]
    query = latest_message["content"]
    context = None
    # call your endpoint or AI application here
    response = call_to_ai_application(query)
    # we are formatting the response to follow the OpenAI chat protocol format
    if response.tool_calls:
        prev_messages = messages["messages"]
        func_call_messages = []
        tool_calls = response.tool_calls
        # Add the tool calls to the messages
        for tool_call in tool_calls:
            formatted_response = {"role": "assistant", "function_call": tool_call.function.to_dict()}
            func_call_messages.append(formatted_response)
        # Execute the APIs and add the responses to the messages
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_args = tool_call.function.arguments
            func = globals().get(function_name)
            if callable(func):
                result = json.dumps(func(**json.loads(function_args)))
                formatted_response = {"role": "function", "content": result, "name": function_name}
                func_call_messages.append(formatted_response)
            else:
                print("Function {} not found".format(function_name))
        # Second API call: get the final response from the model
        final_response = client.chat.completions.create(
            model=azure_deployment,
            messages=prev_messages + func_call_messages,
        )
        final_response = {"content": final_response.choices[0].message.content, "role": "assistant"}
        func_call_messages.append(final_response)
        # Stringify the function-call messages to store them in session state
        func_call_messages = create_content_from_func_calls(func_call_messages)
        func_call_messages = {"role": "assistant", "content": func_call_messages}
        messages["messages"].append(func_call_messages)
        return {"messages": messages["messages"], "stream": stream, "session_state": session_state}
    else:
        formatted_response = {
            "content": response.content,
            "role": "assistant",
        }
        messages["messages"].append(formatted_response)
        return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
```

We have used two helper functions here:
create_content_from_func_calls: Creates a single string from a list of function-call dictionaries. This merges all the internal messages invoking function calls into one string, which is needed because the simulator module ignores all internal context and only retains the latest response.
split_content: Splits a string back into a list of message dictionaries based on specified separators. This is required as a post-processing step to split the string comprising the function calls and function responses into separate messages, each with its own role and content.

Step 3: Define the Tasks
Use the Azure AI Evaluation SDK to configure the simulator with user personas and tasks, such as:
A marketing manager creating a promo code and tracking its usage.
A customer making a purchase using the promo code.
An inventory manager checking stock levels.

Step 4: Customize the User Persona
Internally, the SDK has a prompty file that defines how the LLM simulating the user should behave. The SDK also offers an option to override this file with your own prompty. Let's override it to build a user persona who engages in an interactive conversation with the bot and asks follow-up questions, responding to the bot's replies based on their persona and requirements:

```
system:
You must behave as a user who wants to accomplish this task: {{ task }}, and you continue to interact with a system that responds to your queries. If there is a message in the conversation history from the assistant, make sure you read the content of the message and include it in your first response.
Your mood is {{ mood }}.
Make sure your conversation is engaging and interactive.
Output must be in JSON format. Here's a sample output:
{
  "content": "Here is my follow-up question.",
  "role": "user"
}
```

Step 5: Generate and Store Outputs
Run the simulator to generate synthetic data. You can specify max_conversation_turns, which defines the maximum number of conversation turns to simulate.

```python
outputs = await simulator(
    target=callback,
    text="Assume the role of e-commerce assistant designed for multiple roles. You can help with creating promo codes, tracking their usage, checking stock levels, helping customers make shopping decisions and more. You have access to a bunch of tools that you can use to help you with your tasks. You can also ask the user for more information if needed.",
    num_queries=3,
    max_conversation_turns=5,
    tasks=tasks,
    user_simulator_prompty=user_override_prompty,
    user_simulator_prompty_kwargs=user_prompty_kwargs,
)
```

Step 6: Review and Save the Outputs
Let's look at the output for one of the tasks. We can see how the simulator engages in an interactive conversation with the application to accomplish the desired task, and all the interaction between the app and the simulator is captured in the final output. Let's store the output in a file:

```python
with open("output.json", "w") as f:
    json.dump(final_outputs, f)
```

Conclusion
Synthetic data is more than a mere substitute for real-world data; it's a strategic asset for fine-tuning and evaluating LLMs. By enabling precise control over data generation, synthetic datasets empower developers to simulate user behaviors, test edge cases, and optimize models for specific workflows. With tools like Azure AI's Evaluator Simulator, generating this data has never been more accessible or impactful.
Whether you're building models for function-calling, orchestrating multi-agent systems, or tackling niche use cases, synthetic data ensures you're equipped to deliver reliable, high-performing solutions, regardless of complexity. Start leveraging synthetic data today and unlock the full potential of your LLM projects! You can access the full code here.

References
azureai-samples/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Input_Text at main · Azure-Samples/azureai-samples
How to generate synthetic and simulated data for evaluation - Azure AI Foundry | Microsoft Learn
Generate Synthetic QnAs from Real-world Data on Azure | Microsoft Community Hub
How to use function calling with Azure OpenAI Service - Azure OpenAI Service | Microsoft Learn
Fine-tuning function calls with Azure OpenAI Service - Azure AI services | Microsoft Learn
Discover the Azure AI Training Profiler: Transforming Large-Scale AI Jobs

Meet the AI Training Profiler
Large-scale AI training can be complicated, especially in distributed environments like healthcare, finance, and e-commerce, where the need for accuracy, speed, and massive data processing is crucial. Efficiently managing hardware resources, ensuring smooth parallelism, and minimizing bottlenecks are essential for optimal performance. The AI Training Profiler, powered by PyTorch Profiler in Azure Machine Learning, is here to help! By giving you detailed visibility into hardware and software metrics, this tool helps you spot inefficiencies, make the best use of resources, and scale your training workflows like a pro.

Why Choose the AI Training Profiler?
Running large AI training jobs on distributed infrastructure is inherently complex, and inefficiencies can quickly escalate into increased costs and delays in deploying models. The AI Training Profiler addresses these issues by providing a comprehensive breakdown of compute resource usage throughout the training lifecycle. This enables users to fine-tune and streamline their AI workflows, yielding several key benefits:
Improved Performance: Identify bottlenecks and inefficiencies, such as slow data loading or underutilized GPUs, to enhance training throughput.
Reduced Costs: Detect idle or underused resources, thereby minimizing compute time and hardware expenses.
Faster Debugging: Leverage real-time monitoring and intuitive visualizations to troubleshoot performance issues swiftly.

Key Features of the AI Training Profiler

GPU Core and Tensor Core Utilization
The profiler meticulously tracks GPU kernel execution, reporting utilization metrics such as time spent on forward and backward passes, tensor core operations, and other computation-heavy tasks. This detailed breakdown enables users to pinpoint under-utilized resources and optimize kernel execution patterns.

Memory Profiling
Memory Allocation and Peak Usage: Monitors GPU memory usage throughout the training process, offering insights into underutilized or over-allocated memory.
CUDA Memory Footprint: Visualizes memory consumption during forward/backward propagation and optimizer steps to identify bottlenecks or fragmentation.
Page Fault and Out-of-Memory Events: Detects critical events that could slow training or cause job failures due to insufficient memory allocation.

Kernel Execution Metrics
Kernel Execution Time: Provides per-kernel timing, breaking down execution into compute-bound and memory-bound operations, allowing users to discern whether performance bottlenecks stem from inefficient kernel launches or memory access patterns.
Instruction-Level Performance: Measures IPC (Instructions Per Cycle) to understand kernel-level performance and identify inefficient operations.

Distributed Training
Communication Primitives: Captures inter-GPU and inter-node communication patterns, focusing on the performance of primitives like AllReduce, AllGather, and Broadcast in multi-GPU training. This helps users identify communication bottlenecks such as imbalanced data distribution or excessive communication overhead.
Synchronization Events: Measures the time spent on synchronization barriers between GPUs, highlighting where parallel execution is slowed by synchronization.

Getting Started with the Profiling Process
Using the AI Training Profiler is a breeze! Activate it when you launch a job, either through the CLI or our platform's user-friendly interface.
Here are the three environment variables you need to set:
Enable/Disable the Profiler: ENABLE_AZUREML_TRAINING_PROFILER: 'true'
Configure Trace Capture Duration: AZUREML_PROFILER_RUN_DURATION_MILLISECOND: '50000'
Delay the Start of Trace Capturing: AZUREML_PROFILER_WAIT_DURATION_SECOND: '1200'
Once your training job is running, the profiler collects metrics and stores them centrally. After the run, this data is analyzed to give you visual insights into critical metrics like kernel execution times.

Use Cases
The AI Training Profiler is a game-changer for fine-tuning large language models and other extensive architectures. By ensuring efficient GPU utilization and minimizing distributed training costs, this tool helps organizations get the most out of their infrastructure, whether they're working on cutting-edge models or refining existing workflows.

In conclusion, the AI Training Profiler is a must-have for teams running large-scale AI training jobs. It offers the visibility and control needed to optimize resource utilization, reduce costs, and accelerate time to results. Embrace the future of AI training optimization with the AI Training Profiler and unlock the full potential of your AI endeavors.

How to Get Started?
The feature is available in preview; simply set the environment variables above and start using the profiler! Stay tuned for a future repository with many samples that you can use as well!
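As an illustration only, here is a minimal sketch of how these environment variables could be attached to a training job with the Azure ML Python SDK v2. The subscription, workspace, environment, compute cluster, and script names are placeholders, not values from the post.

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Placeholders: substitute your own subscription, resource group, and workspace.
ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>"
)

job = command(
    code="./src",                                   # folder containing train.py (placeholder)
    command="python train.py",
    environment="azureml:my-training-env@latest",   # placeholder training environment
    compute="gpu-cluster",                          # placeholder GPU compute cluster
    environment_variables={
        "ENABLE_AZUREML_TRAINING_PROFILER": "true",
        "AZUREML_PROFILER_RUN_DURATION_MILLISECOND": "50000",
        "AZUREML_PROFILER_WAIT_DURATION_SECOND": "1200",
    },
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)  # open the run in Azure ML studio to inspect profiler output
```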
Supercharge Your Deep Learning Workflows with NVIDIA Nsight Systems

Machine learning (ML), deep learning (DL), and AI workloads are becoming more complex, necessitating efficient resource utilization and time management. NVIDIA Nsight Systems is a powerful performance analysis tool designed to optimize these workloads by providing insights into application behavior on GPUs and CPUs. This blog post discusses the importance of optimizing ML/DL workloads to improve training times and productivity and provides an overview of NVIDIA Nsight Systems, including its key UI components.

The post offers a practical example of optimizing a deep learning model using the FashionMNIST dataset. Initial profiling with Nsight Systems reveals bottlenecks in data handling rather than GPU computation. By increasing batch size, parallelizing data loading, and reducing fork operations, the training time is significantly reduced from 28 seconds to just 2.2 seconds. Further optimizations include enabling Automatic Mixed Precision (AMP) and using DistributedDataParallel for multi-GPU setups. By leveraging these optimizations and profiling tools like NVIDIA Nsight Systems, ML/DL workloads can achieve substantial performance improvements, leading to faster training times and more efficient use of resources.
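The post's full walkthrough is not reproduced here, but a minimal sketch of the kinds of changes it describes (a larger batch size, parallel data loading, and AMP) might look like the following. The batch size, worker count, and model are illustrative, not the post's measured configuration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Parallel data loading with a larger batch size keeps the GPU fed.
train_data = datasets.FashionMNIST(root="data", train=True, download=True,
                                   transform=transforms.ToTensor())
train_loader = DataLoader(train_data, batch_size=512, shuffle=True,
                          num_workers=4, pin_memory=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 512), nn.ReLU(),
                      nn.Linear(512, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # Automatic Mixed Precision (AMP)

for images, labels in train_loader:
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in mixed precision
        loss = loss_fn(model(images), labels)
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```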
Unlocking the Power of Large-Scale Training in AI

Why Large-Scale Training?
So, why are we so obsessed with large-scale AI models anyway? Well, larger models have more parameters; think of these as tiny levers and switches that adjust to learn from data. The more parameters, the more complex tasks a model can handle. In the world of natural language processing (NLP), for instance, GPT-3 boasts 175 billion parameters, making it capable of understanding nuanced language and generating impressive responses. These larger models don't just stop at text. They're pushing boundaries in healthcare, finance, and beyond, handling things like medical image analysis, fraud detection, and even predicting patient outcomes. But here's the catch: as these models increase in parameters, so does the need for immense computational power. Training a model as big as GPT-3 on a single machine? That's a non-starter; it would take forever. And that's where distributed training comes in.

The Perks (and Pitfalls) of Large-Scale Training
Building large AI models unlocks incredible possibilities, but it's not all sunshine and rainbows. Here's a peek into the main challenges that come with training these behemoths:
Memory Limitations: Picture this: you have a huge model with billions of parameters, but each GPU has limited memory. Trying to squeeze the whole model into a single GPU? Forget it. It's like trying to stuff an elephant into a suitcase. (A quick back-of-envelope calculation below makes the gap concrete.)
Computation Bottlenecks: Even if you could load the model, running it would take weeks, maybe even months. With every training step, the compute requirements grow, and training on a single machine becomes both a time and cost nightmare.
Data Synchronization & Management: Now imagine you've got multiple GPUs or nodes working together. That sounds good in theory, but all these devices need to stay in sync. Model parameters and gradients (fancy math terms for "how the model learns") need to be shared constantly across all GPUs. If not managed carefully, this can slow training down to a crawl.
These challenges make it clear why simply "scaling up" on one machine isn't enough. We need something better, and that's where distributed training steps in.

Distributed Training: The Secret Sauce for Large AI Models
Distributed training is like assembling an elite team of GPUs and servers to tackle different parts of the problem simultaneously. This process breaks up the heavy lifting, spreading the workload across multiple machines to make things run faster and more efficiently.

Why Go Distributed?
Faster Training Times: By splitting up the work, distributed training slashes training time. A job that might have taken weeks on one machine can often be completed in days, or even hours, by spreading it across multiple devices.
Big Data? No Problem: Distributed training is also a lifesaver when dealing with massive datasets. It can process these large datasets in parallel, helping the model learn faster by exposing it to more data in less time. Imagine trying to watch a series by watching one episode on your laptop, another on your phone, and another on your tablet, all at once. That's the efficiency we're talking about here.
Scalability: Need more power? Distributed training allows you to scale up with additional GPUs or nodes. Think of it as being able to add more horsepower to your AI engine anytime you need it.
For a deeper dive into distributed training principles, check out this guide on distributed training with Azure.
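Here is the quick back-of-envelope calculation promised above. It assumes fp16 weights and gradients plus fp32 Adam optimizer state (roughly 16 bytes per parameter in total); the numbers are order-of-magnitude estimates rather than figures from the post.

```python
# Rough estimate of the memory needed just to hold a GPT-3-scale model during training.
params = 175e9                  # 175 billion parameters

weights_fp16 = params * 2       # 2 bytes per fp16 parameter
grads_fp16 = params * 2         # gradients, also fp16
adam_states_fp32 = params * 12  # fp32 master weights + Adam momentum and variance (4 bytes each)

total_bytes = weights_fp16 + grads_fp16 + adam_states_fp32
print(f"~{total_bytes / 1e9:.0f} GB")  # roughly 2,800 GB, before activations
# A single 80 GB A100 holds only a small fraction of this, which is why the model
# and its optimizer state must be sharded across many GPUs.
```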
The Different Flavors of Distributed Training
Distributed training isn't one-size-fits-all. It comes in several "flavors," each suited to different needs:
Data Parallelism: Here, we split the dataset across multiple GPUs; each GPU trains on its chunk of the data, and then they synchronize to keep the model consistent. It's great when the model can fit on a single GPU but the dataset is too large. (A minimal PyTorch sketch of this approach appears at the end of this section.)
Model Parallelism: For models that are just too huge to fit on one GPU, model parallelism divides the model itself across GPUs. Each part of the model is trained on a different GPU, which is ideal for extremely large architectures like some NLP and vision models.
Hybrid Approaches: The best of both worlds! By combining data and model parallelism, we can train large datasets on large models efficiently. Techniques like Microsoft's Zero Redundancy Optimizer (ZeRO) take this a step further by distributing the memory load, making it possible to train super-large models even on limited hardware.

Azure AI: A Distributed Training Powerhouse
So, how does Azure AI fit into all this? Azure is like the ultimate toolkit for distributed training. It offers powerful infrastructure that not only handles the scale of large AI models but also makes the whole process a lot easier.

What Makes Azure Stand Out?
Optimized Infrastructure: Azure's infrastructure is built for high-performance computing (HPC). With ultra-fast InfiniBand networking, Azure's VMs (Virtual Machines) allow for seamless data transfer between GPUs and nodes. This is critical when training large models that require low-latency communication between devices.
Top-Notch GPU Offerings: Azure provides access to some of the latest and greatest GPUs, like NVIDIA's A100 and H100 models. These GPUs are purpose-built for deep learning, featuring tensor cores that accelerate matrix computations, the backbone of deep learning. And they're interconnected with NVLink and NVSwitch technology, which significantly reduces data transfer delays. This makes Azure the perfect playground for massive model training.
Scalable Architecture: Azure Machine Learning provides a versatile range of compute options that adapt to the demands of large-scale model training, from experimentation to full-scale distributed training. At the core are compute clusters, which allow you to set up managed clusters of virtual machines that can automatically scale up or down based on workload needs. These clusters support various VM types, including GPU-optimized options like the ND A100 v4 series, powered by NVIDIA A100 GPUs, ideal for high-performance distributed training. For smaller-scale development, compute instances offer on-demand, single-node machines for interactive sessions, making them perfect for prototyping and debugging. For budget-conscious projects, Azure Machine Learning also supports spot VMs in compute clusters, which utilize unused Azure capacity at a lower cost. This option is ideal for non-critical jobs like hyperparameter tuning, where interruptions are manageable. Together, these compute offerings ensure you can scale flexibly and efficiently, using the right resources for each stage of model development.
Explore more about Azure Machine Learning compute options, GPU-optimized virtual machines, and how to leverage spot VMs for cost savings on the Azure platform. Curious to see what distributed training looks like in practice? Here's a tutorial that walks you through setting up distributed training on Azure.
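As referenced above, here is a minimal sketch of data parallelism with PyTorch's DistributedDataParallel. It assumes the script is launched with torchrun (one process per GPU) and uses a placeholder model and dataset; it is illustrative rather than the tutorial's exact code.

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder dataset and model; replace with your own.
    dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))
    sampler = DistributedSampler(dataset)        # each rank sees a distinct shard of the data
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across ranks
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                 # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
```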
How Azure Enables Distributed Learning
Azure AI doesn't just provide raw power; it gives you the tools to manage, optimize, and streamline the distributed training process. Azure offers a suite of tools and frameworks specifically designed to make distributed training accessible, flexible, and efficient.
Azure Machine Learning SDK and CLI: Azure's Machine Learning SDK and CLI make it simple to set up, run, and manage distributed training jobs. With the SDK, you can define custom environments, set up compute clusters, and even submit jobs with YAML configurations, making it easy to replicate setups and automate workflows. (A minimal job-submission sketch follows at the end of this section.)
Support for Popular Frameworks: Azure ML is compatible with popular machine learning frameworks like PyTorch and TensorFlow, so you don't have to worry about changing your entire workflow. Azure ML has built-in support for distributed training within these frameworks, using strategies like Distributed Data Parallel (DDP) and Horovod, a framework designed for distributed deep learning.
Advanced Optimization with DeepSpeed: Microsoft's DeepSpeed library is integrated with Azure, providing state-of-the-art optimizations for large model training. DeepSpeed's memory and computation optimizations, like the ZeRO Optimizer, allow you to train larger models more efficiently, reducing memory requirements and improving training speed.
Hyperparameter Tuning with HyperDrive: Azure ML's HyperDrive tool makes hyperparameter tuning straightforward. Define search spaces and optimization strategies, and HyperDrive will run parallel trials to find the best configurations, even stopping underperforming trials early to save resources. It's hyperparameter tuning on autopilot!
Monitoring and Diagnostics: Azure provides real-time monitoring with Azure ML studio dashboards, showing metrics like GPU utilization, loss curves, and throughput. For deeper insights, tools like Azure Monitor and NVIDIA Nsight Systems provide detailed diagnostics, helping you identify bottlenecks and optimize your training jobs.
This robust toolkit ensures that Azure can handle not only the scale but also the complexity of distributed training, providing the infrastructure and tools you need to train the most advanced AI models efficiently.
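As mentioned in the SDK and CLI item above, here is a minimal sketch of submitting a distributed PyTorch job with the Azure ML Python SDK v2. The workspace details, environment name, compute cluster, and training script are placeholders, and the script is assumed to use DistributedDataParallel as sketched earlier.

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>"
)

job = command(
    code="./src",                                   # contains train_ddp.py (placeholder)
    command="python train_ddp.py --epochs 3",
    environment="azureml:acpt-pytorch-env@latest",  # placeholder PyTorch environment
    compute="gpu-cluster",                          # placeholder multi-node GPU cluster
    instance_count=2,                               # two nodes
    distribution={
        "type": "pytorch",                          # Azure ML launches one process per GPU
        "process_count_per_instance": 4,            # e.g., 4 GPUs per node
    },
)

ml_client.jobs.create_or_update(job)
```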
Real-World Success: What Makes Azure Stand Out for Distributed Learning and AI
Azure AI Foundry is more than just a platform: it's a powerhouse for enabling organizations to achieve groundbreaking results in AI. What makes Azure stand out in distributed learning is its unique combination of high-performance infrastructure, scalability, and a suite of tools designed to make distributed training as efficient and accessible as possible. Here are a few key reasons why Azure is the go-to choice for distributed AI training:
High-Performance Infrastructure: Azure offers high-performance computing (HPC) resources that are essential for large-scale training. Features like InfiniBand networking provide ultra-low latency and high throughput, making it ideal for workloads that require constant communication across GPUs and nodes. This enables faster synchronization and helps avoid bottlenecks in distributed setups.
Advanced GPU Options: With NVIDIA's latest GPUs, such as the A100 and H100, Azure delivers the computational muscle required for deep learning tasks. These GPUs, designed with AI in mind, feature tensor cores that accelerate complex calculations, making them perfect for training large models. Azure's NVLink and NVSwitch technology connect these GPUs for fast data transfer, further boosting performance.
Scalability with VM Scale Sets: One of Azure's key differentiators is its VM Scale Sets, which allow for elastic scaling based on workload demands. This means that you can start small and scale up as your models and datasets grow. Azure's auto-scaling capabilities ensure that resources are used efficiently, lowering costs while meeting the needs of even the largest models.
All-in-One Machine Learning Platform: With Azure Machine Learning (Azure ML), you get an end-to-end platform that handles everything from compute cluster management to environment setup and job orchestration. Azure ML takes care of the heavy lifting, enabling you to focus on developing and optimizing your models.
Integration with Open-Source and Proprietary Tools: Azure supports all major machine learning frameworks and has its own optimization tools like DeepSpeed and HyperDrive. This flexibility lets you pick the best tools for your specific needs, while benefiting from Azure's optimized infrastructure.
Azure's distributed training capabilities make it possible for organizations to push the boundaries of what's possible with AI. From improving training speed to enabling real-time insights, Azure is setting the standard for large-scale AI success.

Wrapping Up: The Future of Large-Scale AI Training
As AI models grow in complexity and capability, the need for efficient, large-scale training will only become more pressing. Distributed training, powered by platforms like Azure AI, is paving the way for the next generation of AI. It offers a robust solution to the limitations of single-device training, enabling faster development, greater scalability, and better performance. Whether you're working in NLP, computer vision, healthcare, or finance, the ability to train large models efficiently is a game-changer. Ready to scale up your AI? Explore distributed training best practices and discover the power of large-scale AI development.
Fine-tune FLUX.1 LORA with your own images and Deploy using Azure Machine Learning

The landscape of artificial intelligence and machine learning continues to evolve rapidly, with significant advancements in generative AI models. One such notable development comes from Black Forest Labs with their FLUX.1 suite of models. These models push the boundaries of text-to-image synthesis, offering unparalleled image detail, prompt adherence, and style diversity. In this blog, we will delve into the process of fine-tuning the FLUX model using Dreambooth, a method that has gained traction for its effectiveness in producing high-quality, customized AI-generated content.
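The body of that post is not included here, but as a small illustration of its starting point, the sketch below loads a FLUX.1 base model with Hugging Face diffusers and applies a LoRA adapter produced by Dreambooth-style fine-tuning. The model ID, dtype, LoRA path, and prompt are assumptions for illustration; the post's own training and deployment steps run on Azure Machine Learning.

```python
import torch
from diffusers import FluxPipeline

# Load a FLUX.1 base model (FLUX.1-schnell is the openly licensed variant;
# FLUX.1-dev requires accepting its license on Hugging Face).
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

# Apply a LoRA adapter produced by Dreambooth-style fine-tuning (path is a placeholder).
pipe.load_lora_weights("./outputs/my_flux_lora")

image = pipe(
    "a photo of sks dog in a superhero costume",  # "sks" is a typical Dreambooth trigger token
    num_inference_steps=4,                        # schnell is distilled for few-step sampling
    guidance_scale=0.0,
).images[0]
image.save("flux_lora_sample.png")
```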