
Fine-Tuning DeepSeek-R1-Distill-Llama-8B with PyTorch FSDP, QLoRA on Azure Machine Learning

mrajguru
Feb 13, 2025

Large Language Models (LLMs) have demonstrated remarkable capabilities across various industries, revolutionizing how we approach tasks like legal document summarization, creative content generation, and customer sentiment analysis. However, adapting these general-purpose models to excel in specific domains often requires fine-tuning, which lets us tailor LLMs to unique requirements and improve their performance on targeted tasks.

In this blog post, we'll explore the process of fine-tuning the DeepSeek-R1-Distill-Llama-8B model, highlighting the advantages of using PyTorch Fully Sharded Data Parallel (FSDP) and Quantized Low-Rank Adaptation (QLoRA) techniques in conjunction with the Azure Machine Learning platform.

Why Fine-Tuning Matters

LLMs may not perform well on specific domains, tasks, or datasets, or may produce inaccurate or misleading outputs. In such cases, fine-tuning the model is a useful way to adapt it to the desired goal and improve its quality and reliability. Two common problems that fine-tuning helps address are:

  • Hallucinations: Hallucinations are untrue statements output by the model. They can harm the credibility and trustworthiness of your application. One possible mitigation is fine-tuning the model with data that contains accurate and consistent information.
  • Accuracy and quality problems: Pre-trained models may not achieve the desired level of accuracy or quality for a specific task or domain. This shortfall can be due to a mismatch between the pre-training data and the target data, the diversity and complexity of the target data, and/or incorrect evaluation metrics and criteria.

DeepSeek-R1 is an open-source language model excelling in text-based tasks, including creative writing, question answering, editing, and summarization. It's particularly strong in reasoning-intensive tasks like coding, math, and explaining scientific concepts. DeepSeek-R1 stands out due to its mixture of experts (MoE) architecture and its use of reinforcement learning (RL), achieving high performance with greater efficiency and lower cost than comparable models. It has 671 billion parameters spread across multiple expert networks, but only 37 billion are activated for a single forward pass. RL training teaches the model to generate a chain-of-thought (CoT) before delivering its final answer. To make these capabilities more accessible, DeepSeek has distilled R1's outputs into several smaller models based on the Qwen and Llama architectures:

  • Qwen-based distilled models: 1.5B, 7B, 14B and 32B
  • Llama-based distilled models: 8B and 70B

DeepSeek-R1-Distill-Llama-8B is a distilled large language model (LLM) based on the Llama architecture, created using outputs from the larger DeepSeek-R1 model. Through knowledge distillation, the reasoning patterns of the larger 671 billion parameter DeepSeek-R1 model are transferred into a smaller, more efficient model. The DeepSeek-R1-Distill-Llama-8B has only 8 billion parameters, making it computationally efficient while retaining a significant portion of the original model's performance. It is fine-tuned from models like Llama-3.1-8B-Instruct, achieving high performance across multiple benchmarks. This distilled model offers a balance of performance and resource requirements, improving inference speed and reducing computational costs, making it cost-effective for production deployments.
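
If you want to try the distilled model before fine-tuning it, here is a minimal sketch using Hugging Face transformers (it assumes a GPU with enough memory and that the transformers and accelerate packages are installed; the example prompt is illustrative only):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# The model emits a chain-of-thought inside <think> tags before its final answer
inputs = tokenizer("Explain why the sky appears blue.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))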

PyTorch FSDP: Scaling Fine-Tuning with Data Parallelism

PyTorch Fully Sharded Data Parallel (FSDP) is a distributed training framework that addresses the challenges of fine-tuning large models by sharding model parameters, optimizer states, and gradients across multiple GPUs. This technique enables you to train models with billions of parameters on systems with limited GPU memory.
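
To make this concrete, here is a minimal sketch of enabling FSDP through Hugging Face TrainingArguments; the values mirror the configuration used in the full training script later in this post:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    bf16=True,
    # shard parameters, gradients, and optimizer states; auto-wrap layers; offload to CPU
    fsdp="full_shard auto_wrap offload",
    fsdp_config={
        "backward_prefetch": "backward_pre",
        "cpu_ram_efficient_loading": True,
        "offload_params": True,
    },
)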

QLoRA: Efficient Fine-Tuning with Quantization and Low-Rank Adaptation

Quantized Low-Rank Adaptation (QLoRA) is a parameter-efficient fine-tuning technique that reduces memory usage by loading the base model in 4-bit precision and training only small low-rank adapter matrices on top of the frozen, quantized weights. Because only this small subset of parameters is updated, training is faster and far more memory-efficient.
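
As a minimal sketch (assuming the transformers, bitsandbytes, and peft packages), a QLoRA setup loads the base model in 4-bit NF4 precision and attaches small trainable LoRA adapters on top of the frozen weights; the full training script below follows the same pattern:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# Only the low-rank adapter weights are trainable; the 4-bit base weights stay frozen
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()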

Azure Machine Learning: Your Platform for Scalable Fine-Tuning

Azure Machine Learning provides a robust platform for fine-tuning LLMs, offering a comprehensive suite of tools and services to streamline the process.

  • Scalable Compute: Azure Machine Learning Compute provides virtual machines (VMs) that run parts of the distributed deep learning job, auto-scaling as necessary. Azure Machine Learning compute clusters can schedule tasks, collect results, adjust resources to actual loads, and manage errors[5]. VMs that participate in the cluster can be GPU-enabled to accelerate deep learning calculations.
  • Data Storage: Azure offers standard and premium blob storage options for storing training data and execution logs. Premium blob storage is used to store training data and enable high-performance access during model training, which is needed for distributed training.
  • Experiment Tracking: Azure Machine Learning provides tools for tracking and managing your fine-tuning experiments, allowing you to monitor performance metrics and reproduce your results.

Hands-on lab

Now let's fine-tune the model and deploy it on Azure Machine Learning (AML).

First, we set up an Azure Machine Learning (ML) client using DefaultAzureCredential for authentication. The cell below imports the necessary libraries and handles any exception raised during ML client initialization.

# import required libraries
"""
This script sets up an Azure Machine Learning (ML) client using the DefaultAzureCredential for authentication. 
It imports necessary libraries and handles exceptions during the ML client initialization.

Modules imported:
- time: Provides various time-related functions.
- azure.identity: Provides authentication capabilities with DefaultAzureCredential and InteractiveBrowserCredential.
- azure.ai.ml: Contains classes and functions for interacting with Azure ML services, including MLClient, Input, pipeline, load_component, command, Data, Environment, BuildContext, Model, Input, Output, and AssetTypes.
- azure.core.exceptions: Contains exceptions for handling resource-related errors.
- os: Provides a way to interact with the operating system.

Variables:
- credential: An instance of DefaultAzureCredential used for authenticating with Azure services.
- ml_client: An instance of MLClient initialized using the provided credentials. If the initialization fails, an exception is caught and printed.
"""
import time
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component
from azure.ai.ml import command
from azure.ai.ml.entities import Data, Environment, BuildContext
from azure.ai.ml.entities import Model
from azure.ai.ml import Input
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes
from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError
import os 

credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)

Now let's install the libraries required to download the dataset and run the OpenAI client.

 

%conda run -n azureml_py310_sdkv2 pip install datasets==3.2.0 openai

 

Let's create a folder for our training environment.

os.makedirs("environment_train", exist_ok=True)

 

Let's write the Dockerfile for our training environment.

%%writefile environment_train/Dockerfile
FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch22x:biweekly.202501.3

USER root

# support Deepspeed launcher requirement of passwordless ssh login
RUN apt-get update && apt-get -y upgrade
RUN pip install --upgrade pip
RUN apt-get install -y openssh-server openssh-client

# Install pip dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir

RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation

 

Let’s also specify our requirements.txt

%%writefile environment_train/requirements.txt
transformers==4.48.2
peft==0.14.0
accelerate==1.3.0
bitsandbytes==0.45.1
datasets==3.2.0
evaluate==0.4.3
huggingface_hub[hf_transfer]
safetensors>=0.5.2
sentencepiece==0.2.0
scikit-learn==1.6.1
tokenizers>=0.21.0
py7zr

 

Once both files are in place, let's create the AML custom training environment.

env_name = "deepseek-training"
env_docker_image = Environment(
                build=BuildContext(path = "environment_train", dockerfile_path="Dockerfile"),
                name=env_name,
                description="Environment created for llm fine-tuning.",
                version="1"
            )
env_asset_train = ml_client.environments.create_or_update(env_docker_image)

 

While the training environment is being built, let's start preparing the dataset.

from datasets import load_dataset
import pandas as pd

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en")

df = pd.DataFrame(dataset['train'])
df = df.iloc[0:2000]

df.head()

 

Here is a quick snapshot of what the dataset looks like: each record contains a Question, a Complex_CoT reasoning trace, and a Response.

 

Now let's split the dataset into train and test sets for validation.

 

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.1, random_state=42)

print("Number of train elements: ", len(train))
print("Number of test elements: ", len(test))

 

Let's create the prompt template for the fine-tuning process. In this case we use a chain-of-thought (CoT) prompt template.

# custom instruct prompt start
prompt_template = f"""
<|begin▁of▁sentence|>
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
<|User|>
{{question}}
<|Assistant|>
<think>
{{complex_cot}}
</think>

{{answer}}
<|end▁of▁sentence|>
"""

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = prompt_template.format(question=sample["Question"],
                                            complex_cot=sample["Complex_CoT"],
                                            answer=sample["Response"])
    return sample

 

Let's map this prompt template over the whole dataset and create the train and test JSONL files.

 

from datasets import Dataset, DatasetDict
from random import randint

train_dataset = Dataset.from_pandas(train)
test_dataset = Dataset.from_pandas(test)

dataset = DatasetDict({"train": train_dataset, "test": test_dataset})

train_dataset = dataset["train"].map(template_dataset, remove_columns=list(dataset["train"].features))

print(train_dataset[randint(0, len(train_dataset) - 1)]["text"])

test_dataset = dataset["test"].map(template_dataset, remove_columns=list(dataset["test"].features))

os.makedirs("data", exist_ok=True)
train_dataset.to_json("data/train.jsonl")
test_dataset.to_json("data/eval.jsonl")

 

 

Now let’s start creating our training script.

os.makedirs("src_train", exist_ok=True)

 

Now write train.py, which uses both QLoRA and PyTorch FSDP.

%%writefile src_train/train.py

import os
import argparse
import sys
import logging
from accelerate import Accelerator
import datetime
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, set_seed
import transformers
import traceback
from huggingface_hub import snapshot_download
from datasets import load_dataset

def download_model(model_name):
    print("Downloading model ", model_name)

    os.makedirs("/tmp/tmp_folder", exist_ok=True)

    snapshot_download(repo_id=model_name, local_dir="/tmp/tmp_folder")

    print(f"Model {model_name} downloaded under /tmp/tmp_folder")

def init_distributed():
    # Initialize the process group
    torch.distributed.init_process_group(
        backend="nccl", # Use "gloo" backend for CPU
        timeout=datetime.timedelta(seconds=5400)
    )
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    return local_rank

def main(args):
    model_name = args.model_name_or_path
    train_ds = load_dataset('json', data_files=args.train_file, split='train')
    test_ds = load_dataset('json', data_files=args.eval_file, split='train')
    per_device_train_batch_size=args.train_batch_size
    per_device_eval_batch_size=args.eval_batch_size
    gradient_accumulation_steps=args.grad_accum_steps
    learning_rate=args.learning_rate
    num_train_epochs=args.epochs
    lora_r=8
    lora_alpha=16
    lora_dropout=0.1
    fsdp="full_shard auto_wrap offload"
    fsdp_config={
            'backward_prefetch': 'backward_pre',
            'cpu_ram_efficient_loading': True,
            'offload_params': True,
            'forward_prefetch': False,
            'use_orig_params': False
        }
    gradient_checkpointing=False
    merge_weights=True
    seed=42
    token=None
    model_dir = args.model_dir

    if torch.cuda.is_available() and (torch.cuda.device_count() > 1 or int(os.environ.get("SM_HOST_COUNT", 1)) > 1):
        # SM_HOST_COUNT defaults to 1 when it is not set; initialize distributed training for multi-GPU or multi-node runs
        local_rank = init_distributed()

        # Now you can use distributed functionalities
        torch.distributed.barrier(device_ids=[local_rank])

    os.environ.update({"HF_HUB_ENABLE_HF_TRANSFER": "1"})

    set_seed(seed)

    accelerator = Accelerator()

    if token is not None:
        os.environ.update({"HF_TOKEN": token})
        accelerator.wait_for_everyone()

    if int(os.environ.get("SM_HOST_COUNT", 1)) == 1:
        if accelerator.is_main_process:
            download_model(model_name)
    else:
        download_model(model_name)

    accelerator.wait_for_everyone()

    model_name = "/tmp/tmp_folder"

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Set Tokenizer pad Token
    tokenizer.pad_token = tokenizer.eos_token

    with accelerator.main_process_first():
        # tokenize and chunk dataset
        lm_train_dataset = train_ds.map(
            lambda sample: tokenizer(sample["text"]), remove_columns=list(train_ds.features)
        )

        print(f"Total number of train samples: {len(lm_train_dataset)}")

        if test_ds is not None:
            lm_test_dataset = test_ds.map(
                lambda sample: tokenizer(sample["text"]), remove_columns=list(test_ds.features)
            )

            print(f"Total number of test samples: {len(lm_test_dataset)}")
        else:
            lm_test_dataset = None

    torch_dtype = torch.bfloat16

    # Defining additional configs for FSDP
    if fsdp != "" and fsdp_config is not None:
        bnb_config_params = {
            "bnb_4bit_quant_storage": torch_dtype
        }

        model_configs = {
            "torch_dtype": torch_dtype
        }

        fsdp_configurations = {
            "fsdp": fsdp,
            "fsdp_config": fsdp_config,
            "gradient_checkpointing_kwargs": {
                "use_reentrant": False
            },
            "tf32": True
        }
    else:
        bnb_config_params = dict()
        model_configs = dict()
        fsdp_configurations = dict()

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch_dtype,
        **bnb_config_params
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        quantization_config=bnb_config,
        attn_implementation="flash_attention_2",
        use_cache=not gradient_checkpointing,
        cache_dir="/tmp/.cache",
        **model_configs
    )

    if fsdp == "" and fsdp_config is None:
        model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=gradient_checkpointing)

    if gradient_checkpointing:
        model.gradient_checkpointing_enable()

    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules="all-linear",
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, config)

    trainer = transformers.Trainer(
        model=model,
        train_dataset=lm_train_dataset,
        eval_dataset=lm_test_dataset if lm_test_dataset is not None else None,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=per_device_train_batch_size,
            per_device_eval_batch_size=per_device_eval_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            gradient_checkpointing=gradient_checkpointing,
            logging_strategy="steps",
            logging_steps=1,
            log_on_each_node=False,
            num_train_epochs=num_train_epochs,
            learning_rate=learning_rate,
            bf16=True,
            ddp_find_unused_parameters=False,
            save_strategy="no",
            output_dir="outputs",
            **fsdp_configurations
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    if trainer.accelerator.is_main_process:
        trainer.model.print_trainable_parameters()

    trainer.train()

    if trainer.is_fsdp_enabled:
        trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

    if merge_weights:
        output_dir = "/tmp/model"

        # merge adapter weights with base model and save
        # save int 4 model
        trainer.model.save_pretrained(output_dir, safe_serialization=False)

        if accelerator.is_main_process:
            # clear memory
            del model
            del trainer

            torch.cuda.empty_cache()

            # load PEFT model
            model = AutoPeftModelForCausalLM.from_pretrained(
                output_dir,
                torch_dtype=torch.float16,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
            )

            # Merge LoRA and base model and save
            model = model.merge_and_unload()
            model.save_pretrained(
                model_dir,
                safe_serialization=True,
                max_shard_size="2GB"
            )
    else:
        trainer.model.save_pretrained(
            model_dir,
            safe_serialization=True
        )

    if accelerator.is_main_process:
        tokenizer.save_pretrained(model_dir)

    accelerator.wait_for_everyone()


def parse_args():
    # setup argparse
    parser = argparse.ArgumentParser()
    # curr_time = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")

    # hyperparameters
    parser.add_argument("--model_name_or_path", default="deepseek-ai/DeepSeek-R1-Distill-Llama-8B", type=str, help="Input directory for training")    
    parser.add_argument("--train_file", type=str, help="Input data for training")
    parser.add_argument("--eval_file", type=str, help="Input data for eval")
    parser.add_argument("--epochs", default=1, type=int, help="number of epochs")
    parser.add_argument("--train_batch_size", default=2, type=int, help="training - mini batch size for each gpu/process")
    parser.add_argument("--eval_batch_size", default=4, type=int, help="evaluation - mini batch size for each gpu/process")
    parser.add_argument("--grad_accum_steps", default=4, type=int, help="gradient accumulation steps")
    parser.add_argument("--learning_rate", default=2e-4, type=float, help="learning rate")
    parser.add_argument("--save_merged_model", type=bool, default=False)
    parser.add_argument("--model_dir", type=str, default="./", help="output directory for model")
    # parse args
    args = parser.parse_args()

    # return args
    return args



if __name__ == "__main__":
    #sys.argv = ['']
    args = parse_args()
    main(args)

The next step is to create a compute cluster on which the training will run.

azure_compute_cluster_name = "a100-compute"
azure_compute_cluster_size = "Standard_NC24ads_A100_v4"
USE_LOWPRIORITY_VM = True

from azure.ai.ml.entities import AmlCompute

### Create the compute cluster
try:
    compute = ml_client.compute.get(azure_compute_cluster_name)
except Exception as ex:
    try:
        tier = "LowPriority" if USE_LOWPRIORITY_VM else "Dedicated"
        compute = AmlCompute(
            name=azure_compute_cluster_name,
            size=azure_compute_cluster_size,
            tier=tier,
            max_instances=1,  # For multi node training set this to an integer value more than 1
        )
        ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        print(e)

Once the compute is ready, let's run the training job.

 

from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml.entities import ResourceConfiguration

str_command = ""
str_command += "python train.py --train_file ${{inputs.train_file}} --eval_file ${{inputs.eval_file}} \
            --epochs ${{inputs.epoch}} --train_batch_size ${{inputs.train_batch_size}} \
            --eval_batch_size ${{inputs.eval_batch_size}} --model_name_or_path ${{inputs.model_name_or_path}} \
            --model_dir ${{inputs.model_dir}} --save_merged_model ${{inputs.save_merged_model}}"



job = command(
    inputs=dict(
        train_file=Input(
            type="uri_file",
            path="data/train.jsonl",
        ),
        eval_file=Input(
            type="uri_file",
            path="data/eval.jsonl",
        ),
        epoch=1,
        train_batch_size=2,
        eval_batch_size=1,
        model_name_or_path="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        model_dir="./outputs",
        save_merged_model = True
    ),
    code="./src_train",  # local path where the code is stored
    compute=azure_compute_cluster_name,
    command=str_command,
    environment=env_asset_train,
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,  # For multi-gpu training set this to an integer value more than 1
    },
)

returned_job = ml_client.jobs.create_or_update(job)
ml_client.jobs.stream(returned_job.name)
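
If you close the notebook while the job is running, you can re-attach to it later by name. A minimal sketch, assuming the same ml_client and the returned_job object from above:

# Re-fetch the submitted job and check its status
job = ml_client.jobs.get(returned_job.name)
print(job.status)      # e.g. "Running" or "Completed"
print(job.studio_url)  # link to monitor the run in Azure ML studio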

Once the training is complete, let's register the model as a custom model asset.

from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

run_model = Model(
    path=f"azureml://jobs/{returned_job.name}/outputs/artifacts/paths/outputs/",
    name="deepseekr1-dist-llama8bft",
    description="Model created from run.",
    type=AssetTypes.CUSTOM_MODEL,
)

model = ml_client.models.create_or_update(run_model)

 

Once the model is registered, the next step is to deploy it to a managed online endpoint.

 

from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
)

endpoint_name = "deepseekr1-dist-llama8bft-ep"

# Check if the endpoint already exists in the workspace
try:
    endpoint = ml_client.online_endpoints.get(endpoint_name)
    print("---Endpoint already exists---")
except:
    # Create an online endpoint if it doesn't exist

    # Define the endpoint
    endpoint = ManagedOnlineEndpoint(
        name=endpoint_name,
        description=f"Test endpoint for {model.name}"
    )

# Trigger the endpoint creation
try:
    ml_client.begin_create_or_update(endpoint).wait()
    print("\n---Endpoint created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Endpoint creation failed. Detailed Response:\n{err}"
    ) from err

 

Let's define the deployment name, the VM SKU, and the request timeout, and create a folder for the inference environment.

# Initialize deployment parameters

deployment_name = "deepseekr1-dist-llama8bftd-eploy"
sku_name = "Standard_NC24ads_A100_v4"
REQUEST_TIMEOUT_MS = 90000
os.makedirs("environment_inf", exist_ok=True)

 

Let's create the Dockerfile for our inference environment, based on the vLLM OpenAI-compatible server image.

%%writefile environment_inf/Dockerfile
FROM vllm/vllm-openai:latest

ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server --model $MODEL_NAME $VLLM_ARGS

 

Let's build the AML environment from the Dockerfile created above. The inference_config maps the liveness, readiness, and scoring routes to the vLLM server's port and paths.

from azure.ai.ml.entities import Environment, BuildContext
env_docker_image = Environment(
            build=BuildContext(path="environment_inf", dockerfile_path= "Dockerfile"),
            name="vllm-custom",
            description="Environment created from a Docker context.",
            inference_config={
                "liveness_route": {
                    "port": 8000,
                    "path": "/health",
                },
                "readiness_route": {
                    "port": 8000,
                    "path": "/health",
                },
                "scoring_route": {
                    "port": 8000,
                    "path": "/",
                },
            },
        )
env_asset_inf = ml_client.environments.create_or_update(env_docker_image)

 

Once our inference environment is ready, let's deploy. First, define the environment variables that the Dockerfile's ENTRYPOINT passes to vLLM: MODEL_NAME points to the path where AML mounts the registered model inside the container, and VLLM_ARGS supplies additional server arguments.

model_path = f"/var/azureml-app/azureml-models/{model.name}/{model.version}/outputs"
env_vars = {
    "MODEL_NAME": model_path,
    "VLLM_ARGS": "--max-model-len 16000 --enforce-eager",
}
deployment_env_vars = {**env_vars}

 

Let's create the deployment now.

import time
from azure.ai.ml.entities import (    
    OnlineRequestSettings,
    CodeConfiguration,
    ManagedOnlineDeployment,
    ProbeSettings,
    Environment
)

t0 = time.time()
deployment = ManagedOnlineDeployment(
    name=  deployment_name,
    endpoint_name=endpoint_name,
    model=model,
    instance_type=sku_name,
    instance_count=1,
    environment_variables=deployment_env_vars,    
    environment=env_asset_inf,
    request_settings=OnlineRequestSettings(
        max_concurrent_requests_per_instance=2,
        request_timeout_ms=50000, 
        max_queue_wait_ms=60000
    ),
    liveness_probe=ProbeSettings(
        failure_threshold=5,
        success_threshold=1,
        timeout=10,
        period=30,
        initial_delay=120
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        timeout=2,
        period=10,
        initial_delay=120,
    ),
)

# Trigger the deployment creation
try:
    ml_client.begin_create_or_update(deployment).wait()
except Exception as err:
    raise RuntimeError(
        f"Deployment creation failed. Detailed Response:\n{err}"
    ) from err
    
endpoint.traffic = {deployment_name: 100}
endpoint_poller = ml_client.online_endpoints.begin_create_or_update(endpoint)

 

Our endpoint is now deployed! Let's test it.

endpoint_results = endpoint_poller.result()
endpoint_name = endpoint_results.name
keys = ml_client.online_endpoints.get_keys(name=endpoint_name)
primary_key = keys.primary_key
url = os.path.join(endpoint_results.scoring_uri, "v1")

 

Once we have the API key, we can use the OpenAI client to stream tokens from the endpoint.

from openai import OpenAI
vllm_client = OpenAI(base_url=url, api_key=primary_key)

# Create your prompt
system_message = """You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response."""

user_message = f"""A 3-week-old child has been diagnosed with late onset perinatal meningitis, and the CSF culture shows gram-positive bacilli. What characteristic of this bacterium can specifically differentiate it from other bacterial agents?"""

response = vllm_client.chat.completions.create(
    model=model_path,
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ],
    temperature=0.7,
    max_tokens=4000,
    stream=True,  # Stream the response
)
 
print("Streaming response:")
for chunk in response:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)

 

Conclusion

Fine-tuning the DeepSeek-R1-Distill-Llama-8B model with PyTorch FSDP and QLoRA on Azure Machine Learning offers a powerful approach to customising LLMs for specific tasks. By leveraging the scalability and efficiency of these techniques, you can unlock the full potential of LLMs and drive innovation in your domain. I hope you found this walkthrough useful; like the post and follow me for more content like this.

Thanks

Manoranjan Rajguru

AI Global Black Belt

Published Feb 13, 2025
Version 1.0