Introduction
This guide provides step-by-step instructions for running DeepSeek on Azure Kubernetes Service (AKS). The setup uses an ND-H100-v5 VM to fit the 4-bit quantized, 671-billion-parameter model on a single node.
Prerequisites
Before proceeding, ensure you have an AKS cluster with a node pool containing at least one ND-H100-v5 instance. Additionally, make sure the NVIDIA GPU drivers are properly set up.
Important: Set the OS disk size to 1024GB to accommodate the model.
For detailed instructions on creating an AKS cluster, refer to this guide.
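As a minimal sketch, a suitable node pool can be added to an existing cluster with the Azure CLI. The resource group, cluster, and pool names below are placeholders; Standard_ND96isr_H100_v5 is the ND-H100-v5 VM size, and --node-osdisk-size sets the 1024GB OS disk mentioned above:

az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name ndh100 \
  --node-count 1 \
  --node-vm-size Standard_ND96isr_H100_v5 \
  --node-osdisk-size 1024

Once the node is ready and the drivers are in place, kubectl describe node <node-name> should report nvidia.com/gpu: 8 under Allocatable.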
Deploying Ollama with DeepSeek on AKS
We will use the PyTorch container from NVIDIA to deploy Ollama. Because this image is hosted on the NVIDIA NGC registry (nvcr.io), first configure authentication on AKS:
- Create an NGC API Key
  - Visit NGC API Key Setup.
  - Generate an API key.
- Create a Secret in the AKS Cluster

Run the following command to create a Kubernetes secret for authentication:

kubectl create secret docker-registry nvcr-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<YOUR_NGC_API_KEY>
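To confirm the secret exists before deploying, you can run:

kubectl get secret nvcr-secret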
Deploy the Pod
Use the following Kubernetes manifest to deploy Ollama, download the model, and start serving it:
apiVersion: v1
kind: Pod
metadata:
  name: ollama
spec:
  imagePullSecrets:
    - name: nvcr-secret
  containers:
    - name: ollama
      image: nvcr.io/nvidia/pytorch:24.09-py3
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
      volumeMounts:
        - mountPath: /dev/shm
          name: shmem
      resources:
        requests:
          nvidia.com/gpu: 8
          nvidia.com/mlnxnics: 8
        limits:
          nvidia.com/gpu: 8
          nvidia.com/mlnxnics: 8
      command:
        - bash
        - -c
        - |
          export DEBIAN_FRONTEND=noninteractive
          export OLLAMA_MODELS=/mnt/data/models
          mkdir -p $OLLAMA_MODELS
          apt update
          apt install -y curl pciutils net-tools
          # Install Ollama
          curl -fsSL https://ollama.com/install.sh | sh
          # Start the Ollama server in the background, logging to a file
          ollama serve 2>&1 | tee /mnt/data/ollama.log &
          # Wait for Ollama to fully start before pulling the model
          sleep 5
          # Pull the model
          ollama pull deepseek-r1:671b
          # Fix parameter issue (https://github.com/ollama/ollama/issues/8599)
          cat >Modelfile <<EOF
          FROM deepseek-r1:671b
          PARAMETER num_ctx 24576
          PARAMETER num_predict 8192
          EOF
          ollama create deepseek-r1:671b-fixed -f Modelfile
          # Keep the container running
          wait
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 128Gi
      name: shmem
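Save the manifest to a file (ollama-pod.yaml is a placeholder name), apply it, and follow the container logs to watch the model download and server startup:

kubectl apply -f ollama-pod.yaml
kubectl logs -f ollama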
Connecting to Ollama
By default, this setup does not create a service. You can either define one or use port forwarding to reach Ollama from your local machine. To forward the port, run:
kubectl port-forward pod/ollama 11434:11434
Now, you can interact with the DeepSeek reasoning model using your favorite chat client. For example, in Chatbox, you can ask:
"Tell me a joke about large language models."
Updated Jan 31, 2025 (Version 2.0)
pauledwards, Microsoft
Azure High Performance Computing (HPC) Blog