Paul Edwards - Principal Technical Program Manager - Azure Core HPC & AI
Dr. Wolfgang De Salvador - Senior Product Manager - Azure Storage
Introduction
This guide demonstrates how to use ZFS LocalPV to efficiently manage the NVMe storage available on several Azure HPC/AI VM types in the H-series and N-series families. For example, Azure ND H100 v5 VMs are equipped with eight 3.5TB NVMe disks; these VMs are tailored for high-performance workloads like AI/ML and large-scale data processing. By combining the flexibility of AKS with the advanced storage capabilities of ZFS, you can dynamically provision stateful node-local volumes while aggregating NVMe disks for optimal performance.
We'll walk through the process of setting up ZFS, from installing the kernel module to creating a ZFS pool using the NVMe disks. You'll learn how to provision persistent volumes (PVs) with ZFS LocalPV and validate the setup by running the FIO benchmark to measure its performance.
Efficiently leveraging the NVMe disks can significantly enhance workloads requiring high storage throughput and low latency. ZFS's unique features, such as pooling and data integrity, make it an excellent choice for managing NVMe storage in Kubernetes environments.
ZFS LocalPV enables the dynamic provisioning of persistent node-local volumes and filesystems within Kubernetes, integrated with the ZFS data storage stack. While ZFS brings additional features like compression and data integrity, its primary advantage in this setup is its ability to aggregate NVMe disks into a unified pool for seamless integration with Kubernetes persistent volumes (PVs).
Comparing Options for Utilizing Local NVMe Storage
The following table summarizes four approaches for managing NVMe disks in Kubernetes, focusing on their ability to aggregate disks and integrate with Kubernetes persistent volumes (PVs):
| Name | Aggregate Disks | Managed by PVs | Comments |
| --- | --- | --- | --- |
| Host Path Mount | Yes | No | Offers direct disk control, but lacks integration with Kubernetes and pod isolation. |
| Azure Container Storage | No | Yes | Allows slicing of NVMe disks into multiple PVs but doesn’t support aggregation. |
| Local Provisioner | No | Yes | Maps each NVMe disk to a PV with a simple configuration. |
| ZFS LocalPV | Yes | Yes | Supports disk aggregation and Kubernetes PVs but requires installing the ZFS kernel module. |
Among these options, ZFS LocalPV stands out as the only solution offering both disk aggregation and PV management. However, its setup requires installing the ZFS kernel module and configuring ZFS on the host, which can be a barrier in environments with strict kernel module policies.
Installation
To use ZFS with local NVMe disks on AKS, you need to install ZFS on the host node and set up a ZFS pool. This involves installing kernel modules and managing disk partitioning directly on the host. Before proceeding, ensure that your environment allows kernel module installation and that you have the necessary administrative permissions to perform these operations.
Prerequisites and Considerations
- Host Packages: The installation process will modify the host by installing required packages, including:
- gdisk: For disk partitioning and management.
- zfsutils-linux: For ZFS user-space utilities.
- kmod: For managing kernel modules, including loading the ZFS module.
- Security Context: This setup requires administrative privileges because it involves modifying the host's file system, managing kernel modules, and partitioning disks. The DaemonSet uses a privileged security context to ensure it has the necessary permissions to interact with the host.
- Kernel Compatibility: Ensure that your host's kernel version is compatible with ZFS. This setup has been tested with the latest Ubuntu AKS node images; a quick way to check a node is shown below.
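A quick way to confirm the kernel version and NVMe devices on a node before deploying the DaemonSet is an ephemeral debug container. The node name below is a placeholder; substitute one of your HPC/AI nodes.

# Inspect the host kernel and NVMe namespaces (replace <node-name> with an actual node)
kubectl debug node/<node-name> -it --image=ubuntu:22.04 -- chroot /host bash -c "uname -r && ls /dev/nvme*n1"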
Installing ZFS and Setting Up a ZFS Pool
The following DaemonSet can be deployed to prepare AKS nodes for ZFS. Please be aware that the nodeAffinity specification covers many of the Azure HPC/AI SKUs with local NVMe disks on board, but it may require extension for your specific use case.
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: zfs-host-setup
namespace: kube-system
spec:
selector:
matchLabels:
name: zfs-host-setup
template:
metadata:
labels:
name: zfs-host-setup
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node.kubernetes.io/instance-type
operator: In
values:
- Standard_HB120-16rs_v2
- Standard_HB120-32rs_v2
- Standard_HB120-64rs_v2
- Standard_HB120-96rs_v2
- Standard_HB120rs_v2
- Standard_HB120-16rs_v3
- Standard_HB120-32rs_v3
- Standard_HB120-64rs_v3
- Standard_HB120-96rs_v3
- Standard_HB120rs_v3
- Standard_HB176-24rs_v4
- Standard_HB176-48rs_v4
- Standard_HB176-96rs_v4
- Standard_HB176-144rs_v4
- Standard_HB176rs_v4
- Standard_HX176-24rs
- Standard_HX176-48rs
- Standard_HX176-96rs
- Standard_HX176-144rs
- Standard_HX176rs
- Standard_NC24ads_A100_v4
- Standard_NC48ads_A100_v4
- Standard_NC96ads_A100_v4
- Standard_ND96isr_H100_v5
- Standard_NC40ads_H100_v5
- Standard_NC80adis_H100_v5
- Standard_NCC40ads_H100_v5
- Standard_ND96isr_H200_v5
- Standard_ND96isr_MI300X_v5
- Standard_ND96amsr_A100_v4
hostNetwork: true
hostPID: true
containers:
- name: zfs-host-setup
image: ubuntu:22.04
securityContext:
privileged: true
volumeMounts:
- name: host-root
mountPath: /host
command:
- bash
- -c
- |
set -euo pipefail
export NUMBER_OF_NVME=$(($(ls /dev/nvme*n1 | wc -l) - 1))
echo "Starting ZFS setup on node $(hostname)..."
echo "Updating package repository on the host..."
chroot /host apt update
echo "Installing required packages on the host..."
chroot /host apt install -y gdisk zfsutils-linux kmod
echo "Checking if ZFS pool 'zfspv-pool' already exists on the host..."
if chroot /host zpool list | grep -q zfspv-pool; then
echo "ZFS pool 'zfspv-pool' already exists. Skipping disk setup."
else
echo "ZFS pool does not exist. Proceeding with disk setup on the host..."
for disk in $(eval echo "/dev/nvme{0..$NUMBER_OF_NVME}n1"); do
echo "Preparing disk $disk..."
if [ -e $disk ]; then
chroot /host sgdisk --zap-all $disk
chroot /host sgdisk --new=1:0:0 $disk
else
echo "Disk $disk not found. Skipping."
fi
done
echo "Loading ZFS kernel module on the host..."
chroot /host modprobe zfs
echo "Creating ZFS pool 'zfspv-pool' on the host..."
chroot /host zpool create zfspv-pool $(eval echo "/dev/nvme{0..$NUMBER_OF_NVME}n1") || echo "Failed to create ZFS pool."
fi
echo "ZFS setup complete. Keeping pod running..."
sleep inf
volumes:
- name: host-root
hostPath:
path: /
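Assuming the manifest above is saved as zfs-host-setup.yaml (a filename chosen here for illustration), it can be deployed and verified as follows. The pod name in the last two commands is a placeholder taken from the kubectl get pods output.

# Deploy the node-preparation DaemonSet
kubectl apply -f zfs-host-setup.yaml

# Confirm a setup pod is running on each matching node
kubectl get pods -n kube-system -l name=zfs-host-setup -o wide

# The setup log should end with "ZFS setup complete. Keeping pod running..."
kubectl logs -n kube-system <zfs-host-setup-pod-name>

# Optionally confirm the aggregated pool on the host through the setup pod
kubectl exec -n kube-system <zfs-host-setup-pod-name> -- chroot /host zpool status zfspv-pool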
Installing zfs-localpv
The zfs-localpv driver can be installed using Helm:
helm install zfs-localpv https://openebs.github.io/zfs-localpv/zfs-localpv-2.6.2.tgz \
-n openebs --create-namespace --skip-crds
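Once the Helm release is installed, a minimal check is to confirm that the zfs-localpv pods are running in the openebs namespace and that the CSI driver used by the StorageClass below (zfs.csi.openebs.io) is registered:

# Controller and node-agent pods should be Running
kubectl get pods -n openebs

# The CSI driver referenced by the StorageClass should be listed
kubectl get csidriver zfs.csi.openebs.io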
Example Usage
We will validate the ZFS setup with a simple I/O benchmark using FIO (Flexible I/O Tester). Running the example involves the following steps:
- Create the StorageClass
- Create the PersistentVolumeClaim (PVC)
- Create a pod to run FIO on the PV
StorageClass Manifest
The following StorageClass defines how storage is provisioned using ZFS. It specifies the ZFS pool to use (zfspv-pool) and performance-related settings.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: zfs-local
parameters:
recordsize: "128k"
compression: "off"
dedup: "off"
fstype: "zfs"
poolname: "zfspv-pool"
provisioner: zfs.csi.openebs.io
allowedTopologies:
- matchLabelExpressions:
- key: node.kubernetes.io/instance-type
values:
- Standard_HB120-16rs_v2
- Standard_HB120-32rs_v2
- Standard_HB120-64rs_v2
- Standard_HB120-96rs_v2
- Standard_HB120rs_v2
- Standard_HB120-16rs_v3
- Standard_HB120-32rs_v3
- Standard_HB120-64rs_v3
- Standard_HB120-96rs_v3
- Standard_HB120rs_v3
- Standard_HB176-24rs_v4
- Standard_HB176-48rs_v4
- Standard_HB176-96rs_v4
- Standard_HB176-144rs_v4
- Standard_HB176rs_v4
- Standard_HX176-24rs
- Standard_HX176-48rs
- Standard_HX176-96rs
- Standard_HX176-144rs
- Standard_HX176rs
- Standard_NC24ads_A100_v4
- Standard_NC48ads_A100_v4
- Standard_NC96ads_A100_v4
- Standard_ND96isr_H100_v5
- Standard_NC40ads_H100_v5
- Standard_NC80adis_H100_v5
- Standard_NCC40ads_H100_v5
- Standard_ND96isr_H200_v5
- Standard_ND96isr_MI300X_v5
- Standard_ND96amsr_A100_v4
Note: Update the allowedTopologies list with additional SKUs if required in your deployment.
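Assuming the StorageClass is saved as zfs-storageclass.yaml (an illustrative filename), apply it and confirm it is listed:

kubectl apply -f zfs-storageclass.yaml
kubectl get storageclass zfs-local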
PersistentVolumeClaim (PVC) Manifest
This PVC requests a 24Ti volume from the StorageClass defined above.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: zfs-pvc
spec:
storageClassName: zfs-local
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 24Ti
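The PVC can be applied in the same way; once provisioning completes it should report a Bound status against the zfs-local StorageClass (the filename is again an assumption):

kubectl apply -f zfs-pvc.yaml
kubectl get pvc zfs-pvc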
Pod Manifest
This manifest defines a workload to test the storage setup using FIO.
apiVersion: v1
kind: Pod
metadata:
name: zfs-fio
spec:
containers:
- name: zfs-fio
image: ubuntu:22.04
command:
- bash
- -c
- |
apt update
apt install -y fio
fio --name=bandwidth_test --filename=/mnt/data/testfile --rw=write --size=2T --bs=1M --iodepth=32 --direct=1 --numjobs=16 --runtime=60 --time_based --group_reporting
volumeMounts:
- mountPath: /mnt/data
name: zfs-mount
volumes:
- name: zfs-mount
persistentVolumeClaim:
claimName: zfs-pvc
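Assuming the pod manifest is saved as zfs-fio.yaml, launch the benchmark and follow the FIO output directly from the pod logs:

kubectl apply -f zfs-fio.yaml
kubectl logs -f zfs-fio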
Output
Running this example will produce performance metrics, such as:
bandwidth_test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=32
...
fio-3.28
Starting 16 processes
bandwidth_test: Laying out IO file (1 file / 2097152MiB)
bandwidth_test: (groupid=0, jobs=16): err= 0: pid=773: Fri Jan 3 19:53:38 2025
write: IOPS=41.3k, BW=40.3GiB/s (43.3GB/s)(2421GiB/60001msec); 0 zone resets
clat (usec): min=109, max=12469, avg=376.13, stdev=131.82
lat (usec): min=122, max=12483, avg=386.56, stdev=132.08
clat percentiles (usec):
| 1.00th=[ 184], 5.00th=[ 245], 10.00th=[ 269], 20.00th=[ 297],
| 30.00th=[ 314], 40.00th=[ 334], 50.00th=[ 355], 60.00th=[ 379],
| 70.00th=[ 408], 80.00th=[ 449], 90.00th=[ 506], 95.00th=[ 562],
| 99.00th=[ 734], 99.50th=[ 832], 99.90th=[ 1139], 99.95th=[ 1319],
| 99.99th=[ 2769]
bw ( MiB/s): min=31246, max=47698, per=100.00%, avg=41350.37, stdev=295.69, samples=1904
iops : min=31246, max=47698, avg=41350.37, stdev=295.69, samples=1904
lat (usec) : 250=5.86%, 500=83.42%, 750=9.82%, 1000=0.71%
lat (msec) : 2=0.18%, 4=0.01%, 10=0.01%, 20=0.01%
cpu : usr=3.07%, sys=62.76%, ctx=4646160, majf=0, minf=25640
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,2478994,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
WRITE: bw=40.3GiB/s (43.3GB/s), 40.3GiB/s-40.3GiB/s (43.3GB/s-43.3GB/s), io=2421GiB (2599GB), run=60001-60001msec
This demonstrates the impressive bandwidth capabilities achievable with ZFS on Azure ND H100 v5 VMs.