Azure High Performance Computing (HPC) Blog

Deploying ZFS Scratch Storage for NVMe on Azure Kubernetes Service (AKS)

Jan 08, 2025

Paul Edwards - Principal Technical Program Manager - Azure Core HPC & AI

Dr. Wolfgang De Salvador - Senior Product Manager - Azure Storage

Introduction

This guide demonstrates how to use ZFS LocalPV to efficiently manage the NVMe storage available on several Azure HPC/AI VM types in the H-series and N-series families. For example, Azure NDv5 H100 VMs are equipped with eight 3.5TB NVMe disks and are tailored for high-performance workloads like AI/ML and large-scale data processing. By combining the flexibility of AKS with the advanced storage capabilities of ZFS, you can dynamically provision stateful node-local volumes while aggregating NVMe disks for optimal performance.

We'll walk through the process of setting up ZFS, from installing the kernel module to creating a ZFS pool using the NVMe disks. You'll learn how to provision persistent volumes (PVs) with ZFS LocalPV and validate the setup by running the FIO benchmark to measure its performance.

Efficiently leveraging the NVMe disks can significantly enhance workloads requiring high storage throughput and low latency. ZFS's unique features, such as pooling and data integrity, make it an excellent choice for managing NVMe storage in Kubernetes environments.

ZFS LocalPV enables the dynamic provisioning of persistent node-local volumes and filesystems within Kubernetes, integrated with the ZFS data storage stack. While ZFS brings additional features like compression and data integrity, its primary advantage in this setup is its ability to aggregate NVMe disks into a unified pool for seamless integration with Kubernetes persistent volumes (PVs).

Comparing Options for Utilizing Local NVMe Storage

The following table summarizes four approaches for managing NVMe disks in Kubernetes, focusing on their ability to aggregate disks and integrate with Kubernetes persistent volumes (PVs):

| Name | Aggregate Disks | Managed by PVs | Comments |
| --- | --- | --- | --- |
| Host Path Mount | Yes | No | Offers direct disk control, but lacks integration with Kubernetes and pod isolation. |
| Azure Container Storage | No | Yes | Allows slicing of NVMe disks into multiple PVs but doesn’t support aggregation. |
| Local Provisioner | No | Yes | Maps each NVMe disk to a PV with a simple configuration. |
| ZFS LocalPV | Yes | Yes | Supports disk aggregation and Kubernetes PVs but requires installing the ZFS kernel module. |

Among these options, ZFS LocalPV stands out as the only solution offering both disk aggregation and PV management. However, its setup requires installing the ZFS kernel module and configuring ZFS on the host, which can be a barrier in environments with strict kernel module policies.

Installation

To use ZFS with local NVMe disks on AKS, you need to install ZFS on the host node and set up a ZFS pool. This involves installing kernel modules and managing disk partitioning directly on the host. Before proceeding, ensure that your environment allows kernel module installation and that you have the necessary administrative permissions to perform these operations.

Prerequisites and Considerations

  • Host Packages: The installation process will modify the host by installing required packages, including:
    • gdisk: For disk partitioning and management.
    • zfsutils-linux: For ZFS user-space utilities.
    • kmod: For managing kernel modules, including loading the ZFS module.
  • Security Context: This setup requires administrative privileges because it involves modifying the host's file system, managing kernel modules, and partitioning disks. The DaemonSet uses a privileged security context to ensure it has the necessary permissions to interact with the host.
  • Kernel Compatibility: Ensure that your host's kernel version is compatible with ZFS. This setup has been tested with the latest Ubuntu-based AKS node images (a quick verification sketch follows this list).
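
Before deploying anything, you can confirm which nodes will be targeted and which kernel they run. The commands below are a minimal sketch: the node name is a placeholder, and the kubectl debug node approach assumes your cluster allows privileged ephemeral debug pods (the host filesystem is mounted under /host inside the debug pod).

# List nodes together with their VM SKU to identify eligible HPC/AI instance types
kubectl get nodes -L node.kubernetes.io/instance-type

# Inspect the kernel version on a specific node (replace <node-name>)
kubectl debug node/<node-name> -it --image=ubuntu:22.04 -- chroot /host uname -r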

Installing ZFS and Setting Up a ZFS Pool

The following DaemonSet can be deployed to prepare AKS nodes for ZFS. Please be aware that the nodeAffinity specification currently covers many of the Azure HPC/AI SKUs with local NVMe disks, but it may need to be extended for your specific use case.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: zfs-host-setup
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: zfs-host-setup
  template:
    metadata:
      labels:
        name: zfs-host-setup
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - Standard_HB120-16rs_v2
                - Standard_HB120-32rs_v2
                - Standard_HB120-64rs_v2
                - Standard_HB120-96rs_v2
                - Standard_HB120rs_v2
                - Standard_HB120-16rs_v3
                - Standard_HB120-32rs_v3
                - Standard_HB120-64rs_v3
                - Standard_HB120-96rs_v3
                - Standard_HB120rs_v3
                - Standard_HB176-24rs_v4
                - Standard_HB176-48rs_v4
                - Standard_HB176-96rs_v4
                - Standard_HB176-144rs_v4
                - Standard_HB176rs_v4
                - Standard_HX176-24rs
                - Standard_HX176-48rs
                - Standard_HX176-96rs
                - Standard_HX176-144rs
                - Standard_HX176rs
                - Standard_NC24ads_A100_v4
                - Standard_NC48ads_A100_v4
                - Standard_NC96ads_A100_v4
                - Standard_ND96isr_H100_v5
                - Standard_NC40ads_H100_v5
                - Standard_NC80adis_H100_v5
                - Standard_NCC40ads_H100_v5
                - Standard_ND96isr_H200_v5
                - Standard_ND96isr_MI300X_v5
                - Standard_ND96amsr_A100_v4
      hostNetwork: true
      hostPID: true
      containers:
      - name: zfs-host-setup
        image: ubuntu:22.04
        securityContext:
          privileged: true
        volumeMounts:
        - name: host-root
          mountPath: /host
        command:
          - bash
          - -c
          - |
            set -euo pipefail
            # Determine the highest NVMe namespace index (disks are /dev/nvme0n1 .. /dev/nvme<N>n1)
            export NUMBER_OF_NVME=$(($(ls /dev/nvme*n1 | wc -l) - 1))
            echo "Starting ZFS setup on node $(hostname)..."

            echo "Updating package repository on the host..."
            chroot /host apt update

            echo "Installing required packages on the host..."
            chroot /host apt install -y gdisk zfsutils-linux kmod

            echo "Checking if ZFS pool 'zfspv-pool' already exists on the host..."
            if chroot /host zpool list | grep -q zfspv-pool; then
              echo "ZFS pool 'zfspv-pool' already exists. Skipping disk setup."
            else
              echo "ZFS pool does not exist. Proceeding with disk setup on the host..."
              for disk in $(eval echo "/dev/nvme{0..$NUMBER_OF_NVME}n1"); do
                echo "Preparing disk $disk..."
                if [ -e $disk ]; then
                  chroot /host sgdisk --zap-all $disk
                  chroot /host sgdisk --new=1:0:0 $disk
                else
                  echo "Disk $disk not found. Skipping."
                fi
              done

              echo "Loading ZFS kernel module on the host..."
              chroot /host modprobe zfs

              echo "Creating ZFS pool 'zfspv-pool' on the host..."
              chroot /host zpool create zfspv-pool $(eval echo "/dev/nvme{0..$NUMBER_OF_NVME}n1") || echo "Failed to create ZFS pool."
            fi

            echo "ZFS setup complete. Keeping pod running..."
            sleep inf
      volumes:
      - name: host-root
        hostPath:
          path: /
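
Assuming the manifest above is saved to a file (zfs-host-setup.yaml is an illustrative name), you can deploy it and confirm that the pool was created on each targeted node. The pod name in the last command is a placeholder.

kubectl apply -f zfs-host-setup.yaml
kubectl -n kube-system get pods -l name=zfs-host-setup

# Once a setup pod is Running, check the pool directly on the host
kubectl -n kube-system exec <zfs-host-setup-pod> -- chroot /host zpool status zfspv-pool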

Installing zfs-localpv

The zfs-localpv driver can be installed using Helm:

helm install zfs-localpv https://openebs.github.io/zfs-localpv/zfs-localpv-2.6.2.tgz \
 -n openebs --create-namespace --skip-crds
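
A quick way to confirm the installation is to check that the zfs-localpv controller and node components are running and that the CSI driver is registered (exact pod names may differ between chart versions):

kubectl get pods -n openebs
# zfs.csi.openebs.io should appear in the list of registered CSI drivers
kubectl get csidrivers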

Example Usage

We will validate the ZFS setup with a simple I/O benchmark. The tool we will use is called FIO (Flexible I/O Tester). Running the example involves the following steps:

  • Create the StorageClass
  • Create the PersistentVolumeClaim (PVC)
  • Create a pod to run FIO on the PV

StorageClass Manifest

The following StorageClass defines how storage is provisioned using ZFS. It specifies the ZFS pool to use (zfspv-pool) and performance-related settings.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-local
parameters:
  recordsize: "128k"
  compression: "off"
  dedup: "off"
  fstype: "zfs"
  poolname: "zfspv-pool"
provisioner: zfs.csi.openebs.io
allowedTopologies:
- matchLabelExpressions:
  - key: node.kubernetes.io/instance-type
    values:
    - Standard_HB120-16rs_v2
    - Standard_HB120-32rs_v2
    - Standard_HB120-64rs_v2
    - Standard_HB120-96rs_v2
    - Standard_HB120rs_v2
    - Standard_HB120-16rs_v3
    - Standard_HB120-32rs_v3
    - Standard_HB120-64rs_v3
    - Standard_HB120-96rs_v3
    - Standard_HB120rs_v3
    - Standard_HB176-24rs_v4
    - Standard_HB176-48rs_v4
    - Standard_HB176-96rs_v4
    - Standard_HB176-144rs_v4
    - Standard_HB176rs_v4
    - Standard_HX176-24rs
    - Standard_HX176-48rs
    - Standard_HX176-96rs
    - Standard_HX176-144rs
    - Standard_HX176rs
    - Standard_NC24ads_A100_v4
    - Standard_NC48ads_A100_v4
    - Standard_NC96ads_A100_v4
    - Standard_ND96isr_H100_v5
    - Standard_NC40ads_H100_v5
    - Standard_NC80adis_H100_v5
    - Standard_NCC40ads_H100_v5
    - Standard_ND96isr_H200_v5
    - Standard_ND96isr_MI300X_v5
    - Standard_ND96amsr_A100_v4

Note: Update the allowedTopologies list with additional SKUs if required in your deployment.
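
To create the StorageClass, save the manifest to a file (the filename below is illustrative) and apply it:

kubectl apply -f zfs-storageclass.yaml
kubectl get storageclass zfs-local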

PersistentVolumeClaim (PVC) Manifest

This PVC requests a 24Ti volume from the StorageClass defined above.

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: zfs-pvc
spec:
  storageClassName: zfs-local
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 24Ti
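
Apply the PVC and check its status; with the StorageClass default Immediate volume binding mode it should report Bound once the provisioner has placed it on an eligible node (again, the filename is illustrative):

kubectl apply -f zfs-pvc.yaml
kubectl get pvc zfs-pvc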

Pod Manifest

This manifest defines a workload to test the storage setup using FIO.

apiVersion: v1
kind: Pod
metadata:
  name: zfs-fio
spec:
  containers:
  - name: zfs-fio
    image: ubuntu:22.04
    command:
      - bash
      - -c
      - |
        apt update
        apt install -y fio
        fio --name=bandwidth_test --filename=/mnt/data/testfile --rw=write --size=2T --bs=1M --iodepth=32 --direct=1 --numjobs=16 --runtime=60 --time_based --group_reporting
    volumeMounts:
    - mountPath: /mnt/data
      name: zfs-mount
  volumes:
  - name: zfs-mount
    persistentVolumeClaim:
      claimName: zfs-pvc
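
Finally, launch the benchmark pod and follow its output (filename illustrative):

kubectl apply -f zfs-fio.yaml
kubectl logs -f zfs-fio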

Output

Running this example will produce performance metrics, such as:

bandwidth_test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=32
...
fio-3.28
Starting 16 processes
bandwidth_test: Laying out IO file (1 file / 2097152MiB)

bandwidth_test: (groupid=0, jobs=16): err= 0: pid=773: Fri Jan  3 19:53:38 2025
  write: IOPS=41.3k, BW=40.3GiB/s (43.3GB/s)(2421GiB/60001msec); 0 zone resets
    clat (usec): min=109, max=12469, avg=376.13, stdev=131.82
     lat (usec): min=122, max=12483, avg=386.56, stdev=132.08
    clat percentiles (usec):
     |  1.00th=[  184],  5.00th=[  245], 10.00th=[  269], 20.00th=[  297],
     | 30.00th=[  314], 40.00th=[  334], 50.00th=[  355], 60.00th=[  379],
     | 70.00th=[  408], 80.00th=[  449], 90.00th=[  506], 95.00th=[  562],
     | 99.00th=[  734], 99.50th=[  832], 99.90th=[ 1139], 99.95th=[ 1319],
     | 99.99th=[ 2769]
   bw (  MiB/s): min=31246, max=47698, per=100.00%, avg=41350.37, stdev=295.69, samples=1904
   iops        : min=31246, max=47698, avg=41350.37, stdev=295.69, samples=1904
  lat (usec)   : 250=5.86%, 500=83.42%, 750=9.82%, 1000=0.71%
  lat (msec)   : 2=0.18%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=3.07%, sys=62.76%, ctx=4646160, majf=0, minf=25640
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2478994,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=40.3GiB/s (43.3GB/s), 40.3GiB/s-40.3GiB/s (43.3GB/s-43.3GB/s), io=2421GiB (2599GB), run=60001-60001msec

This run sustains roughly 40 GiB/s of sequential write bandwidth, demonstrating the impressive throughput achievable with ZFS on Azure NDv5 H100 VMs.
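
When you are finished, the test resources can be removed with kubectl. Note that this only deletes the Kubernetes objects; the zfspv-pool created by the setup DaemonSet remains on the host.

kubectl delete pod zfs-fio
kubectl delete pvc zfs-pvc
kubectl delete storageclass zfs-local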
