How to Deploy Slurm Workload Manager on Kubernetes with Slinky

Yammbo
· 8 min read
slinky operator hpc workloads gpu acceleration kubernetes operators rdma networking
Slurm is a widely adopted workload manager for high-performance computing (HPC) clusters and a significant portion of AI training environments. It efficiently allocates resources, queues jobs, and manages execution across compute nodes. Traditionally, operating Slurm involved extensive bare-metal management, including imaging nodes, synchronizing packages, and maintaining daemon services. This hands-on overhead can distract from the primary goal of running computational tasks or training models.

Slinky, developed by SchedMD, addresses this challenge by providing a Kubernetes operator that integrates Slurm with Kubernetes. With Slinky, core Slurm components like the controller (slurmctld), worker daemons (slurmd), and login nodes run as Kubernetes pods. This architecture allows Kubernetes to handle the underlying lifecycle management, hardware scheduling, restarts, and rolling upgrades, while Slurm continues to manage job scheduling and user interactions (e.g., sbatch, srun). This tutorial walks through setting up a functional Slinky cluster in a Kubernetes environment with GPU nodes and shared storage, then validating high-speed inter-node communication.

Step 1: Prepare Your Kubernetes Environment

Before deploying Slinky, you need a robust Kubernetes cluster configured for HPC workloads. This involves setting up a virtual private cloud (VPC), specialized node pools, shared storage, and a container registry.

  1. Create a Virtual Private Cloud (VPC): A VPC provides an isolated network environment for your cluster, enhancing security and allowing for private IP communication between resources. Ensure your VPC is in a region that supports managed NFS, which is crucial for shared storage. Consult your cloud provider's documentation for creating a VPC.
  2. Provision a Kubernetes Cluster: Your cluster requires at least two distinct node pools to efficiently separate control plane and worker responsibilities:

    • Management Node Pool: A small pool of CPU-optimized nodes (e.g., 3 instances with 4 vCPUs / 8 GiB RAM) for the Slurm control plane (slurmctld, login pods). These nodes handle administrative tasks and are less resource-intensive.
    • GPU Worker Node Pool: A pool of GPU-accelerated nodes (e.g., 2+ instances of NVIDIA B300 droplets, each with 8 GPUs and 16 fabric NICs). These nodes will execute the actual computational jobs. Managed Kubernetes services typically taint GPU pools with nvidia.com/gpu:NoSchedule and label them (e.g., doks.digitalocean.com/gpu-brand=nvidia on DigitalOcean Kubernetes) so that only pods that tolerate the taint are scheduled on these expensive resources.

    Refer to your cloud provider's guide for creating a Kubernetes cluster with multiple node pools.

  3. Set Up Managed NFS: Shared storage is essential for Slurm, allowing login nodes and worker pods to access the same job scripts, input data, and output files (the /shared directory). A managed NFS service provides a reliable, scalable, and easy-to-configure ReadWriteMany persistent volume. Note the Mount Source provided by your NFS service, as you will need it for configuring the PersistentVolume. Learn how to create managed NFS.
  4. Configure a Container Registry: Custom slurmd and login images are often required to include specific drivers, libraries, or tools necessary for your HPC workloads. A container registry (e.g., DigitalOcean Container Registry) provides a secure place to store and retrieve these images. If your Kubernetes cluster integrates with the registry, pull credentials can be automatically managed. See how to create a container registry.
  5. Optimize NFS Performance (Optional but Recommended): For high-throughput applications on GPU nodes, NFS tuning has a large effect on performance. B300 nodes support jumbo frames (MTU 9000), but a pod that mounts NFS before the network interface has been tuned may connect at the default MTU of 1500, which limits throughput for the lifetime of that mount. Detailed tuning is beyond this introductory guide, but confirm the interface MTU before any pod mounts the share.
  6. Create the Slurm Namespace: Once your cluster is operational, create a dedicated namespace for your Slurm deployment to ensure resource isolation and easier management:

    kubectl create namespace slurm
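
The jumbo-frame concern from the NFS tuning item can be checked before any pod mounts the share. The following is a minimal sketch, assuming it runs on a GPU node or in a debug pod with host networking; `check_mtu` is a hypothetical helper name, and the interface name you pass depends on your nodes.

```shell
#!/usr/bin/env bash
# Sketch: report whether a network interface is configured for jumbo frames.
# check_mtu is a hypothetical helper. The optional second argument points it
# at a different sysfs root (handy for testing); it defaults to /sys/class/net.
check_mtu() {
  local iface="$1"
  local sysfs_root="${2:-/sys/class/net}"
  local want=9000
  local mtu_file="${sysfs_root}/${iface}/mtu"

  if [ ! -r "$mtu_file" ]; then
    echo "unknown: cannot read ${mtu_file}"
    return 2
  fi

  local mtu
  mtu="$(cat "$mtu_file")"
  if [ "$mtu" -ge "$want" ]; then
    echo "ok: ${iface} mtu=${mtu}"
  else
    echo "low: ${iface} mtu=${mtu} (expected >= ${want})"
    return 1
  fi
}
```

Calling `check_mtu eth0` (the interface name is an assumption) returns a non-zero status when the MTU is below 9000, so it can gate a startup script.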

Step 2: Prepare Custom Slurm Container Images

While Slinky can use upstream Slurm images, many HPC environments require custom slurmd and login images. These customizations might include specific NVIDIA drivers, RDMA libraries, monitoring agents, or specialized client tools that are not part of the base images. The process generally involves:

  1. Starting from Upstream Images: Begin with the official or a trusted base Slurm image.
  2. Adding Necessary Components: Use a Dockerfile to install any required software, libraries, or configurations specific to your environment and GPU hardware. This often includes installing NVIDIA CUDA toolkit components, RDMA user-space libraries, and any other dependencies for your workload.
  3. Building the Images: Use containerization tools like Docker or Podman to build your custom images.
  4. Tagging and Pushing: Tag your newly built images with appropriate versions and push them to your configured container registry. For example: your-registry.com/slurm/slurmd:custom-gpu and your-registry.com/slurm/login:custom-gpu.

This ensures that your Slurm worker and login pods have all the necessary prerequisites to interact with the underlying GPU hardware and high-speed network fabric.
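
The build, tag, and push steps above can be sketched as a small script. This is an illustration under assumptions: the `images/<component>` directory layout, the `BASE_IMAGE` build argument (your Dockerfile would need a matching `ARG BASE_IMAGE` and `FROM ${BASE_IMAGE}`), and the helper names `image_ref` and `build_and_push` are all hypothetical. The registry path and tag reuse the example names from the list.

```shell
#!/usr/bin/env bash
# Sketch: build, tag, and push the custom slurmd and login images.
# REGISTRY and TAG mirror the example names in the text. BASE_IMAGE is a
# placeholder that you must set to the upstream image you trust.
REGISTRY="${REGISTRY:-your-registry.com/slurm}"
TAG="${TAG:-custom-gpu}"

# Compose a full image reference, e.g. your-registry.com/slurm/slurmd:custom-gpu
image_ref() {
  local component="$1"
  echo "${REGISTRY}/${component}:${TAG}"
}

# Build one component from images/<component>/Dockerfile (assumed layout),
# then push it only if the build succeeded.
build_and_push() {
  local component="$1"
  local ref
  ref="$(image_ref "$component")"
  docker build \
    --build-arg "BASE_IMAGE=${BASE_IMAGE:?set BASE_IMAGE to the upstream image}" \
    -t "$ref" "images/${component}" \
    && docker push "$ref"
}

# Usage (not executed here):
#   BASE_IMAGE=<upstream slurmd image> build_and_push slurmd
#   BASE_IMAGE=<upstream login image>  build_and_push login
```

Keeping the reference in one function means the same string can be reused for the image fields in values.yaml.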

Step 3: Deploy the Slinky Operator Using Helm

Helm is the package manager for Kubernetes and simplifies the deployment of complex applications like Slinky. You'll use it to install the Slinky operator and configure its components.

  1. Add the SchedMD Helm Repository: First, add the official SchedMD Helm chart repository:

    helm repo add slurm-operator https://schedmd.github.io/slurm-operator/helm/charts/slurm-operator/
  2. Update Helm Repositories: Ensure your local Helm repository list is up-to-date:

    helm repo update
  3. Configure Deployment Values: Create a values.yaml file to customize your Slinky deployment. Key configurations include:

    • NFS Persistent Volume: Define a PersistentVolume (PV) and PersistentVolumeClaim (PVC) that point to your managed NFS Mount Source. This ensures the /shared directory is accessible to all Slurm pods.
    • Custom Image References: Specify the paths to your custom slurmd and login images in your container registry.
    • Node Selectors and Tolerations: Configure Slurm components to schedule on the correct node pools. For instance, slurmctld and login pods on the CPU management pool, and slurmd pods on the GPU worker pool, using Kubernetes node selectors and tolerations.
    • Multus NetworkAttachmentDefinition for RDMA: This is crucial for high-performance inter-node communication. Multus CNI allows pods to attach to multiple network interfaces. You'll define NetworkAttachmentDefinition resources to expose the RDMA-capable fabric NICs on your GPU nodes. The Slurm chart then requests these interfaces (e.g., rdma/fabricN) for the slurmd pods.
    • Resource Requests: Specify GPU resources (e.g., nvidia.com/gpu: 8) and RDMA fabric NICs for your slurmd pods.
    • Optional Features: While we'll keep the install minimal, the chart supports enabling accounting (slurmdbd) and Prometheus metrics if needed.

    Here's a conceptual snippet of what your values.yaml might contain:

    slurmctld:
      nodeSelector:
        doks.digitalocean.com/node-pool: mgmt
      tolerations:
        - key: "doks.digitalocean.com/node-pool"
          operator: "Equal"
          value: "mgmt"
          effect: "NoSchedule"

    slurmd:
      nodeSelector:
        doks.digitalocean.com/node-pool: gpu
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      resources:
        requests:
          nvidia.com/gpu: 8
          rdma/fabric0: 1 # Example for Multus RDMA network
      image:
        repository: your-registry.com/slurm/slurmd
        tag: custom-gpu
        pullPolicy: Always

    login:
      nodeSelector:
        doks.digitalocean.com/node-pool: mgmt
      tolerations:
        - key: "doks.digitalocean.com/node-pool"
          operator: "Equal"
          value: "mgmt"
          effect: "NoSchedule"
      image:
        repository: your-registry.com/slurm/login
        tag: custom-gpu
        pullPolicy: Always

    persistentVolume:
      enabled: true
      nfs:
        server: "your-nfs-mount-source"
        path: "/"

    networkAttachmentDefinitions:
      - name: rdma-fabric0
        # Adjust "master" to your node's RDMA interface. The config body is JSON,
        # so comments must stay outside it. Choose an IPAM range that does not
        # overlap your pod or service CIDRs.
        config: |
          {
            "cniVersion": "0.3.1",
            "type": "macvlan",
            "master": "eth1",
            "mode": "bridge",
            "ipam": {
              "type": "whereabouts",
              "range": "10.244.0.0/16"
            }
          }

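  4. Install Slinky: Deploy Slinky into the slurm namespace using your configured values.yaml. Slinky publishes the operator and the Slurm cluster as separate charts, so check the chart name and repository location against the Slinky documentation for the release you install: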

    helm install slinky slurm-operator/slurm-operator -n slurm -f values.yaml
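
If the chart you use does not create the NFS volume for you, the PersistentVolume and claim described in the configuration step can be applied as a standalone manifest before the install command. The sketch below writes one to a file. The object name `slurm-shared`, the `1Ti` size, and the helper name `write_nfs_manifest` are placeholders, and it assumes your Mount Source splits into a server address and an export path.

```shell
#!/usr/bin/env bash
# Sketch: write a static NFS PersistentVolume and matching claim to a file.
# write_nfs_manifest is a hypothetical helper; names and sizes are placeholders.
write_nfs_manifest() {
  local server="$1" export_path="$2" out="${3:-nfs-shared.yaml}"
  cat > "$out" <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: slurm-shared
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: ${server}
    path: ${export_path}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: slurm-shared
  namespace: slurm
spec:
  accessModes:
    - ReadWriteMany
  # An empty storage class plus volumeName binds the claim to the PV above
  # instead of triggering dynamic provisioning.
  storageClassName: ""
  volumeName: slurm-shared
  resources:
    requests:
      storage: 1Ti
EOF
}

# Usage (not executed here):
#   write_nfs_manifest <nfs-server-address> <export-path>
#   kubectl apply -f nfs-shared.yaml
```

ReadWriteMany is what lets the login pods and every slurmd pod mount /shared at the same time.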

Step 4: Validate GPU and RDMA Fabric Connectivity

After deployment, it's crucial to validate that your GPUs are accessible and, more importantly, that the RDMA fabric is correctly configured for high-speed, multi-node communication. Simply scheduling pods isn't enough; you need to confirm that collective operations can leverage the full bandwidth of your interconnect.

The NVIDIA Collective Communications Library (NCCL) provides a set of highly optimized routines for inter-GPU communication. Running an NCCL all-reduce benchmark across multiple nodes is the definitive test for validating multi-node GPU communication over an RDMA-enabled network.
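
When reading benchmark results, it helps to know that nccl-tests reports two figures: algorithm bandwidth (data size divided by time) and bus bandwidth, which for all-reduce is the algorithm bandwidth scaled by 2(n-1)/n for n ranks. Bus bandwidth is the number to compare against the hardware's link speed. A minimal sketch of that conversion follows; `allreduce_busbw` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: convert all-reduce algorithm bandwidth into bus bandwidth.
# allreduce_busbw is a hypothetical helper. Arguments: algbw in GB/s and the
# number of ranks (GPUs) taking part in the collective.
allreduce_busbw() {
  local algbw="$1" ranks="$2"
  awk -v a="$algbw" -v n="$ranks" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}
```

For two nodes with 8 GPUs each, `allreduce_busbw 100 16` prints 187.50, since the factor is 2 × 15 / 16 = 1.875.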

  1. Access a Slurm Login Node: Use kubectl exec to get a shell on one of your Slurm login pods:

    LOGIN_POD=$(kubectl get pods -n slurm -l app.kubernetes.io/component=login -o jsonpath='{.items[0].metadata.name}')
    kubectl exec -it -n slurm "$LOGIN_POD" -- bash

    From within the login node, you can submit Slurm jobs.

  2. Submit an NCCL All-Reduce Job: Create a simple Slurm batch script (e.g., nccl_test.sh) that runs an NCCL benchmark. This script should request multiple nodes and GPUs, then execute an NCCL test that performs an all-reduce operation. Ensure your custom slurmd image includes the NCCL tests, built with MPI support so that the tasks on different nodes join a single collective.

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --gpus-per-node=8
    #SBATCH --exclusive
    #SBATCH --time=00:10:00
    #SBATCH --job-name=nccl-all-reduce
    #SBATCH --output=nccl-%j.out

    echo "Running NCCL all-reduce test..."

    # Verify that each node sees its GPUs.
    srun /usr/local/cuda/samples/deviceQuery/deviceQuery

    # One task per node, 8 GPUs per task. For the two tasks to form a single
    # 16-GPU job, nccl-tests must be built with MPI support (MPI=1); otherwise
    # each node runs an independent single-node test.
    srun /usr/local/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 8

    Submit the job:

    sbatch nccl_test.sh

  3. Monitor and Verify Output: Check the job output (e.g., nccl-<jobid>.out). Look for successful NCCL initialization across all requested GPUs and nodes, and observe the reported bandwidths. High bandwidth numbers (approaching the theoretical maximums of your RDMA fabric) confirm that multi-node GPU communication over the mlx5_* interfaces is working correctly.
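
Reading the bandwidth figure by eye works for a one-off run, but the check can also be scripted. The sketch below assumes the job output contains the `# Avg bus bandwidth` summary line that nccl-tests prints at the end of a run; `check_busbw` is a hypothetical helper, and the minimum you pass is your own target for the fabric, not a value from this guide.

```shell
#!/usr/bin/env bash
# Sketch: pull the average bus bandwidth out of an nccl-tests log and compare
# it with a minimum you choose. check_busbw is a hypothetical helper.
check_busbw() {
  local log="$1" min_gbps="$2"
  local avg
  # Take the text after the colon on the first summary line and strip spaces.
  avg="$(awk -F: '/Avg bus bandwidth/ { gsub(/[[:space:]]/, "", $2); print $2; exit }' "$log")"

  if [ -z "$avg" ]; then
    echo "no summary line found in ${log}"
    return 2
  fi

  # awk handles the floating-point comparison that [ ] cannot do.
  if awk -v a="$avg" -v m="$min_gbps" 'BEGIN { exit !(a >= m) }'; then
    echo "pass: avg busbw ${avg} GB/s >= ${min_gbps} GB/s"
  else
    echo "fail: avg busbw ${avg} GB/s < ${min_gbps} GB/s"
    return 1
  fi
}
```

On the login node this would be called as `check_busbw nccl-<jobid>.out <minimum-GB/s>`. A result far below the fabric's capability usually means NCCL fell back to a slower transport, which the initialization lines in the same file will show.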

Conclusion

You have successfully deployed a Slurm cluster managed by the Slinky operator on Kubernetes, complete with GPU acceleration, shared NFS storage, and validated high-speed RDMA networking. This setup provides a robust and scalable environment for running HPC and AI workloads, leveraging the operational benefits of Kubernetes while retaining Slurm's powerful job scheduling capabilities.