How to Deploy Slurm Workload Manager on Kubernetes with Slinky
Slurm is a widely adopted workload manager for high-performance computing (HPC) clusters and a significant portion of AI training environments. It efficiently allocates resources, queues jobs, and manages execution across compute nodes. Traditionally, operating Slurm involved extensive bare-metal management, including imaging nodes, synchronizing packages, and maintaining daemon services. This hands-on overhead can distract from the primary goal of running computational tasks or training models.
Slinky, developed by SchedMD, addresses this challenge by providing a Kubernetes operator that integrates Slurm with Kubernetes. With Slinky, core Slurm components like the controller (slurmctld), worker daemons (slurmd), and login nodes run as Kubernetes pods. This architecture allows Kubernetes to handle the underlying lifecycle management, hardware scheduling, restarts, and rolling upgrades, while Slurm continues to manage job scheduling and user interactions (e.g., sbatch, srun). This tutorial guides you through setting up a functional Slinky cluster on a Kubernetes environment with GPU nodes, shared storage, and validating high-speed inter-node communication.
Step 1: Prepare Your Kubernetes Environment
Before deploying Slinky, you need a robust Kubernetes cluster configured for HPC workloads. This involves setting up a virtual private cloud (VPC), specialized node pools, shared storage, and a container registry.
- Create a Virtual Private Cloud (VPC): A VPC provides an isolated network environment for your cluster, enhancing security and allowing for private IP communication between resources. Ensure your VPC is in a region that supports managed NFS, which is crucial for shared storage. Consult your cloud provider's documentation for creating a VPC.
- Provision a Kubernetes Cluster: Your cluster requires at least two distinct node pools to efficiently separate control plane and worker responsibilities:
  - Management Node Pool: A small pool of CPU-optimized nodes (e.g., 3 instances with 4 vCPUs / 8 GiB RAM) for the Slurm control plane (slurmctld, login pods). These nodes handle administrative tasks and are less resource-intensive.
  - GPU Worker Node Pool: A pool of GPU-accelerated nodes (e.g., 2+ instances of NVIDIA B300 droplets, each with 8 GPUs and 16 fabric NICs). These nodes will execute the actual computational jobs. Kubernetes typically taints GPU pools with nvidia.com/gpu:NoSchedule and labels them (e.g., doks.digitalocean.com/gpu-brand=nvidia) to ensure only GPU-aware pods are scheduled on these expensive resources.
  Refer to your cloud provider's guide for creating a Kubernetes cluster with multiple node pools.
- Set Up Managed NFS: Shared storage is essential for Slurm, allowing login nodes and worker pods to access the same job scripts, input data, and output files (the /shared directory). A managed NFS service provides a reliable, scalable, and easy-to-configure ReadWriteMany persistent volume. Note the Mount Source provided by your NFS service, as you will need it for configuring the PersistentVolume. See your provider's documentation on creating managed NFS.
- Configure a Container Registry: Custom slurmd and login images are often required to include specific drivers, libraries, or tools necessary for your HPC workloads. A container registry (e.g., DigitalOcean Container Registry) provides a secure place to store and retrieve these images. If your Kubernetes cluster integrates with the registry, pull credentials can be managed automatically. See your provider's documentation on creating a container registry.
- Optimize NFS Performance (Optional but Recommended): For high-throughput applications on GPU nodes, NFS performance matters. B300 nodes support jumbo frames (MTU 9000), but a pod that mounts NFS before the network interface is tuned may negotiate the default MTU of 1500, which limits throughput for the lifetime of that mount. Detailed tuning is beyond this introductory guide, but plan for it before running throughput-sensitive workloads.
- Create the Slurm Namespace: Once your cluster is operational, create a dedicated namespace for your Slurm deployment to ensure resource isolation and easier management:
kubectl create namespace slurm
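With the namespace in place, the NFS Mount Source noted earlier can be turned into a PersistentVolume and PersistentVolumeClaim. The following is a minimal sketch, not the chart's own mechanism: it assumes the Mount Source has the form host:/path, and the address, export path, resource names (slurm-shared-pv, slurm-shared-pvc), and the 1Ti capacity are all hypothetical placeholders.

```shell
# Hypothetical Mount Source; replace with the value from your NFS service.
# Assumption: the Mount Source looks like "host:/path".
MOUNT_SOURCE="${MOUNT_SOURCE:-10.0.0.5:/export/share}"
NFS_SERVER="${MOUNT_SOURCE%%:*}"   # text before the first colon
NFS_PATH="${MOUNT_SOURCE#*:}"      # text after the first colon

# Write a statically bound ReadWriteMany PV/PVC pair for the /shared directory.
cat > slurm-shared-nfs.yaml <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: slurm-shared-pv            # hypothetical name
spec:
  capacity:
    storage: 1Ti                   # placeholder size
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""             # empty class: bind statically, no provisioner
  nfs:
    server: ${NFS_SERVER}
    path: ${NFS_PATH}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: slurm-shared-pvc           # hypothetical name
  namespace: slurm
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: slurm-shared-pv
  resources:
    requests:
      storage: 1Ti
EOF

echo "Wrote slurm-shared-nfs.yaml for ${NFS_SERVER}:${NFS_PATH}"
```

Once the values look right, the manifest can be applied with kubectl apply -f slurm-shared-nfs.yaml.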
Step 2: Prepare Custom Slurm Container Images
While Slinky can use upstream Slurm images, many HPC environments require custom slurmd and login images. These customizations might include specific NVIDIA drivers, RDMA libraries, monitoring agents, or specialized client tools that are not part of the base images. The process generally involves:
- Starting from Upstream Images: Begin with the official or a trusted base Slurm image.
- Adding Necessary Components: Use a Dockerfile to install any required software, libraries, or configurations specific to your environment and GPU hardware. This often includes installing NVIDIA CUDA toolkit components, RDMA user-space libraries, and any other dependencies for your workload.
- Building the Images: Use containerization tools like Docker or Podman to build your custom images.
- Tagging and Pushing: Tag your newly built images with appropriate versions and push them to your configured container registry. For example: your-registry.com/slurm/slurmd:custom-gpu and your-registry.com/slurm/login:custom-gpu.
This ensures that your Slurm worker and login pods have all the necessary prerequisites to interact with the underlying GPU hardware and high-speed network fabric.
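The build steps above might be sketched as follows. This is an illustration under assumptions, not a recipe from SchedMD: the base image reference is a placeholder you must supply, the base is assumed to be Debian or Ubuntu based and to run as root (so apt-get works), and the package list (rdma-core, ibverbs-utils, perftest) is just one plausible set of RDMA user-space tools. The build_and_push helper is a hypothetical name.

```shell
# Placeholders: point BASE_IMAGE at the upstream slurmd image you trust.
BASE_IMAGE="${BASE_IMAGE:-upstream-slurmd-image:tag}"
REGISTRY="your-registry.com/slurm"
TAG="custom-gpu"

# Write a small Dockerfile that layers RDMA user-space tools on the base image.
# The quoted heredoc keeps ${BASE_IMAGE} literal so Docker resolves it at build time.
cat > Dockerfile.slurmd <<'EOF'
ARG BASE_IMAGE
FROM ${BASE_IMAGE}
# Assumes an apt-based image running as root.
RUN apt-get update \
 && apt-get install -y --no-install-recommends rdma-core ibverbs-utils perftest \
 && rm -rf /var/lib/apt/lists/*
EOF

# Hypothetical helper: build the image and push it only if the build succeeds.
build_and_push() {
  docker build -f Dockerfile.slurmd \
    --build-arg BASE_IMAGE="$BASE_IMAGE" \
    -t "$REGISTRY/slurmd:$TAG" . \
  && docker push "$REGISTRY/slurmd:$TAG"
}
```

Running build_and_push after logging in to the registry builds and pushes the slurmd image; the login image would follow the same pattern with its own Dockerfile.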
Step 3: Deploy the Slinky Operator Using Helm
Helm is the package manager for Kubernetes and simplifies the deployment of complex applications like Slinky. You'll use it to install the Slinky operator and configure its components.
- Add the SchedMD Helm Repository: First, add the official SchedMD Helm chart repository:
helm repo add slurm-operator https://schedmd.github.io/slurm-operator/helm/charts/slurm-operator/
- Update Helm Repositories: Ensure your local Helm repository list is up-to-date:
helm repo update
- Configure Deployment Values: Create a values.yaml file to customize your Slinky deployment. Key configurations include:
  - NFS Persistent Volume: Define a PersistentVolume (PV) and PersistentVolumeClaim (PVC) that point to your managed NFS Mount Source. This ensures the /shared directory is accessible to all Slurm pods.
  - Custom Image References: Specify the paths to your custom slurmd and login images in your container registry.
  - Node Selectors and Tolerations: Configure Slurm components to schedule on the correct node pools. For instance, place slurmctld and login pods on the CPU management pool and slurmd pods on the GPU worker pool, using Kubernetes node selectors and tolerations.
  - Multus NetworkAttachmentDefinition for RDMA: This is crucial for high-performance inter-node communication. Multus CNI allows pods to attach to multiple network interfaces. You'll define NetworkAttachmentDefinition resources to expose the RDMA-capable fabric NICs on your GPU nodes. The Slurm chart then requests these interfaces (e.g., rdma/fabricN) for the slurmd pods.
  - Resource Requests: Specify GPU resources (e.g., nvidia.com/gpu: 8) and RDMA fabric NICs for your slurmd pods.
  - Optional Features: While we'll keep the install minimal, the chart supports enabling accounting (slurmdbd) and Prometheus metrics if needed.
Here's a conceptual snippet of what your values.yaml might contain:

    slurmctld:
      nodeSelector:
        doks.digitalocean.com/node-pool: mgmt
      tolerations:
        - key: "doks.digitalocean.com/node-pool"
          operator: "Equal"
          value: "mgmt"
          effect: "NoSchedule"

    slurmd:
      nodeSelector:
        doks.digitalocean.com/node-pool: gpu
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      resources:
        # Extended resources such as GPUs must be set under limits;
        # Kubernetes uses the limit as the request.
        limits:
          nvidia.com/gpu: 8
          rdma/fabric0: 1  # Example for Multus RDMA network
      image:
        repository: your-registry.com/slurm/slurmd
        tag: custom-gpu
        pullPolicy: Always

    login:
      nodeSelector:
        doks.digitalocean.com/node-pool: mgmt
      tolerations:
        - key: "doks.digitalocean.com/node-pool"
          operator: "Equal"
          value: "mgmt"
          effect: "NoSchedule"
      image:
        repository: your-registry.com/slurm/login
        tag: custom-gpu
        pullPolicy: Always

    persistentVolume:
      enabled: true
      nfs:
        server: "your-nfs-mount-source"
        path: "/"

    networkAttachmentDefinitions:
      - name: rdma-fabric0
        # Adjust "master" based on your node's RDMA interface.
        # JSON does not allow comments, so the note lives here.
        config: |
          {
            "cniVersion": "0.3.1",
            "type": "macvlan",
            "master": "eth1",
            "mode": "bridge",
            "ipam": {
              "type": "whereabouts",
              "range": "10.244.0.0/16"
            }
          }
- Install Slinky: Deploy Slinky into the slurm namespace using your configured values.yaml:
helm install slinky slurm-operator/slurm-operator -n slurm -f values.yaml
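The NetworkAttachmentDefinition described in the configuration step can also be kept as a standalone Multus manifest, which makes the CNI config easy to review on its own. The sketch below reuses the macvlan settings from the conceptual values snippet; the interface name eth1 and the address range are carried over as placeholders, and the range should be checked against your cluster's pod and service CIDRs so the two do not overlap.

```shell
# Placeholder: the host interface that carries the RDMA fabric.
FABRIC_IFACE="${FABRIC_IFACE:-eth1}"

# Write a Multus NetworkAttachmentDefinition for one fabric NIC.
cat > rdma-fabric0-nad.yaml <<EOF
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-fabric0
  namespace: slurm
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "${FABRIC_IFACE}",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "10.244.0.0/16"
      }
    }
EOF

echo "Wrote rdma-fabric0-nad.yaml (master interface: ${FABRIC_IFACE})"
```

Apply it with kubectl apply -f rdma-fabric0-nad.yaml. Pods attach to the network through the Multus annotation k8s.v1.cni.cncf.io/networks: rdma-fabric0; with 16 fabric NICs per node you would repeat the definition once per interface.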
Step 4: Validate GPU and RDMA Fabric Connectivity
After deployment, it's crucial to validate that your GPUs are accessible and, more importantly, that the RDMA fabric is correctly configured for high-speed, multi-node communication. Simply scheduling pods isn't enough; you need to confirm that collective operations can leverage the full bandwidth of your interconnect.
The NVIDIA Collective Communications Library (NCCL) provides a set of highly optimized routines for inter-GPU communication. Running an NCCL all-reduce benchmark across multiple nodes is the definitive test for validating multi-node GPU communication over an RDMA-enabled network.
- Access a Slurm Login Node: Use kubectl exec to get a shell on one of your Slurm login pods:
kubectl exec -it -n slurm $(kubectl get pods -n slurm -l app.kubernetes.io/component=login -o jsonpath='{.items[0].metadata.name}') -- bash
From within the login node, you can submit Slurm jobs.
- Submit an NCCL All-Reduce Job: Create a simple Slurm batch script (e.g., nccl_test.sh) that runs an NCCL benchmark. This script should request multiple nodes and GPUs, then execute an NCCL test that performs an all-reduce operation. Ensure your custom slurmd image includes the NCCL tests. For the all-reduce to span both nodes, the nccl-tests binaries need to be built with MPI support (MPI=1); without it, each task runs an independent single-node test.

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --gpus-per-node=8
    #SBATCH --exclusive
    #SBATCH --time=00:10:00
    #SBATCH --job-name=nccl-all-reduce
    #SBATCH --output=nccl-%j.out

    echo "Running NCCL all-reduce test..."
    srun /usr/local/cuda/samples/deviceQuery/deviceQuery # Verify GPUs are seen
    srun /usr/local/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 8

Submit the job:
sbatch nccl_test.sh
- Monitor and Verify Output: Check the job output (e.g., nccl-<jobid>.out). Look for successful NCCL initialization across all requested GPUs and nodes, and observe the reported bandwidths. High bandwidth numbers (approaching the theoretical maximums of your RDMA fabric) confirm that multi-node GPU communication over the mlx5_* interfaces is working correctly.
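Reading the bandwidth out of the job output can be scripted. The sketch below relies on the "Avg bus bandwidth" summary line that all_reduce_perf prints at the end of a run; the helper names (avg_busbw, check_busbw) and any threshold you pass in are hypothetical, and the right threshold depends on your fabric.

```shell
# Print the average bus bandwidth (GB/s) reported in an nccl-tests output file.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# Succeed only if the reported average is at least the given minimum (GB/s).
# Usage: check_busbw <output-file> <min-gbps>
check_busbw() {
  bw=$(avg_busbw "$1")
  if [ -z "$bw" ]; then
    echo "no bandwidth summary found in $1" >&2
    return 2
  fi
  # awk does the floating-point comparison; exit status 0 means bw >= min.
  awk -v bw="$bw" -v min="$2" 'BEGIN { exit !(bw + 0 >= min + 0) }'
}
```

For example, check_busbw nccl-<jobid>.out <min-gbps> returns a non-zero status when the run falls short of the minimum you chose, which makes the check easy to wire into a validation script.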
Conclusion
You have successfully deployed a Slurm cluster managed by the Slinky operator on Kubernetes, complete with GPU acceleration, shared NFS storage, and validated high-speed RDMA networking. This setup provides a robust and scalable environment for running HPC and AI workloads, leveraging the operational benefits of Kubernetes while retaining Slurm's powerful job scheduling capabilities.