# GPU Cluster - gpu.euw.container.mom

Obsidian: Projects/gpu-euw-container-mom

## Cluster Info

- **API**: https://api.gpu.euw.container.mom:6443
- **Platform**: OCP 4.20 on RHEL CoreOS 9.6
- **Kernel**: 5.14.0-570.73.1.el9_6.x86_64
- **Nodes**: maple (192.168.0.100), oak (192.168.0.53), willow (192.168.0.149)
- **Hardware**: Dell PowerEdge R630

## Hardware per Node

| Component | Details |
|-----------|---------|
| GPU | NVIDIA Tesla P4 (GP104GL) at PCI 03:00.0 |
| NIC | Mellanox ConnectX-4 Lx at PCI 81:00.0, 81:00.1 |
| GPU Memory | 7680 MiB |
| NIC Ports | 2 (RoCEv2 capable) |

## Operators Installed

| Operator | Namespace | Version |
|----------|-----------|---------|
| Node Feature Discovery | openshift-nfd | 4.20.0-202512081147 |
| NVIDIA GPU Operator | nvidia-gpu-operator | 25.10.1 |
| NVIDIA Network Operator | nvidia-network-operator | 25.10.0 |
| SR-IOV Network Operator | openshift-sriov-network-operator | 4.20.0-202512081147 |

## Driver Versions

| Driver | Version |
|--------|---------|
| NVIDIA Driver | 580.105.08 |
| CUDA | 13.0 |
| DOCA/MOFED | 25.10-1.2.8.0-2 (doca3.2.0) |

## Custom Resources

- **ClusterPolicy**: `gpu-cluster-policy` with RDMA enabled
- **NicClusterPolicy**: `nic-cluster-policy` with DOCA driver
- **NodeFeatureDiscovery**: `nfd-instance` in openshift-nfd

## Node Resources

All nodes expose (an example pod requesting both is sketched at the end of this doc):

- `nvidia.com/gpu: 1`
- `rdma/rdma_shared_device_a: 63`

## Common Commands

```bash
# Check operator policy status
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}'
oc get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}'

# View operator pods
oc get pods -n nvidia-gpu-operator
oc get pods -n nvidia-network-operator

# Run nvidia-smi (must use the driver container, not the host)
oc exec -n nvidia-gpu-operator $(oc get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}') -c nvidia-driver-ctr -- nvidia-smi

# Check RDMA devices
oc exec -n nvidia-network-operator $(oc get pods -n nvidia-network-operator -l app=mofed -o jsonpath='{.items[0].metadata.name}') -c mofed-container -- ls /sys/class/infiniband/

# Check node resources
oc get nodes -o json | jq '.items[] | {name: .metadata.name, gpu: .status.allocatable["nvidia.com/gpu"], rdma: .status.allocatable["rdma/rdma_shared_device_a"]}'
```

## Configuration Files

- `nicclusterpolicy.yaml` - NicClusterPolicy manifest with DOCA driver config

## Documentation

See the `notes/` directory:

- `cluster-state.txt` - Initial cluster state before setup
- `deployment-complete.txt` - Final deployment status
- `setup-guide.txt` - Step-by-step setup guide
- `troubleshooting.txt` - Issues encountered and solutions
- `hardware-inventory.txt` - Node hardware details

## Key Configuration Notes

1. The DOCA driver version must match the RHEL version: `doca3.2.0-25.10-1.2.8.0-2` for RHEL 9.6
2. RDMA shared device plugin: use v1.5.2 (v1.6.0 doesn't exist)
3. Must set `UNLOAD_STORAGE_MODULES=true` in `ofedDriver.env`
4. GPU drivers wait for MOFED when `driver.rdma.enabled=true`
5. ConnectX-4 Lx uses RoCEv2 (Ethernet), not InfiniBand
6. `nvidia-smi` and `ibstat` are in containers, not on the host RHCOS; see the verification sketches below
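## Verification Sketches

Note 4 above can be spot-checked against the live ClusterPolicy: the GPU driver pods only proceed once MOFED is loaded because RDMA is enabled there. A minimal read-only query (the field path comes from the GPU Operator ClusterPolicy CRD, not from this cluster's manifests):

```bash
# Should print "true" when GPUDirect RDMA is enabled in the GPU Operator config
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.driver.rdma.enabled}'
```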
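Notes 5 and 6 can be checked from inside the DOCA/MOFED driver container, using the same pod selector and container name as in Common Commands. A sketch, assuming the usual MOFED tooling is present in the container image:

```bash
# Find the MOFED driver pod
MOFED_POD=$(oc get pods -n nvidia-network-operator -l app=mofed -o jsonpath='{.items[0].metadata.name}')

# For RoCEv2 the ConnectX-4 Lx ports report "Link layer: Ethernet", not InfiniBand
oc exec -n nvidia-network-operator "$MOFED_POD" -c mofed-container -- ibstat

# Map RDMA devices to their netdev names (ships with MOFED; skip if absent)
oc exec -n nvidia-network-operator "$MOFED_POD" -c mofed-container -- ibdev2netdev
```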
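## Example Test Pod

A throwaway pod can confirm that the `nvidia.com/gpu` and `rdma/rdma_shared_device_a` resources listed under Node Resources are actually schedulable. A minimal sketch; the image tag is an assumption (any CUDA base image reachable from the cluster should work, since the container toolkit injects `nvidia-smi` and the driver libraries at runtime):

```bash
# Request one GPU and one RDMA shared device, run nvidia-smi, and exit
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-rdma-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # assumption: substitute a known-good tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_a: 1
EOF

# Inspect the output, then clean up
oc logs gpu-rdma-test
oc delete pod gpu-rdma-test
```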