# GPU Cluster - gpu.euw.container.mom

Obsidian: Projects/gpu-euw-container-mom

## Cluster Info

- **API**: https://api.gpu.euw.container.mom:6443
- **Platform**: OCP 4.20 on RHEL CoreOS 9.6
- **Kernel**: 5.14.0-570.73.1.el9_6.x86_64
- **Nodes**: maple (192.168.0.100), oak (192.168.0.53), willow (192.168.0.149)
- **Hardware**: Dell PowerEdge R630

## Hardware per Node

| Component | Details |
|-----------|---------|
| GPU | NVIDIA Tesla P4 (GP104GL) at PCI 03:00.0 |
| NIC | Mellanox ConnectX-4 Lx at PCI 81:00.0, 81:00.1 |
| GPU Memory | 7680 MiB |
| NIC Ports | 2 (RoCEv2 capable) |

## Operators Installed

| Operator | Namespace | Version |
|----------|-----------|---------|
| Node Feature Discovery | openshift-nfd | 4.20.0-202512081147 |
| NVIDIA GPU Operator | nvidia-gpu-operator | 25.10.1 |
| NVIDIA Network Operator | nvidia-network-operator | 25.10.0 |
| SR-IOV Network Operator | openshift-sriov-network-operator | 4.20.0-202512081147 |

## Driver Versions

| Driver | Version |
|--------|---------|
| NVIDIA Driver | 580.105.08 |
| CUDA | 13.0 |
| DOCA/MOFED | 25.10-1.2.8.0-2 (doca3.2.0) |

## Custom Resources

- **ClusterPolicy**: `gpu-cluster-policy` with RDMA enabled
- **NicClusterPolicy**: `nic-cluster-policy` with DOCA driver
- **NodeFeatureDiscovery**: `nfd-instance` in openshift-nfd

## Node Resources

All nodes expose (an example pod requesting both is sketched at the end of this doc):

- `nvidia.com/gpu: 1`
- `rdma/rdma_shared_device_a: 63`

## Common Commands

```bash
# Check operator policy status
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}'
oc get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}'

# View operator pods
oc get pods -n nvidia-gpu-operator
oc get pods -n nvidia-network-operator

# Run nvidia-smi (must use the driver container, not the host)
oc exec -n nvidia-gpu-operator $(oc get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}') -c nvidia-driver-ctr -- nvidia-smi

# Check RDMA devices
oc exec -n nvidia-network-operator $(oc get pods -n nvidia-network-operator -l app=mofed -o jsonpath='{.items[0].metadata.name}') -c mofed-container -- ls /sys/class/infiniband/

# Check node resources
oc get nodes -o json | jq '.items[] | {name: .metadata.name, gpu: .status.allocatable["nvidia.com/gpu"], rdma: .status.allocatable["rdma/rdma_shared_device_a"]}'
```

## Configuration Files

- `nicclusterpolicy.yaml` - NicClusterPolicy manifest with DOCA driver config

## Documentation

See the `notes/` directory:

- `cluster-state.txt` - Initial cluster state before setup
- `deployment-complete.txt` - Final deployment status
- `setup-guide.txt` - Step-by-step setup guide
- `troubleshooting.txt` - Issues encountered and solutions
- `hardware-inventory.txt` - Node hardware details

## Key Configuration Notes

1. The DOCA driver version must match the RHEL version: `doca3.2.0-25.10-1.2.8.0-2` for RHEL 9.6
2. RDMA shared device plugin: use v1.5.2 (v1.6.0 doesn't exist)
3. Must set `UNLOAD_STORAGE_MODULES=true` in `ofedDriver.env`
4. GPU drivers wait for MOFED when `driver.rdma.enabled=true`
5. ConnectX-4 Lx uses RoCEv2 (Ethernet), not InfiniBand
6. `nvidia-smi` and `ibstat` are in containers, not on the host RHCOS; see the verification sketches below
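## Verification Sketches

Note 4 above can be spot-checked against the live ClusterPolicy: the GPU driver pods only proceed once MOFED is loaded because RDMA is enabled there. A minimal read-only query (the field path comes from the GPU Operator ClusterPolicy CRD, not from this cluster's manifests):

```bash
# Should print "true" when GPUDirect RDMA is enabled in the GPU Operator config
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.driver.rdma.enabled}'
```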
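Notes 5 and 6 can be checked from inside the DOCA/MOFED driver container, using the same pod selector and container name as in Common Commands. A sketch, assuming the usual MOFED tooling is present in the container image:

```bash
# Find the MOFED driver pod
MOFED_POD=$(oc get pods -n nvidia-network-operator -l app=mofed -o jsonpath='{.items[0].metadata.name}')

# For RoCEv2 the ConnectX-4 Lx ports report "Link layer: Ethernet", not InfiniBand
oc exec -n nvidia-network-operator "$MOFED_POD" -c mofed-container -- ibstat

# Map RDMA devices to their netdev names (ships with MOFED; skip if absent)
oc exec -n nvidia-network-operator "$MOFED_POD" -c mofed-container -- ibdev2netdev
```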
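## Example Test Pod

A throwaway pod can confirm that the `nvidia.com/gpu` and `rdma/rdma_shared_device_a` resources listed under Node Resources are actually schedulable. A minimal sketch; the image tag is an assumption (any CUDA base image reachable from the cluster should work, since the container toolkit injects `nvidia-smi` and the driver libraries at runtime):

```bash
# Request one GPU and one RDMA shared device, run nvidia-smi, and exit
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-rdma-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # assumption: substitute a known-good tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_a: 1
EOF

# Inspect the output, then clean up
oc logs gpu-rdma-test
oc delete pod gpu-rdma-test
```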