Running mpirun across multi-node GPU workers on Kubernetes pods is surprisingly tricky. Here’s what I learned debugging exit code 255 with zero error messages.
## The Setup
A CPU head node orchestrates an NCCL parameter sweep. GPU worker nodes are MPI targets — rank 0 runs mpirun, which SSHes into all workers to launch orted (OpenMPI’s remote daemon). Simple in theory, painful in practice.
## Root Cause 1: Hostname Resolution
`socket.gethostname()` on Kubernetes returns FQDNs like `pod-abc123.namespace.svc.cluster.local`. mpirun strips the domain suffix and tries to SSH to just `pod-abc123`, which DNS can’t resolve.
Fix: Use actual numeric IPs in the hostfile and SSH config.
```python
import socket

hostname = socket.gethostname()
ip = socket.gethostbyname(hostname)  # e.g. "240.4.146.7"
```
The hostfile should contain `240.4.146.7 slots=8`, not the FQDN.
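Generating the hostfile is mechanical once every worker has reported its numeric IP. A minimal sketch (the `hostfile_lines` helper and the example IPs are illustrative, not from the original setup):

```python
def hostfile_lines(worker_ips, slots_per_node=8):
    """One hostfile line per worker, keyed on numeric IP rather than FQDN."""
    return [f"{ip} slots={slots_per_node}" for ip in worker_ips]

# Each worker would resolve its own IP (as above), and rank 0 would gather
# the results before writing them out as the hostfile.
print("\n".join(hostfile_lines(["240.4.146.7", "240.4.98.82"])))
```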
## Root Cause 2: Empty LD_LIBRARY_PATH on Remote
mpirun SSHes to remote nodes to launch orted. The SSH session starts with a clean environment — LD_LIBRARY_PATH is empty. orted can’t find shared libraries and fails silently with exit 255 in 0 seconds. No error message. Nothing.
Fix: Both sides must cooperate:
```
# Client side (~/.ssh/config) — forward the env var
Host 240.4.146.7
    SendEnv LD_LIBRARY_PATH
```

```sh
# Server side (sshd) — accept it
/usr/sbin/sshd -o AcceptEnv=LD_LIBRARY_PATH ...
```
Without AcceptEnv, the server silently drops the forwarded variable. Without SendEnv, the client never sends it. You need both.
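Since both halves are just text, they can be templated from one place so they never drift apart. A sketch with hypothetical helper names (not from the original code):

```python
def sendenv_stanza(host, var="LD_LIBRARY_PATH"):
    """Client half: ~/.ssh/config entry that offers the variable to the server."""
    return f"Host {host}\n    SendEnv {var}"

def acceptenv_flag(var="LD_LIBRARY_PATH"):
    """Server half: the matching sshd -o option that accepts the variable."""
    return f"-o AcceptEnv={var}"
```

A quick manual check is `ssh <worker-ip> 'echo $LD_LIBRARY_PATH'` (single quotes, so the variable expands remotely); an empty line means one of the two halves is missing.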
## Root Cause 3: System sshd Conflict
Kubernetes pods typically already run a system sshd on port 22. Two problems:
- Modifying `/etc/ssh/sshd_config` has no effect: the running sshd loaded the old config at startup
- Starting another sshd on port 22 silently fails: the port is already bound
Fix: Start your own sshd instance on a dynamically-assigned port with its own host keys and authorized_keys:
```sh
/usr/sbin/sshd \
    -h /run/my-app/etc/sshd/ssh_host_rsa_key \
    -h /run/my-app/etc/sshd/ssh_host_ecdsa_key \
    -h /run/my-app/etc/sshd/ssh_host_ed25519_key \
    -o PasswordAuthentication=no \
    -o AuthorizedKeysFile=/run/my-app/etc/sshd/authorized_keys \
    -o StrictModes=no \
    -o UsePAM=yes \
    -o AcceptEnv=LD_LIBRARY_PATH \
    -p 48231 \
    -D
```
## Getting a Free Port
Don’t use random.randint() — you might collide with an existing service. Let the OS assign a guaranteed-free port:
```python
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("", 0))            # port 0 = let the OS choose
    port = s.getsockname()[1]
```
Bind to port 0 and the OS picks a free one. Close the socket, then immediately start sshd on that port. There is a tiny race window between closing the socket and binding sshd, but in practice the chance of collision is effectively zero.
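Wrapped as a reusable helper (the function name is mine, not from the original code):

```python
import socket

def find_free_port():
    """Bind to port 0 so the OS assigns an unused TCP port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]  # socket closes when the with-block exits

port = find_free_port()  # hand this straight to sshd's -p flag
```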
## The Complete SSH Config Pattern
Each worker generates its own keypair and starts sshd on a free port. Then all workers exchange pubkeys and ports. The resulting SSH client config looks like:
```
Host 240.4.98.82
    Port 59327
    StrictHostKeyChecking no
    IdentityFile /run/my-app/etc/ssh/id_ed25519
    HostName 240.4.98.82
    LogLevel ERROR
    SendEnv LD_LIBRARY_PATH
```
Per-host entries with actual IPs and per-node ports. mpirun reads ~/.ssh/config automatically.
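These per-host entries can be rendered directly from the exchanged (IP, port) pairs. A sketch, assuming a hypothetical `ssh_config_for` helper and reusing the paths shown above:

```python
def ssh_config_for(workers, identity="/run/my-app/etc/ssh/id_ed25519"):
    """Render ~/.ssh/config entries from a mapping of worker IP -> sshd port."""
    stanzas = []
    for ip, port in sorted(workers.items()):
        stanzas.append(
            f"Host {ip}\n"
            f"    Port {port}\n"
            f"    StrictHostKeyChecking no\n"
            f"    IdentityFile {identity}\n"
            f"    HostName {ip}\n"
            f"    LogLevel ERROR\n"
            f"    SendEnv LD_LIBRARY_PATH"
        )
    return "\n\n".join(stanzas)
```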
## mpirun Flags Reference
```sh
/opt/amazon/openmpi/bin/mpirun \
    --hostfile /tmp/hostfile \
    --map-by ppr:1:node \
    --allow-run-as-root \
    --prefix /opt/amazon/openmpi \
    -x LD_LIBRARY_PATH \
    -x PATH \
    -x NCCL_DEBUG \
    all_reduce_perf -b 8 -e 1G -g 8
```

- `--hostfile`: the IP-based hostfile
- `--map-by ppr:1:node`: one process per node
- `--allow-run-as-root`: containers run as root
- `-x VAR`: forward env vars via mpirun too
--prefix is critical — it tells remote nodes where to find orted without relying on PATH.
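If the sweep is driven from Python, the same invocation can be assembled as an argv list and handed to `subprocess.run`, sidestepping shell quoting entirely (the function name and its defaults are illustrative):

```python
def mpirun_argv(hostfile, prefix="/opt/amazon/openmpi",
                forward=("LD_LIBRARY_PATH", "PATH", "NCCL_DEBUG")):
    """Assemble the mpirun command line above as a subprocess argv list."""
    argv = [
        f"{prefix}/bin/mpirun",
        "--hostfile", hostfile,
        "--map-by", "ppr:1:node",
        "--allow-run-as-root",
        "--prefix", prefix,
    ]
    for var in forward:
        argv += ["-x", var]  # forward each env var through mpirun as well
    return argv + ["all_reduce_perf", "-b", "8", "-e", "1G", "-g", "8"]
```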
## Summary
| Issue | Symptom | Fix |
|---|---|---|
| FQDN in hostfile | Could not resolve hostname | Use numeric IPs from `gethostbyname()` |
| Empty `LD_LIBRARY_PATH` | orted fails, exit 255, 0 seconds | `SendEnv` + `AcceptEnv LD_LIBRARY_PATH` |
| System sshd conflict | Config changes ignored | Own sshd on free port with own keys |
| Port collision | sshd fails to start | `bind(("", 0))` for OS-assigned port |
| mpirun can’t find orted | exit 255 | `--prefix /opt/amazon/openmpi` |
The frustrating part: mpirun gives you exit 255 for all of these with no explanation. The debugging strategy is to SSH manually between nodes and check what environment the remote session gets — that’s where the answers are.