LUMI AMD GPU Provider
The `lumi_amd_gpu` provider connects QCOS to the LUMI EuroHPC supercomputer, giving access to AMD Instinct MI250X GPUs running Qiskit Aer with ROCm. LUMI is one of the largest supercomputers in the world, and its GPU partition is available to researchers and companies through EuroHPC allocations.
Hardware Specs
| Component | Spec |
|---|---|
| GPU | AMD Instinct MI250X (4 per node) |
| GCDs per node | 8 (each GCD = independent GPU-like unit) |
| HBM2e per GCD | 64 GB (128 GB per physical card) |
| Interconnect | Slingshot-11 (200 Gb/s between nodes) |
| CPU per node | AMD EPYC 7A53 64-core |
| NVMe (flash) | 3.75 TB per node |
| Max nodes (standard-g partition) | 2,560 nodes = 20,480 GCDs |
In QCOS terms:
- Each GCD holds up to 32 qubits in exact double-precision statevector (2^32 × 16 B = 64 GB); the 34-qubit figure used below assumes memory pooled across a node's GCDs, e.g. via Aer's cache blocking (2^34 × 16 B = 256 GB, within a node's 512 GB)
- Each node also hosts one MPS (matrix product state) node supporting up to 2,000 qubits
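These limits are plain memory arithmetic: an exact statevector of n qubits stores 2^n complex doubles at 16 bytes each. A quick sketch:

```python
import math

def max_statevector_qubits(mem_bytes: int, bytes_per_amp: int = 16) -> int:
    """Largest n such that 2**n amplitudes (complex doubles) fit in mem_bytes."""
    return int(math.floor(math.log2(mem_bytes / bytes_per_amp)))

GIB = 2**30
print(max_statevector_qubits(64 * GIB))      # one 64 GB GCD: 32
print(max_statevector_qubits(8 * 64 * GIB))  # whole node pooled (512 GB): 35
```

Dropping to single precision (8 bytes per amplitude) buys exactly one extra qubit per memory budget.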
Prerequisites
1. LUMI Access
You need a LUMI allocation. Options:
- EuroHPC Regular Access — competitive calls, free for EU researchers
- EuroHPC Extreme Access — for very large allocations
- LUMI-G pilot — for companies via CSC (Finland)
Contact: lumi-supercomputer.eu
2. SSH Configuration
Add to your `~/.ssh/config`:

```
Host lumi
    HostName lumi.csc.fi
    User YOUR_LUMI_USERNAME
    IdentityFile ~/.ssh/id_ed25519_lumi
    ServerAliveInterval 60
```
Test the connection:

```bash
ssh lumi hostname
# Expected: lumi-login01.lumi.csc.fi (or similar)
```
3. Python environment on LUMI
The SynapseX project (`project_465002463`) already has a venv at `/flash/project_465002463/venv_synapsex/`. For your own project:
```bash
# On a LUMI login node
ssh lumi

# Load the Python module
module load cray-python/3.11.7

# Create a venv (on /flash for speed)
python3 -m venv /flash/$LUMI_PROJECT/venv_qcos
source /flash/$LUMI_PROJECT/venv_qcos/bin/activate

# Install Qiskit and Aer (see the note below on ROCm)
pip install qiskit qiskit-aer-gpu
```

Note: the `qiskit-aer-gpu` wheel on PyPI is built for CUDA; for ROCm acceleration on MI250X, Qiskit Aer generally has to be built from source with the ROCm/HIP Thrust backend.
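Before queuing SLURM jobs, it is worth checking whether Aer actually sees a GPU device. A minimal sketch (the helper name `aer_devices` is ours; `AerSimulator.available_devices()` is the Aer API):

```python
def aer_devices():
    """Return Aer's device list, or None if qiskit-aer is not installed."""
    try:
        from qiskit_aer import AerSimulator
    except ImportError:
        return None
    return list(AerSimulator().available_devices())

devices = aer_devices()
if devices is None:
    print("qiskit-aer not installed")
elif "GPU" in devices:
    print("Aer GPU backend available:", devices)
else:
    print("CPU-only Aer build:", devices)
```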
Using the LUMI Provider
Basic registration
```python
from network.node_providers import ClusterBuilder, make_lumi_cluster
from network.distributed_qvm import DistributedQVM, QVMMode

# Simple: 1 LUMI node × 8 GCDs = 8 statevector nodes + 1 MPS node
registry = make_lumi_cluster(
    n_nodes=1,
    gpus_per_node=8,
    lumi_host="lumi",
    lumi_project="project_465002463",
    include_local_fallback=True,  # adds local CPU for small circuits
)

qvm = DistributedQVM(registry, mode=QVMMode.EMULATED, shots=4096)
```
Multi-node LUMI cluster
```python
# 4 LUMI nodes × 8 GCDs = 32 SV nodes (34q each) + 4 MPS nodes (2000q each)
registry = (
    ClusterBuilder()
    .add("lumi_amd_gpu",
         n_nodes=4,
         gpus_per_node=8,
         max_qubits_sv=34,     # statevector per GCD (RAM limited)
         max_qubits_mps=2000,  # MPS per node
         lumi_host="lumi",
         lumi_project="project_465002463")
    .add("local_cpu", n_nodes=4)  # local fallback for tiny circuits
    .build()
)

report = registry.status_report()
print(f"Nodes: {report['total_nodes']}")
print(f"Online: {report['online_nodes']}")
print(f"Total qubits: {report['total_physical_qubits']}")
# Nodes: 40 (32 SV + 4 MPS + 4 local CPU)
# Online: 40
# Total qubits: 9188
```
Inspecting LUMI node details
```python
for node in registry.all_nodes():
    if "lumi" in node.tags:
        print(f"  {node.node_id:30s} {node.node_type.value:15s} {node.max_qubits:5d}q")
```

Output:

```
lumi-n00-gcd0                  gpu_aer            34q
lumi-n00-gcd1                  gpu_aer            34q
...
lumi-n00-gcd7                  gpu_aer            34q
lumi-n00-mps                   gpu_tensor       2000q
lumi-n01-gcd0                  gpu_aer            34q
...
```
Running circuits on LUMI
Currently, LUMI nodes are registered as local Aer nodes, with ROCm credentials stored in `QPUNodeSpec.credentials`; the `NodeExecutor` routes to Aer automatically.
For direct SLURM batch submission (running the actual Aer simulation on LUMI GPUs), use the `CustomRESTProvider` with a lightweight REST wrapper deployed on LUMI:
Option A — Local simulation (current default)
The LUMI nodes in the registry describe the topology. Execution still runs locally using Aer. This is useful for:
- Planning shard assignments as if you had LUMI nodes
- Testing circuit decomposition before actual LUMI submission
- Emulation mode with realistic qubit counts
Option B — LUMI via REST wrapper (recommended for production)
Deploy a minimal FastAPI server on a LUMI login node that forwards circuits to GPU jobs:
```bash
# On LUMI: launch the REST bridge (once per session).
# nohup + output redirection keep uvicorn alive after the ssh session ends.
ssh -f lumi "cd /flash/project_465002463 && \
    source venv_qcos/bin/activate && \
    nohup uvicorn qcos_lumi_bridge:app --host 0.0.0.0 --port 8888 \
        > bridge.log 2>&1 &"

# Open an SSH tunnel from your local machine
ssh -N -L 18888:localhost:8888 lumi &
```
Then register it via custom_rest:
```python
registry = (
    ClusterBuilder()
    .add("custom_rest",
         endpoints=["http://localhost:18888"] * 32,  # 32 virtual workers
         max_qubits=34,
         api_key="lumi-bridge-secret")
    .build()
)
```
See Custom Providers for the bridge server implementation.
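For orientation, the bridge itself can be a single POST route. A minimal sketch of `qcos_lumi_bridge.py`, assuming a `{"qasm": ..., "shots": ...}` request schema and an `x-api-key` header (these are assumptions, not the shipped implementation; the actual Aer hand-off is left as a stub):

```python
# qcos_lumi_bridge.py -- minimal sketch of the REST bridge (assumed schema)
API_KEY = "lumi-bridge-secret"

def check_request(payload: dict, api_key: str) -> str:
    """Validate an execution request; return the QASM body or raise ValueError."""
    if api_key != API_KEY:
        raise ValueError("bad api key")
    qasm = payload.get("qasm")
    if not isinstance(qasm, str) or not qasm.strip():
        raise ValueError("missing 'qasm' field")
    return qasm

try:  # FastAPI wiring; the module still imports without FastAPI for testing
    from fastapi import FastAPI, Header, HTTPException

    app = FastAPI()

    @app.post("/execute")
    def execute(payload: dict, x_api_key: str = Header(default="")):
        try:
            qasm = check_request(payload, x_api_key)
        except ValueError as exc:
            raise HTTPException(status_code=400, detail=str(exc))
        # Hand the circuit to Aer on the allocated GCDs (omitted in this sketch):
        # counts = run_on_aer(qasm, shots=payload.get("shots", 1024))
        return {"status": "queued", "qasm_chars": len(qasm)}
except ImportError:
    app = None
```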
Option C — Direct SLURM batch (via SynapseX pipeline)
Use the existing synapsex_server pipeline infrastructure to submit full training or simulation jobs:
```bash
# From your local machine
./scripts/pipeline/run.sh \
    --stage sft \
    --profile qwen2_5_14b_fast \
    --tag qcos_sim_v1
```
The SLURM job template (`sft.sbatch.j2`) already handles:
- Module loading: `module load LUMI/24.03 partition/G rocm/6.0.3`
- ROCm environment: `HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`
- Distributed launch: `srun --ntasks-per-node=1`
SLURM job for QCOS quantum simulation
For running a QCOS circuit batch directly on LUMI:
```bash
#!/bin/bash
#SBATCH --job-name=qcos_sim
#SBATCH --account=project_465002463
#SBATCH --partition=standard-g
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=56
#SBATCH --time=01:00:00
#SBATCH --output=/scratch/project_465002463/logs/qcos_%j.out

module load LUMI/24.03 partition/G rocm/6.0.3
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True

source /flash/project_465002463/venv_qcos/bin/activate
cd /flash/project_465002463/qcos_server

# Run your QCOS quantum simulation
PYTHONPATH=src python -c "
from network.node_providers import make_local_cluster
from network.distributed_qvm import DistributedQVM, VirtualCircuitSpec, QVMMode
from algorithms.shor import ShorAlgorithm

# On LUMI, local Aer uses ROCm automatically
registry = make_local_cluster(n_gpu_sv_nodes=8, n_gpu_mps_nodes=1)
qvm = DistributedQVM(registry, mode=QVMMode.EMULATED, shots=8192)

algo = ShorAlgorithm(qvm=qvm, shots=8192)
result = algo.run(N=32231)
print(result.to_dict())
"
```
Submit from your local machine:

```bash
rsync -avz src/ lumi:/flash/project_465002463/qcos_server/src/
ssh lumi "sbatch /flash/project_465002463/qcos_server/jobs/qcos_sim.sbatch"
```
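Submission can also be scripted end to end. A small sketch, assuming SSH access as configured above; `parse_job_id` relies only on sbatch's standard `Submitted batch job <id>` output line:

```python
import re
import subprocess

def parse_job_id(sbatch_output: str) -> str:
    """Extract the job id from sbatch's 'Submitted batch job <id>' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise RuntimeError(f"unexpected sbatch output: {sbatch_output!r}")
    return match.group(1)

def submit_qcos_job(host: str = "lumi") -> str:
    """Submit the QCOS sbatch script over SSH and return its SLURM job id."""
    out = subprocess.run(
        ["ssh", host, "sbatch",
         "/flash/project_465002463/qcos_server/jobs/qcos_sim.sbatch"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_job_id(out)
```

The returned id can then be polled with `ssh lumi squeue -j <id>` until the job finishes.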
Capacity by LUMI allocation
| LUMI nodes | GCDs | SV qubits (exact) | MPS qubits (approx.) | GPU-hours/hour |
|---|---|---|---|---|
| 1 | 8 | 272 | 2,000 | 8 |
| 4 | 32 | 1,088 | 8,000 | 32 |
| 8 | 64 | 2,176 | 16,000 | 64 |
| 16 | 128 | 4,352 | 32,000 | 128 |
| 64 | 512 | 17,408 | 128,000 | 512 |
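The rows above follow mechanically from the per-GCD and per-node limits used throughout this page; a quick sketch to reproduce them:

```python
def lumi_capacity(n_nodes: int, gcds_per_node: int = 8,
                  sv_qubits_per_gcd: int = 34, mps_qubits_per_node: int = 2000):
    """Reproduce one row of the capacity table from the per-unit limits."""
    gcds = n_nodes * gcds_per_node
    return {
        "gcds": gcds,
        "sv_qubits": gcds * sv_qubits_per_gcd,      # exact statevector total
        "mps_qubits": n_nodes * mps_qubits_per_node, # approximate MPS total
        "gpu_hours_per_hour": gcds,                  # billed per GCD-hour
    }

for nodes in (1, 4, 8, 16, 64):
    print(nodes, lumi_capacity(nodes))
```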
LUMI GPU-hours are billed against the allocation (roughly €0.01–0.05 per GPU-hour, depending on the access scheme), versus $0.0009–$0.00975 per shot on cloud QPUs. For an 8192-shot run of Shor's algorithm, LUMI works out roughly 100–1000× cheaper than cloud QPUs.
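That comparison is back-of-the-envelope arithmetic over the two price ranges quoted above (actual EuroHPC billing depends on the allocation):

```python
SHOTS = 8192

# Cloud QPU: priced per shot (range quoted above), in USD per job
qpu_cost_low, qpu_cost_high = 0.0009 * SHOTS, 0.00975 * SHOTS

# LUMI: one node (8 GCDs) for one hour, at the quoted EUR/GPU-hour range
lumi_cost_low, lumi_cost_high = 8 * 0.01, 8 * 0.05

print(f"Cloud QPU per job:  ${qpu_cost_low:.2f} - ${qpu_cost_high:.2f}")
print(f"LUMI per node-hour: EUR {lumi_cost_low:.2f} - {lumi_cost_high:.2f}")
# Comparing the ends of both ranges gives ratios from roughly 18x up to ~1000x,
# consistent with the order-of-magnitude claim for typical jobs.
```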
Troubleshooting
SSH connection fails
```bash
# Test with verbose output
ssh -v lumi

# Check the key is loaded
ssh-add -l

# Add the key if missing
ssh-add ~/.ssh/id_ed25519_lumi
```
ROCm not available
```bash
# On a LUMI compute node
module load LUMI/24.03 partition/G rocm/6.0.3
rocm-smi  # should show 8 MI250X GCDs
```
Aer doesn't use GPU
```python
# Check Aer GPU availability
from qiskit_aer import AerSimulator

print(AerSimulator().available_devices())  # should include "GPU"
backend = AerSimulator(method="statevector", device="GPU")
print(backend.configuration().n_qubits)
```

If this fails, Aer was not built with ROCm support. Note that the `qiskit-aer-gpu` wheel on PyPI targets CUDA, so on LUMI Aer typically has to be compiled from source with the ROCm/HIP backend.
Out of memory at 34q
Reduce `max_qubits_sv` to 32 (64 GB, the most a single GCD holds at double precision) or 30 (16 GB):

```python
.add("lumi_amd_gpu", n_nodes=2, gpus_per_node=8, max_qubits_sv=32)
```