
LUMI AMD GPU Provider

The lumi_amd_gpu provider connects QCOS to the LUMI EuroHPC supercomputer, giving access to AMD MI250X GPUs running Qiskit Aer with ROCm. LUMI is one of the largest supercomputers in the world, and its GPU partition is available to researchers and companies through EuroHPC allocations.


Hardware Specs

| Component | Spec |
|---|---|
| GPU | AMD Instinct MI250X |
| GCDs per node | 8 (each GCD = independent GPU-like unit) |
| HBM2e per GCD | 64 GB (128 GB per physical card) |
| Interconnect | Slingshot-11 (200 Gb/s between nodes) |
| CPU per node | AMD EPYC 7A53, 64 cores |
| NVMe (flash) | 3.75 TB per node |
| Max nodes in partition standard-g | 2,560 nodes = 20,480 GCDs |

In QCOS terms:

  • Each GCD runs an exact ROCm statevector simulation of up to 34 qubits (64 GB HBM)
  • Each node additionally hosts one MPS (matrix product state) node supporting up to 2,000 qubits
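These per-node figures compose linearly, which is how the capacity numbers later in this page are derived. A quick sanity check (pure arithmetic; the helper below is illustrative and not part of QCOS):

```python
# Back-of-envelope cluster capacity, using the per-node figures quoted above.
GCDS_PER_NODE = 8
SV_QUBITS_PER_GCD = 34      # exact statevector limit per GCD (per this guide)
MPS_QUBITS_PER_NODE = 2000  # approximate MPS limit per node

def node_capacity(n_nodes: int) -> dict:
    """Totals for an allocation of n_nodes LUMI-G nodes."""
    return {
        "sv_nodes": n_nodes * GCDS_PER_NODE,
        "sv_qubits_total": n_nodes * GCDS_PER_NODE * SV_QUBITS_PER_GCD,
        "mps_nodes": n_nodes,
        "mps_qubits_total": n_nodes * MPS_QUBITS_PER_NODE,
    }

print(node_capacity(4))
# {'sv_nodes': 32, 'sv_qubits_total': 1088, 'mps_nodes': 4, 'mps_qubits_total': 8000}
```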

Prerequisites

1. LUMI Access

You need a LUMI allocation. Options:

  • EuroHPC Regular Access — competitive calls, free for EU researchers
  • EuroHPC Extreme Access — for very large allocations
  • LUMI-G pilot — for companies via CSC (Finland)

Contact: lumi-supercomputer.eu

2. SSH Configuration

Add to your ~/.ssh/config:

Host lumi
    HostName lumi.csc.fi
    User YOUR_LUMI_USERNAME
    IdentityFile ~/.ssh/id_ed25519_lumi
    ServerAliveInterval 60

Test connection:

ssh lumi hostname
# Expected: lumi-login01.lumi.csc.fi (or similar)

3. Python environment on LUMI

The SynapseX project (project_465002463) already has a venv at /flash/project_465002463/venv_synapsex/. For your own project:

# On LUMI login node
ssh lumi

# Load Python module
module load cray-python/3.11.7

# Create venv (on flash for speed)
python3 -m venv /flash/$LUMI_PROJECT/venv_qcos
source /flash/$LUMI_PROJECT/venv_qcos/bin/activate

# Install Qiskit Aer with ROCm support
pip install qiskit qiskit-aer-gpu
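Before submitting jobs it is worth confirming that the packages actually resolve inside the venv. A dependency-free check (stdlib only; this snippet is a convenience sketch, not part of QCOS):

```python
# Stdlib-only check that the expected packages resolve in this venv.
import importlib.util

missing = [m for m in ("qiskit", "qiskit_aer") if importlib.util.find_spec(m) is None]
print("qiskit stack OK" if not missing else f"missing packages: {missing}")
```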

Using the LUMI Provider

Basic registration

from network.node_providers import ClusterBuilder, make_lumi_cluster
from network.distributed_qvm import DistributedQVM, QVMMode

# Simple: 1 LUMI node × 8 GCDs = 8 statevector nodes + 1 MPS node
registry = make_lumi_cluster(
    n_nodes=1,
    gpus_per_node=8,
    lumi_host="lumi",
    lumi_project="project_465002463",
    include_local_fallback=True,  # adds local CPU for small circuits
)

qvm = DistributedQVM(registry, mode=QVMMode.EMULATED, shots=4096)

Multi-node LUMI cluster

# 4 LUMI nodes × 8 GCDs = 32 SV nodes (34q each) + 4 MPS nodes (2000q each)
registry = (
    ClusterBuilder()
    .add("lumi_amd_gpu",
         n_nodes=4,
         gpus_per_node=8,
         max_qubits_sv=34,     # statevector per GCD (RAM limited)
         max_qubits_mps=2000,  # MPS per node
         lumi_host="lumi",
         lumi_project="project_465002463")
    .add("local_cpu", n_nodes=4)  # local fallback for tiny circuits
    .build()
)

report = registry.status_report()
print(f"Nodes: {report['total_nodes']}")
print(f"Online: {report['online_nodes']}")
print(f"Total qubits: {report['total_physical_qubits']}")
# Nodes: 40 (32 SV + 4 MPS + 4 local CPU)
# Online: 40
# Total qubits: 9188

Inspecting LUMI node details

for node in registry.all_nodes():
    if "lumi" in node.tags:
        print(f"{node.node_id:30s} {node.node_type.value:15s} {node.max_qubits:5d}q")

lumi-n00-gcd0                  gpu_aer            34q
lumi-n00-gcd1                  gpu_aer            34q
...
lumi-n00-gcd7                  gpu_aer            34q
lumi-n00-mps                   gpu_tensor       2000q
lumi-n01-gcd0                  gpu_aer            34q
...
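The listing follows a predictable naming scheme: one gcd0…gcd7 statevector node per GCD, plus one MPS node per physical node. A sketch of that convention (the generator below merely reproduces the IDs shown above for illustration; it is not the provider's actual implementation):

```python
# Reproduce the node-ID convention shown in the listing above (assumed scheme).
def lumi_node_ids(n_nodes: int, gpus_per_node: int = 8) -> list[str]:
    ids: list[str] = []
    for n in range(n_nodes):
        # one statevector node per GCD ...
        ids += [f"lumi-n{n:02d}-gcd{g}" for g in range(gpus_per_node)]
        # ... plus one MPS node per physical node
        ids.append(f"lumi-n{n:02d}-mps")
    return ids

ids = lumi_node_ids(2)
print(len(ids))                # 18 = 2 * (8 GCDs + 1 MPS)
print(ids[0], ids[8], ids[9])  # lumi-n00-gcd0 lumi-n00-mps lumi-n01-gcd0
```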

Running circuits on LUMI

Currently, LUMI nodes are registered as local Aer nodes with ROCm credentials stored in QPUNodeSpec.credentials. The NodeExecutor routes to Aer automatically.

For direct SLURM batch submission (running the actual Aer simulation on LUMI GPUs), use the CustomRESTProvider with a lightweight REST wrapper deployed on LUMI.

Option A — Local simulation (current default)

The LUMI nodes in the registry describe the topology. Execution still runs locally using Aer. This is useful for:

  • Planning shard assignments as if you had LUMI nodes
  • Testing circuit decomposition before actual LUMI submission
  • Emulation mode with realistic qubit counts

Option B — REST bridge via SSH tunnel

Deploy a minimal FastAPI server on a LUMI login node that forwards circuits to GPU jobs:

# On LUMI: launch the REST bridge (once per session)
ssh lumi "cd /flash/project_465002463 && \
source venv_qcos/bin/activate && \
uvicorn qcos_lumi_bridge:app --host 0.0.0.0 --port 8888 &"

# Open SSH tunnel from your local machine
ssh -N -L 18888:localhost:8888 lumi &

Then register it via custom_rest:

registry = (
    ClusterBuilder()
    .add("custom_rest",
         endpoints=["http://localhost:18888"] * 32,  # 32 virtual workers
         max_qubits=34,
         api_key="lumi-bridge-secret")
    .build()
)

See Custom Providers for the bridge server implementation.
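The real bridge server lives in Custom Providers and uses FastAPI. As a dependency-free illustration of the request/response shape such a bridge exposes to the custom_rest provider, here is a minimal stand-in built on Python's http.server. The field names (qasm, shots, counts) and the /run route are assumptions for illustration, not the actual bridge API:

```python
# Dependency-free stand-in for the LUMI REST bridge (illustrative payload shape).
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

class BridgeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        # A real bridge would submit body["qasm"] to a GPU job and wait for it;
        # here we just echo a fake counts payload of the expected shape.
        result = {"counts": {"00": body["shots"] // 2, "11": body["shots"] // 2}}
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("localhost", 0), BridgeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = Request(
    f"http://localhost:{server.server_port}/run",
    data=json.dumps({"qasm": "OPENQASM 3;", "shots": 1024}).encode(),
    headers={"Content-Type": "application/json"},
)
resp = json.load(urlopen(req))
print(resp)  # {'counts': {'00': 512, '11': 512}}
server.shutdown()
```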

Option C — Direct SLURM batch (via SynapseX pipeline)

Use the existing synapsex_server pipeline infrastructure to submit full training or simulation jobs:

# From your local machine
./scripts/pipeline/run.sh \
--stage sft \
--profile qwen2_5_14b_fast \
--tag qcos_sim_v1

The SLURM job template (sft.sbatch.j2) already handles:

  • module load LUMI/24.03 partition/G rocm/6.0.3
  • ROCm environment: HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  • Distributed launch: srun --ntasks-per-node=1

SLURM job for QCOS quantum simulation

For running a QCOS circuit batch directly on LUMI:

#!/bin/bash
#SBATCH --job-name=qcos_sim
#SBATCH --account=project_465002463
#SBATCH --partition=standard-g
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=56
#SBATCH --time=01:00:00
#SBATCH --output=/scratch/project_465002463/logs/qcos_%j.out

module load LUMI/24.03 partition/G rocm/6.0.3

export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True

source /flash/project_465002463/venv_qcos/bin/activate

cd /flash/project_465002463/qcos_server

# Run your QCOS quantum simulation
PYTHONPATH=src python -c "
from network.node_providers import make_local_cluster
from network.distributed_qvm import DistributedQVM, VirtualCircuitSpec, QVMMode
from algorithms.shor import ShorAlgorithm

# On LUMI, local Aer uses ROCm automatically
registry = make_local_cluster(n_gpu_sv_nodes=8, n_gpu_mps_nodes=1)
qvm = DistributedQVM(registry, mode=QVMMode.EMULATED, shots=8192)
algo = ShorAlgorithm(qvm=qvm, shots=8192)
result = algo.run(N=32231)
print(result.to_dict())
"

Submit from local machine:

rsync -avz src/ lumi:/flash/project_465002463/qcos_server/src/
ssh lumi "sbatch /flash/project_465002463/qcos_server/jobs/qcos_sim.sbatch"

Capacity by LUMI allocation

| LUMI nodes | GCDs | SV qubits (exact) | MPS qubits (approx.) | GPU-hours/hour |
|---|---|---|---|---|
| 1 | 8 | 272 | 2,000 | 8 |
| 4 | 32 | 1,088 | 8,000 | 32 |
| 8 | 64 | 2,176 | 16,000 | 64 |
| 16 | 128 | 4,352 | 32,000 | 128 |
| 64 | 512 | 17,408 | 128,000 | 512 |
Cost comparison

LUMI GPU-hours are billed per allocation (roughly €0.01–0.05 per GPU-hour depending on the access scheme), versus $0.0009–$0.00975 per shot on cloud QPUs. For an 8192-shot Shor's-algorithm circuit, LUMI works out roughly 100–1,000× cheaper than cloud QPUs.
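The quoted ratio can be reproduced with back-of-envelope arithmetic (assumptions: the whole batch fits within one GPU-hour, billed at the upper €0.05 rate, and the EUR/USD exchange rate is ignored):

```python
# Back-of-envelope cost for one 8192-shot circuit, using the figures above.
shots = 8192
cloud_low, cloud_high = 0.0009 * shots, 0.00975 * shots  # cloud QPU per-shot pricing
lumi_cost = 0.05  # assume one full GPU-hour at the upper EUR 0.05 rate

print(f"cloud QPU: ${cloud_low:.2f} - ${cloud_high:.2f}")  # $7.37 - $79.87
print(f"ratio vs LUMI: {cloud_low / lumi_cost:.0f}x - {cloud_high / lumi_cost:.0f}x")
# roughly 147x - 1597x, consistent with the 100-1000x range quoted above
```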


Troubleshooting

SSH connection fails

# Test with verbose output
ssh -v lumi

# Check key is loaded
ssh-add -l

# Add key if missing
ssh-add ~/.ssh/id_ed25519_lumi

ROCm not available

# On LUMI compute node
module load LUMI/24.03 partition/G rocm/6.0.3
rocm-smi # should show 8 MI250X GCDs

Aer doesn't use GPU

# Check Aer GPU availability
from qiskit_aer import AerSimulator
backend = AerSimulator(method="statevector", device="GPU")
print(backend.configuration().n_qubits)
# If this fails, Aer was not built with ROCm support.
# Solution: install a ROCm-enabled build of qiskit-aer
# (the stock qiskit-aer-gpu wheel on PyPI targets CUDA)

Out of memory at 34q

Reduce max_qubits_sv to 32 or 30; each qubit removed halves the statevector's memory footprint:

.add("lumi_amd_gpu", n_nodes=2, gpus_per_node=8, max_qubits_sv=32)