LUMI AMD GPU Provider
The `lumi_amd_gpu` provider connects QCOS to the LUMI EuroHPC supercomputer, giving access to AMD Instinct MI250X GPUs running Qiskit Aer with ROCm. LUMI is one of the largest supercomputers in the world, and its GPU partition is available to researchers and companies through EuroHPC allocations.
Hardware Specs
| Component | Spec |
|---|---|
| GPU | AMD Instinct MI250X (4 per node) |
| GCDs per node | 8 (each GCD = independent GPU-like unit) |
| HBM2e per GCD | 64 GB (128 GB per physical card) |
| Interconnect | Slingshot-11 (200 Gb/s between nodes) |
| CPU per node | AMD EPYC 7A53 64-core |
| NVMe (flash) | 3.75 TB per node |
| Max nodes (standard-g partition) | 2,560 nodes = 20,480 GCDs |
In QCOS terms:
- Each GCD holds up to 32 qubits in exact double-precision statevector (2^32 × 16 B = 64 GB); the 34-qubit figure used below assumes memory pooled across a node's GCDs, e.g. via Aer's cache blocking (2^34 × 16 B = 256 GB, within a node's 512 GB)
- Each node also hosts one MPS (matrix product state) node supporting up to 2,000 qubits
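These limits are plain memory arithmetic: an exact statevector of n qubits stores 2^n complex doubles at 16 bytes each. A quick sketch:

```python
import math

def max_statevector_qubits(mem_bytes: int, bytes_per_amp: int = 16) -> int:
    """Largest n such that 2**n amplitudes (complex doubles) fit in mem_bytes."""
    return int(math.floor(math.log2(mem_bytes / bytes_per_amp)))

GIB = 2**30
print(max_statevector_qubits(64 * GIB))      # one 64 GB GCD: 32
print(max_statevector_qubits(8 * 64 * GIB))  # whole node pooled (512 GB): 35
```

Dropping to single precision (8 bytes per amplitude) buys exactly one extra qubit per memory budget.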
Prerequisites
1. LUMI Access
You need a LUMI allocation. Options:
- EuroHPC Regular Access — competitive calls, free for EU researchers
- EuroHPC Extreme Access — for very large allocations
- LUMI-G pilot — for companies via CSC (Finland)
Contact: lumi-supercomputer.eu
2. SSH Configuration
Add to your `~/.ssh/config`:

```
Host lumi
    HostName lumi.csc.fi
    User YOUR_LUMI_USERNAME
    IdentityFile ~/.ssh/id_ed25519_lumi
    ServerAliveInterval 60
```
Test the connection:

```bash
ssh lumi hostname
# Expected: lumi-login01.lumi.csc.fi (or similar)
```
3. Python environment on LUMI
The SynapseX project (`project_465002463`) already has a venv at `/flash/project_465002463/venv_synapsex/`. For your own project:
```bash
# On a LUMI login node
ssh lumi

# Load the Python module
module load cray-python/3.11.7

# Create a venv (on /flash for speed)
python3 -m venv /flash/$LUMI_PROJECT/venv_qcos
source /flash/$LUMI_PROJECT/venv_qcos/bin/activate

# Install Qiskit and Aer (see the note below on ROCm)
pip install qiskit qiskit-aer-gpu
```

Note: the `qiskit-aer-gpu` wheel on PyPI is built for CUDA; for ROCm acceleration on MI250X, Qiskit Aer generally has to be built from source with the ROCm/HIP Thrust backend.
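Before queuing SLURM jobs, it is worth checking whether Aer actually sees a GPU device. A minimal sketch (the helper name `aer_devices` is ours; `AerSimulator.available_devices()` is the Aer API):

```python
def aer_devices():
    """Return Aer's device list, or None if qiskit-aer is not installed."""
    try:
        from qiskit_aer import AerSimulator
    except ImportError:
        return None
    return list(AerSimulator().available_devices())

devices = aer_devices()
if devices is None:
    print("qiskit-aer not installed")
elif "GPU" in devices:
    print("Aer GPU backend available:", devices)
else:
    print("CPU-only Aer build:", devices)
```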
Using the LUMI Provider
Basic registration
```python
from network.node_providers import ClusterBuilder, make_lumi_cluster
from network.distributed_qvm import DistributedQVM, QVMMode

# Simple: 1 LUMI node × 8 GCDs = 8 statevector nodes + 1 MPS node
registry = make_lumi_cluster(
    n_nodes=1,
    gpus_per_node=8,
    lumi_host="lumi",
    lumi_project="project_465002463",
    include_local_fallback=True,  # adds local CPU for small circuits
)

qvm = DistributedQVM(registry, mode=QVMMode.EMULATED, shots=4096)
```
Multi-node LUMI cluster
```python
# 4 LUMI nodes × 8 GCDs = 32 SV nodes (34q each) + 4 MPS nodes (2000q each)
registry = (
    ClusterBuilder()
    .add("lumi_amd_gpu",
         n_nodes=4,
         gpus_per_node=8,
         max_qubits_sv=34,     # statevector per GCD (RAM limited)
         max_qubits_mps=2000,  # MPS per node
         lumi_host="lumi",
         lumi_project="project_465002463")
    .add("local_cpu", n_nodes=4)  # local fallback for tiny circuits
    .build()
)

report = registry.status_report()
print(f"Nodes: {report['total_nodes']}")
print(f"Online: {report['online_nodes']}")
print(f"Total qubits: {report['total_physical_qubits']}")
# Nodes: 40 (32 SV + 4 MPS + 4 local CPU)
# Online: 40
# Total qubits: 9188
```
Inspecting LUMI node details
```python
for node in registry.all_nodes():
    if "lumi" in node.tags:
        print(f"  {node.node_id:30s} {node.node_type.value:15s} {node.max_qubits:5d}q")
```

Output:

```
lumi-n00-gcd0                  gpu_aer            34q
lumi-n00-gcd1                  gpu_aer            34q
...
lumi-n00-gcd7                  gpu_aer            34q
lumi-n00-mps                   gpu_tensor       2000q
lumi-n01-gcd0                  gpu_aer            34q
...
```
Running circuits on LUMI
Currently, LUMI nodes are registered as local Aer nodes, with ROCm credentials stored in `QPUNodeSpec.credentials`; the `NodeExecutor` routes to Aer automatically.
For direct SLURM batch submission (running the actual Aer simulation on LUMI GPUs), use the `CustomRESTProvider` with a lightweight REST wrapper deployed on LUMI:
Option A — Local simulation (current default)
The LUMI nodes in the registry describe the topology. Execution still runs locally using Aer. This is useful for:
- Planning shard assignments as if you had LUMI nodes
- Testing circuit decomposition before actual LUMI submission
- Emulation mode with realistic qubit counts
Option B — LUMI via REST wrapper (recommended for production)
Deploy a minimal FastAPI server on a LUMI login node that forwards circuits to GPU jobs:
```bash
# On LUMI: launch the REST bridge (once per session).
# nohup + output redirection keep uvicorn alive after the ssh session ends.
ssh -f lumi "cd /flash/project_465002463 && \
    source venv_qcos/bin/activate && \
    nohup uvicorn qcos_lumi_bridge:app --host 0.0.0.0 --port 8888 \
        > bridge.log 2>&1 &"

# Open an SSH tunnel from your local machine
ssh -N -L 18888:localhost:8888 lumi &
```
Then register it via custom_rest:
```python
registry = (
    ClusterBuilder()
    .add("custom_rest",
         endpoints=["http://localhost:18888"] * 32,  # 32 virtual workers
         max_qubits=34,
         api_key="lumi-bridge-secret")
    .build()
)
```
See Custom Providers for the bridge server implementation.
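For orientation, the bridge itself can be a single POST route. A minimal sketch of `qcos_lumi_bridge.py`, assuming a `{"qasm": ..., "shots": ...}` request schema and an `x-api-key` header (these are assumptions, not the shipped implementation; the actual Aer hand-off is left as a stub):

```python
# qcos_lumi_bridge.py -- minimal sketch of the REST bridge (assumed schema)
API_KEY = "lumi-bridge-secret"

def check_request(payload: dict, api_key: str) -> str:
    """Validate an execution request; return the QASM body or raise ValueError."""
    if api_key != API_KEY:
        raise ValueError("bad api key")
    qasm = payload.get("qasm")
    if not isinstance(qasm, str) or not qasm.strip():
        raise ValueError("missing 'qasm' field")
    return qasm

try:  # FastAPI wiring; the module still imports without FastAPI for testing
    from fastapi import FastAPI, Header, HTTPException

    app = FastAPI()

    @app.post("/execute")
    def execute(payload: dict, x_api_key: str = Header(default="")):
        try:
            qasm = check_request(payload, x_api_key)
        except ValueError as exc:
            raise HTTPException(status_code=400, detail=str(exc))
        # Hand the circuit to Aer on the allocated GCDs (omitted in this sketch):
        # counts = run_on_aer(qasm, shots=payload.get("shots", 1024))
        return {"status": "queued", "qasm_chars": len(qasm)}
except ImportError:
    app = None
```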
Option C — Direct SLURM batch (via SynapseX pipeline)
Use the existing synapsex_server pipeline infrastructure to submit full training or simulation jobs:
```bash
# From your local machine
./scripts/pipeline/run.sh \
    --stage sft \
    --profile qwen2_5_14b_fast \
    --tag qcos_sim_v1
```
The SLURM job template (`sft.sbatch.j2`) already handles:
- Module loading: `module load LUMI/24.03 partition/G rocm/6.0.3`
- ROCm environment: `HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`
- Distributed launch: `srun --ntasks-per-node=1`
SLURM job for QCOS quantum simulation
For running a QCOS circuit batch directly on LUMI:
```bash
#!/bin/bash
#SBATCH --job-name=qcos_sim
#SBATCH --account=project_465002463
#SBATCH --partition=standard-g
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=56
#SBATCH --time=01:00:00
#SBATCH --output=/scratch/project_465002463/logs/qcos_%j.out

module load LUMI/24.03 partition/G rocm/6.0.3
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True

source /flash/project_465002463/venv_qcos/bin/activate
cd /flash/project_465002463/qcos_server

# Run your QCOS quantum simulation
PYTHONPATH=src python -c "
from network.node_providers import make_local_cluster
from network.distributed_qvm import DistributedQVM, VirtualCircuitSpec, QVMMode
from algorithms.shor import ShorAlgorithm

# On LUMI, local Aer uses ROCm automatically
registry = make_local_cluster(n_gpu_sv_nodes=8, n_gpu_mps_nodes=1)
qvm = DistributedQVM(registry, mode=QVMMode.EMULATED, shots=8192)

algo = ShorAlgorithm(qvm=qvm, shots=8192)
result = algo.run(N=32231)
print(result.to_dict())
"
```
Submit from your local machine:

```bash
rsync -avz src/ lumi:/flash/project_465002463/qcos_server/src/
ssh lumi "sbatch /flash/project_465002463/qcos_server/jobs/qcos_sim.sbatch"
```
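Submission can also be scripted end to end. A small sketch, assuming SSH access as configured above; `parse_job_id` relies only on sbatch's standard `Submitted batch job <id>` output line:

```python
import re
import subprocess

def parse_job_id(sbatch_output: str) -> str:
    """Extract the job id from sbatch's 'Submitted batch job <id>' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise RuntimeError(f"unexpected sbatch output: {sbatch_output!r}")
    return match.group(1)

def submit_qcos_job(host: str = "lumi") -> str:
    """Submit the QCOS sbatch script over SSH and return its SLURM job id."""
    out = subprocess.run(
        ["ssh", host, "sbatch",
         "/flash/project_465002463/qcos_server/jobs/qcos_sim.sbatch"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_job_id(out)
```

The returned id can then be polled with `ssh lumi squeue -j <id>` until the job finishes.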
Capacity by LUMI allocation
| LUMI nodes | GCDs | SV qubits (exact) | MPS qubits (approx.) | GPU-hours/hour |
|---|---|---|---|---|
| 1 | 8 | 272 | 2,000 | 8 |
| 4 | 32 | 1,088 | 8,000 | 32 |
| 8 | 64 | 2,176 | 16,000 | 64 |
| 16 | 128 | 4,352 | 32,000 | 128 |
| 64 | 512 | 17,408 | 128,000 | 512 |
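The rows above follow mechanically from the per-GCD and per-node limits used throughout this page; a quick sketch to reproduce them:

```python
def lumi_capacity(n_nodes: int, gcds_per_node: int = 8,
                  sv_qubits_per_gcd: int = 34, mps_qubits_per_node: int = 2000):
    """Reproduce one row of the capacity table from the per-unit limits."""
    gcds = n_nodes * gcds_per_node
    return {
        "gcds": gcds,
        "sv_qubits": gcds * sv_qubits_per_gcd,      # exact statevector total
        "mps_qubits": n_nodes * mps_qubits_per_node, # approximate MPS total
        "gpu_hours_per_hour": gcds,                  # billed per GCD-hour
    }

for nodes in (1, 4, 8, 16, 64):
    print(nodes, lumi_capacity(nodes))
```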
LUMI GPU-hours are billed against the allocation (roughly €0.01–0.05 per GPU-hour, depending on the access scheme), versus $0.0009–$0.00975 per shot on cloud QPUs. For an 8192-shot run of Shor's algorithm, LUMI works out roughly 100–1000× cheaper than cloud QPUs.
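That comparison is back-of-the-envelope arithmetic over the two price ranges quoted above (actual EuroHPC billing depends on the allocation):

```python
SHOTS = 8192

# Cloud QPU: priced per shot (range quoted above), in USD per job
qpu_cost_low, qpu_cost_high = 0.0009 * SHOTS, 0.00975 * SHOTS

# LUMI: one node (8 GCDs) for one hour, at the quoted EUR/GPU-hour range
lumi_cost_low, lumi_cost_high = 8 * 0.01, 8 * 0.05

print(f"Cloud QPU per job:  ${qpu_cost_low:.2f} - ${qpu_cost_high:.2f}")
print(f"LUMI per node-hour: EUR {lumi_cost_low:.2f} - {lumi_cost_high:.2f}")
# Comparing the ends of both ranges gives ratios from roughly 18x up to ~1000x,
# consistent with the order-of-magnitude claim for typical jobs.
```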
Troubleshooting
SSH connection fails
```bash
# Test with verbose output
ssh -v lumi

# Check the key is loaded
ssh-add -l

# Add the key if missing
ssh-add ~/.ssh/id_ed25519_lumi
```
ROCm not available
```bash
# On a LUMI compute node
module load LUMI/24.03 partition/G rocm/6.0.3
rocm-smi  # should show 8 MI250X GCDs
```
Aer doesn't use GPU
```python
# Check Aer GPU availability
from qiskit_aer import AerSimulator

print(AerSimulator().available_devices())  # should include "GPU"
backend = AerSimulator(method="statevector", device="GPU")
print(backend.configuration().n_qubits)
```

If this fails, Aer was not built with ROCm support. Note that the `qiskit-aer-gpu` wheel on PyPI targets CUDA, so on LUMI Aer typically has to be compiled from source with the ROCm/HIP backend.
Out of memory at 34q
Reduce `max_qubits_sv` to 32 (64 GB, the most a single GCD holds at double precision) or 30 (16 GB):

```python
.add("lumi_amd_gpu", n_nodes=2, gpus_per_node=8, max_qubits_sv=32)
```