Using CodeCarbon on SLURM (Adastra/ROCm Example)¶

This guide walks through using CodeCarbon on SLURM-based HPC clusters. The examples are specific to the Adastra supercomputer with AMD ROCm GPUs, but the general approach applies to any SLURM cluster with internet-connected login nodes.

For a general approach to running CodeCarbon on any Linux server without HPC complexity, see the Linux Service guide.

Overview¶

This guide shows how to run CodeCarbon on SLURM-based HPC clusters like Adastra (powered by GENCI/CINES). The examples use AMD ROCm GPUs, but the approach applies to any SLURM cluster with internet-connected login nodes.

Prerequisites¶

Access to a SLURM-based HPC cluster
Login node with internet access
Python 3.10+ on the cluster
Compute nodes (may be offline from internet)

Architecture Overview¶

Adastra uses a standard HPC security model:

Login nodes have internet access and are accessible from outside
Compute nodes run your GPU workloads without direct internet access
Python environments are set up on login nodes and shared via network storage
Jobs are submitted from the login node using sbatch

For sites requiring jump hosts (bastion servers), SSH jump (-J) can route through an intermediate server.

The Python environment is set up on the login node and shared with compute nodes via network storage. Jobs are submitted from the login node using sbatch, and the SLURM script loads the environment and runs code on compute nodes.

Debug Partition

If the --time option is less than 30 minutes, the job is placed in the debug partition, which has faster scheduling but shorter maximum runtime.

Setup Steps¶

Step 1: Configure Your Environment Variables¶

Set up environment variables for your HPC configuration. Add these to your .bashrc or .zshrc:

Adapt the following environment variables with your own configuration. You can add them to your .bashrc or .zshrc for convenience.

export BASTION_IP="xx.xx.xx.xx"
export BASTION_USER="username"
export HPC_HOST="xx.xx.fr"
export HPC_PASS="xxxxx"
export PROJECT_ID="xxx"
export USER_NAME="username_hpc"
export HPC_PROJECT_FOLDER="/lus/home/xxx"

Step 2: Connect to the HPC Cluster¶

Connect to your HPC login node:

Using sshpass (automated):

sshpass -p "$HPC_PASS" ssh -J $BASTION_USER@$BASTION_IP $USER_NAME@$HPC_HOST

For first-time connection (debug SSH issues):

ssh -o ServerAliveInterval=60 $BASTION_USER@$BASTION_IP
ssh -o ServerAliveInterval=60 $USER_NAME@$HPC_HOST

Step 3: Copy Your Code to the HPC Cluster¶

sshpass -p "$HPC_PASS" scp -r -J $BASTION_USER@$BASTION_IP /you/folder/* $USER_NAME@$HPC_HOST:$HPC_PROJECT_FOLDER

Step 4: Install CodeCarbon and Dependencies¶

ROCM Compatibility

Install the correct version of amdsmi that matches your ROCM version. For Adastra, use amdsmi==7.0.1 for compatibility with ROCM 6.4.3.

Option A: Simple Installation (Recommended)¶

module load python/3.12
module load rocm/7.0.1

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
# Important: Adastra's MI250 runs ROCm 6.4.3 natively.
# With export ROCM_PATH=/opt/rocm-6.4.3 in our SLURM script, this python wheel perfectly matches the C library without symlink issues!
pip install amdsmi==7.0.1
pip install codecarbon

Option B: Development Installation with PyTorch¶

module load python/3.12
module load rocm/7.0.1
git clone https://github.com/mlco2/codecarbon.git
# If you want a specific version, use git checkout <tag> to switch to the desired version.
git checkout -b feat/rocm
cd codecarbon
python -m venv .venv
source .venv/bin/activate
python -V
# Must be 3.12.x
pip install --upgrade pip
# Important: Adastra's MI250 runs ROCm 6.4.3 natively.
# With export ROCM_PATH=/opt/rocm-6.4.3 in our SLURM script, this python wheel perfectly matches the C library without symlink issues!
pip install amdsmi==7.0.1
# Look at https://download.pytorch.org/whl/torch/ for the correct version matching your Python (cp312) and ROCM version.
# torch-2.10.0+rocm7.0-cp312-cp312-manylinux_2_28_x86_64.whl
pip3 install torch==2.10.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.0
pip install numpy

# Install CodeCarbon in editable mode to allow for live code changes without reinstallation
pip install -e .

Step 5: Development Workflow¶

You can code on the login Node, but we suggest to do the development on your local machine and then push the code to a repository (e.g., GitHub) and pull it from the login node. This way you avoid losing code and keep tracks of the changes.

Code locally on your machine and push to a repository (GitHub, etc.)
Pull on the login node to avoid losing work
Activate the environment after each login:

cd codecarbon
git pull
source .venv/bin/activate

Step 6: Submit a Job¶

Submit your CodeCarbon job to the SLURM scheduler:

Use sbatch to submit your job script:

sbatch examples/slurm_rocm/run_codecarbon_pytorch.slurm

Step 7: Monitor Job Status¶

Monitor your job execution:

# View all running jobs
squeue -u $USER

# View specific job output
tail -f logs/<job_id>.out

# View job details
sinfo

Troubleshooting¶

Error: AMD GPU detected but amdsmi is not properly configured¶

[codecarbon WARNING @ 10:28:46] AMD GPU detected but amdsmi is not properly configured.
Error: /opt/rocm/lib/libamd_smi.so: undefined symbol: amdsmi_get_cpu_affinity_with_scope

Solution: You have a version mismatch between amdsmi Python package and ROCM. Install the correct version:

# For ROCM 7.0.1:
pip install amdsmi==7.0.1

# Ensure Python version matches your requirements (3.12 for Adastra)
python -V

Error: KeyError 'ROCM_PATH'¶

This means the ROCm module is not loaded. Load it before running your job:

module load rocm/7.0.1

Next Steps¶

View your emissions results on the CodeCarbon dashboard
Configure CodeCarbon for different measurement intervals
Explore other deployment options for non-HPC systems

Limitations and Future Work¶

The AMD Instinct MI250 accelerator card contains two Graphics Compute Dies (GCDs) per physical card. However, when monitoring energy consumption (e.g., via rocm-smi or tools like CodeCarbon), only one GCD reports power usage, while the other shows zero values. This is problematic for accurate energy accounting, especially in HPC/SLURM environments where jobs may be allocated a single GCD.

So in that case we display a warning.

In a future work we will use average_gfx_activity to estimate the corresponding power of both GCDs, and provide an estimation instead of 0.

Documentation¶

Annex: Example of Job Details with scontrol¶

This trace was obtained to adapt codecarbon/core/util.py to properly parse the SLURM job details and extract the relevant information about GPU and CPU allocation.

[$PROJECT_ID] $USER_NAME@login5:~/codecarbon$ scontrol show job  4687018
JobId=4687018 JobName=codecarbon-test
   UserId=$USER_NAME(xxx) GroupId=grp_$USER_NAME(xxx) MCS_label=N/A
   Priority=900000 Nice=0 Account=xxxxxx QOS=debug
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:24 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2026-03-02T17:12:49 EligibleTime=2026-03-02T17:12:49
   AccrueTime=2026-03-02T17:12:49
   StartTime=2026-03-02T17:12:49 EndTime=2026-03-02T17:13:13 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-03-02T17:12:49 Scheduler=Main
   Partition=mi250-shared AllocNode:Sid=login5:2553535
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=g1341
   BatchHost=g1341
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:1
   ReqTRES=cpu=8,mem=29000M,node=1,billing=8,gres/gpu=1
   AllocTRES=cpu=16,mem=29000M,energy=10211,node=1,billing=16,gres/gpu=1,gres/gpu:mi250x=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
   MinCPUsNode=8 MinMemoryNode=29000M MinTmpDiskNode=0
   Features=MI250&DEBUG DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/lus/home/CT6/$PROJECT_ID/$USER_NAME/codecarbon/run_codecarbon.sh
   WorkDir=/lus/home/CT6/$PROJECT_ID/$USER_NAME/codecarbon
   AdminComment=Accounting=1 
   StdErr=/lus/home/CT6/$PROJECT_ID/$USER_NAME/codecarbon/logs/4687018.err
   StdIn=/dev/null
   StdOut=/lus/home/CT6/$PROJECT_ID/$USER_NAME/codecarbon/logs/4687018.out
   TresPerNode=gres/gpu:1
   TresPerTask=cpu=8