{{METATOC 1-5}}
====== Slurm ======
  * Special public partitions:
    * ''debug-cpu''
    * ''public-interactive-gpu''
    * ''public-interactive-cpu''
    * ''public-longrun-cpu''
^ Partition               ^ Max time   ^ Max memory ^
| debug-cpu               |            |            |
| public-interactive-gpu  | 4 hours    |            |
| public-interactive-cpu  | 8 hours    | 10GB       |
| public-longrun-cpu      |            |            |
The minimum resource you can request is one core.

N.B.: no ''...''
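As a minimal illustration of such a request (the partition name is the ''debug-cpu'' entry from the table above; the command itself is only an example):

<code console>
# request the minimum: one task with one core on the debug-cpu partition
srun --partition=debug-cpu --ntasks=1 --cpus-per-task=1 hostname
</code>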
Example to request three titan cards: ''--gres=gpu:titan:3''

You can find a detailed list of the GPUs available on our clusters here:
  * [[http://...]]
===== CPU =====

<WRAP center round important 60%>
You can request all the CPUs of a compute node minus two, which are reserved for the OS. See [[https://...]].
</WRAP>
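As a sketch only: on a hypothetical 128-core node this means you can ask for at most 126 cores (the core count and program name below are illustrative, not taken from this page):

<code bash>
#!/bin/sh
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=126   # hypothetical 128-core node minus the 2 cores reserved for the OS

srun ./my_threaded_app        # placeholder for your own multi-threaded program
</code>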
===== CPU types ===== | ===== CPU types ===== | ||
If you want a list of those specifications, ...

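One generic way to list the features advertised by each node (a standard Slurm command, not specific to this page) is:

<code console>
# print node names together with their available features (CPU type, GPU model, ...)
sinfo -o "%30N %f"
</code>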
===== Single thread vs multi thread vs distributed jobs =====

There are three job categories, each with different needs:

^ Job type            ^ Number of CPUs used                       ^ Examples            ^
| **single threaded** | **one CPU**                               | Python, plain R     |
| **multi threaded**  | **several CPUs on a single node**         | OpenMP applications |
| **distributed**     | **CPUs spread over one or several nodes** | MPI applications    |

There are also **hybrid** jobs, where each task behaves like a multi-threaded job.
This is not very common and we won't cover this case here.

In Slurm, you have two options for requesting CPU resources, as shown in the sketch below:

  * ''--ntasks <n>'' to request ''n'' tasks (one CPU each by default), which can be spread over one or several nodes: this is what distributed jobs need
  * ''--cpus-per-task <n>'' to request ''n'' CPUs on the same node for each task: this is what multi-threaded jobs need

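A minimal sketch contrasting the two cases (the program names are placeholders, not part of this page):

<code bash>
#!/bin/sh
# multi-threaded job: one task using 8 CPUs on a single node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

srun ./my_openmp_app   # placeholder for a multi-threaded program
</code>

<code bash>
#!/bin/sh
# distributed job: 16 tasks of one CPU each, possibly spread over several nodes
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1

srun ./my_mpi_app      # placeholder for an MPI program
</code>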
====== Submitting jobs ======
#SBATCH --output jobname-out.o%j
#SBATCH --ntasks 1 # number of tasks in your job. One by default
#SBATCH --cpus-per-task 1 # number of cpus for each task. One by default
#SBATCH --partition debug-cpu
#SBATCH --time 15:00 # maximum run time.
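Once saved in a file (the name ''myjob.sh'' here is just an example), the script is submitted and monitored with the usual Slurm commands:

<code console>
# submit the batch script; sbatch prints the job id on success
sbatch myjob.sh

# list your own pending and running jobs
squeue -u $USER
</code>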
===== GPGPU jobs =====
When we talk about [[https://...|GPGPU]], ...

You can see in this table [[hpc:...]] the characteristics of each GPU model, in particular:

  * on-board memory in GB
  * single precision vs double precision floating-point performance
  * compute capability

Specify the GPU memory you need. For example, to request one GPU with at least 10 GB of memory:

<code console>
srun --gres=gpu:...
</code>

If you just need a GPU and you don't care about the type, don't specify it: you'll get the lowest model available.

<code bash>
#SBATCH --gpus=1
</code>
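The same kind of request can also be made interactively; a minimal sketch (using the ''shared-gpu'' partition that appears in the examples below):

<code console>
# ask for any single GPU and show which one was allocated
srun --partition=shared-gpu --gpus=1 nvidia-smi
</code>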

Example to request two GPUs of a double precision model:

<code bash>
#!/bin/sh
#SBATCH --partition=shared-gpu
#SBATCH --gpus=2
#SBATCH --constraint=DOUBLE_PRECISION_GPU

srun nvidia-smi
</code>

It's not possible to put two GPU types in the GRES request, but you can ask for a specific compute capability. For example, to request any GPU model with a compute capability greater than or equal to 7.5:

<code bash>
#!/bin/sh
#SBATCH --partition=shared-gpu
#SBATCH --gpus=1
#SBATCH --constraint="..."
</code>
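The exact constraint string is truncated above. As an illustration only, Slurm lets you OR several node features with ''|'', so a "compute capability 7.5 or newer" request could be expressed along these lines (the feature names are hypothetical, check the cluster's real node features):

<code bash>
# hypothetical feature names, shown only to illustrate the OR syntax of --constraint
#SBATCH --constraint="COMPUTE_CAPABILITY_7_5|COMPUTE_CAPABILITY_8_0"
</code>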
Example of script (see also https://...):
In this case, this means that node gpu002 has three Titan cards, and all of them are allocated.
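If you want to check this kind of allocation yourself, a generic way (standard Slurm, not specific to this page) is to inspect the node record:

<code console>
# show the GPUs configured on gpu002 and the resources currently allocated on it
scontrol show node gpu002 | grep -Ei "gres|alloctres"
</code>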
===== Interactive jobs =====
  * [[https://...]]

====== Reservation ======

To request a reservation, ...

List the existing reservations:

  (baobab)-[alberta@login2 ~]# scontrol show res

Use a reservation with ''srun'':

  (baobab)-[alberta@login2 ~]# srun --reservation <reservation name> ...

Use a reservation in an ''sbatch'' script with ''#SBATCH --reservation <reservation name>'':

  #!/bin/bash
  #SBATCH --job-name=test_unitaire
  #SBATCH --reservation test

  srun hostname
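To check which jobs are attached to a reservation (a standard Slurm command; the reservation name ''test'' is the one used in the script above):

<code console>
# list only the jobs running or pending in the reservation named "test"
squeue --reservation=test
</code>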
====== Job monitoring ======
If you want other information, please see the sacct manpage.

<note tip>By default the command displays a lot of fields. ...

<code console>
(yggdrasil)-[root@admin1 ...]$ sacct ...
...
 4 39919765.ba+   1298188K ...
</code>
</note>
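Since the exact ''sacct'' command is truncated above, here is a generic sketch showing how to restrict the output to a few fields (the job id is the one from the output fragment):

<code console>
# compact summary of one past job: id, state, elapsed time and peak memory
sacct -j 39919765 --format=JobID,State,Elapsed,MaxRSS
</code>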

===== Energy usage =====

==== CPUs ====

You can see the energy consumption of your jobs on Yggdrasil (and soon on Baobab). The energy is reported in Joules by ''sacct''.

<code console>
(yggdrasil)-[root@admin1 state]$ sacct ...
------------------- ---------- ------------ -------------- -----------------
2023-10-12T09:48:28  COMPLETED  28478878     ...
2023-10-12T09:48:28  COMPLETED  28478878.ex+ ...
2023-10-12T09:48:28  ...
</code>
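The command producing the output above is truncated in the page source. A generic sketch that queries the energy accounting field for a finished job (the job id is taken from the output above) could be:

<code console>
# ConsumedEnergy is reported in Joules when energy accounting is enabled
sacct -j 28478878 --format=End,State,JobID,ConsumedEnergy
</code>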
<note important>...</note>
==== GPUs ====

If you are interested in the power drawn by a GPU, you can monitor it with ''nvidia-smi dmon'' on the node where the GPU is located:

<code console>
(baobab)-[root@gpu002 ~]$ nvidia-smi dmon --select p --id 0
# gpu    pwr  gtemp  mtemp
# Idx      W      C      C
</code>