{{METATOC 1-5}}

====== Slurm ======
  * Special public partitions:
    * ''debug-cpu''
    * ''public-interactive-gpu''
    * ''public-interactive-cpu''
    * ''public-longrun-cpu''
^ Partition               ^ Max time ^ Max memory ^
| debug-cpu               |          |            |
| public-interactive-gpu  | 4 hours  |            |
| public-interactive-cpu  | 8 hours  | 10GB       |
| public-longrun-cpu      |          |            |
The minimum resource is one core.

N.B.: no ''
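As an illustration, here is a minimal job script that stays within the limits of ''public-interactive-cpu'' shown above (the job name and executable are placeholders):

<code bash>
#!/bin/sh
#SBATCH --job-name interactive-example
#SBATCH --partition public-interactive-cpu
#SBATCH --time 08:00:00        # at most 8 hours on this partition
#SBATCH --mem 10G              # at most 10GB on this partition
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1      # minimum resource: one core

srun ./my_program              # placeholder for your executable
</code>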
^ Partition ^
| private-<group> |
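For example, assuming your group's partition follows the ''private-<group>'' naming pattern shown above (the group name below is hypothetical):

<code bash>
#SBATCH --partition private-mygroup-cpu   # hypothetical: replace with your group's partition
</code>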
- | |||
- | To see details about a given partition, go to the web page https:// | ||
- | If you belong in one of these groups, please contact us to request to have access to the correct partition as we have to manually add you. | ||
- | |||
Example to request three titan cards: ''--gpus=titan:3''

You can find a detailed list of GPUs available on our clusters here:
  * [[http://
===== CPU =====

<WRAP center round important 60%>
You can request all the CPUs of a compute node minus two, which are reserved for the OS. See [[https://
</WRAP>
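For instance, on a hypothetical 64-core compute node (check the hardware documentation for the real core counts), the largest request you could make is:

<code bash>
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 62   # 64 cores minus the 2 reserved for the OS (node size is hypothetical)
</code>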
===== CPU types =====
If you want a list of those specifications,
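One way to list the feature tags advertised by the nodes (this relies on standard ''sinfo'' output formatting; the exact tags on our clusters may differ):

<code console>
(baobab)-[alberta@login2 ~]$ sinfo --format="%N %f"   # node names and their feature/constraint tags
</code>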
===== Single thread vs multi thread vs distributed jobs =====

There are three job categories, each with different needs:

^ Job type            ^ Number of CPUs used                         ^ Examples        ^
| **single threaded** | **one CPU**                                 | Python, plain R |
| **multi threaded**  | **several CPUs on a single node**           | OpenMP programs |
| **distributed**     | **several CPUs, possibly on several nodes** | MPI programs    |

There are also **hybrid** jobs, where each task of the job behaves like a multi-threaded job.
This is not very common and we won't cover this case.

In Slurm, you have two options for requesting CPU resources (see the sketch after this list):

  * ''--ntasks'': the number of tasks (processes); Slurm may spread them across several nodes
  * ''--cpus-per-task'': the number of CPUs allocated to each task; they always sit on the same node
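A minimal sketch of a multi-threaded request (the executable name is a placeholder):

<code bash>
#!/bin/sh
#SBATCH --ntasks 1           # one task (one process)
#SBATCH --cpus-per-task 8    # eight CPUs on the same node for that task

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}   # common pattern for OpenMP programs
srun ./my_threaded_program                       # placeholder executable
</code>

For a distributed (MPI) job you would instead increase ''--ntasks'' and keep ''--cpus-per-task'' at one.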
====== Submitting jobs ======
<code bash>
#SBATCH --output jobname-out.o%j
#SBATCH --ntasks 1           # number of tasks in your job. One by default
#SBATCH --cpus-per-task 1    # number of cpus for each task. One by default
#SBATCH --partition debug-cpu
#SBATCH --time 15:00         # maximum run time
</code>
===== GPGPU jobs =====

When we talk about [[https://
You can see on this table [[hpc:
</code>

Example to request two double precision GPUs:

<code bash>
#!/bin/sh
#SBATCH --partition=shared-gpu
#SBATCH --gpus=2
#SBATCH --constraint=DOUBLE_PRECISION_GPU

srun nvidia-smi
</code>
It's not possible to put two types in the GRES request, but you can ask for a specific compute capability, for example to request any GPU model with compute capability greater than or equal to 7.5:

<code bash>
#!/bin/sh
#SBATCH --partition=shared-gpu
#SBATCH --gpus=1
#SBATCH --constraint="
</code>
Example of script (see also https://
Use reservation via srun:

<code console>
(baobab)-[alberta@login2 ~]# srun --reservation
</code>

Use reservation via script sbatch:
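A minimal sketch of the sbatch variant (the reservation name is a placeholder; use the one communicated to you):

<code bash>
#!/bin/sh
#SBATCH --reservation=my_reservation   # hypothetical reservation name
#SBATCH --ntasks 1

srun ./my_program                      # placeholder executable
</code>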
If you want other information, please see the sacct manpage.

<note tip>By default the command displays a lot of fields.

<code console>
(yggdrasil)-[root@admin1
4 39919765.ba+ 1298188K
</code>
</note>
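To narrow the output down, you can select the fields yourself (these are standard ''sacct'' format fields; the prompt is illustrative):

<code console>
(yggdrasil)-[alberta@login1 ~]$ sacct --format=JobID,JobName,Partition,State,Elapsed,MaxRSS
</code>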
===== Energy usage =====

==== CPUs ====

You can see the energy consumption of your jobs on Yggdrasil (Baobab soon). The energy is shown in Joules using sacct.
<code console>
(yggdrasil)-[root@admin1 state]
------------------- ---------- ------------ -------------- -----------------
2023-10-12T09:48:28 COMPLETED 28478878
2023-10-12T09:48:28 COMPLETED 28478878.ex+
2023-10-12T09:
</code>
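A sketch of how to query the energy field explicitly (''ConsumedEnergy'' is a standard ''sacct'' field; the job ID is taken from the example above):

<code console>
(yggdrasil)-[alberta@login1 ~]$ sacct -j 28478878 --format=JobID,State,Elapsed,ConsumedEnergy
</code>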
==== GPUs ====

If you are interested in the power draw of a GPU, you can monitor it with ''nvidia-smi dmon'' on the GPU node:
+ | < | ||
+ | (baobab)-[root@gpu002 ~]$ nvidia-smi dmon --select p --id 0 | ||
- | You can also check how much CPU time (seconds) you have used on the cluster between since 2019-09-01 : | + | # gpu pwr gtemp mtemp |
- | + | # Idx W C C | |
- | <code console> | + | |
- | [brero@login2 ~]$ sreport cluster AccountUtilizationByUser user=$USER start=2019-09-01 -t Seconds | + | |
- | -------------------------------------------------------------------------------- | + | |
- | Cluster/ | + | |
- | Usage reported in CPU Seconds | + | |
- | -------------------------------------------------------------------------------- | + | |
- | | + | |
- | --------- --------------- --------- --------------- -------- -------- | + | |
- | | + | |
</ | </ | ||
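Since ''dmon'' prints one sample per second by default, you can approximate the energy in Joules by summing the power column over a sampling window (a minimal sketch; the window length and GPU id are placeholders):

<code bash>
# Sample GPU 0 power for 60 seconds, then sum watts x 1s to approximate Joules
nvidia-smi dmon --select p --id 0 --count 60 \
  | awk '$1 ~ /^[0-9]/ {joules += $2} END {print joules " J (approx.)"}'
</code>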
==== spart ====

''spart'' is a user-oriented tool that shows a summary of the partitions and their current usage.