{{METATOC 1-5}}

====== Slurm ======
  * Special public partitions:
    * ''debug-cpu''
    * ''public-interactive-gpu''
    * ''public-interactive-cpu''
    * ''public-longrun-cpu''
^ Partition               ^ Time limit ^ Memory limit ^
| debug-cpu               |            |              |
| public-interactive-gpu  | 4 hours    |              |
| public-interactive-cpu  | 8 hours    | 10GB         |
| public-longrun-cpu      |            |              |
The minimum resource you can request is one core.

N.B.: no ''...''
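
As a purely illustrative sketch (this command is not from the original page; the partition, time and memory values are examples chosen to fit the limits in the table above), the smallest possible request looks like this:

<code console>
(baobab)-[user@login2 ~]$ srun --partition=public-interactive-cpu --ntasks=1 --cpus-per-task=1 --time=01:00:00 --mem=1000 hostname
</code>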
Example to request three titan cards:
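
The command itself is cut off in the source; a plausible form, assuming the GPU type is exposed to Slurm as ''titan'', would be:

<code console>
(baobab)-[user@login2 ~]$ srun --partition=shared-gpu --gpus=titan:3 nvidia-smi
</code>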

You can find a detailed list of GPUs available on our clusters here:
  * [[http://
===== CPU =====
<WRAP center round important 60%>
You can request all the CPUs of a compute node minus two, which are reserved for the OS. See [[https://
</WRAP>
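
A hedged illustration of the rule above; the partition name and the 64-core node are examples only, not actual cluster values:

<code console>
# Check how many CPUs the nodes of a partition advertise (%c = CPUs per node).
(baobab)-[user@login2 ~]$ sinfo --partition=shared-cpu --format="%N %c"
# On a hypothetical 64-core node you can therefore request at most 62 CPUs:
(baobab)-[user@login2 ~]$ srun --partition=shared-cpu --ntasks=1 --cpus-per-task=62 hostname
</code>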
===== CPU types =====
If you want a list of those specifications:
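
The original command is cut off; one hedged way to obtain such a list is to ask Slurm for the features advertised by the nodes (the node name below is a placeholder, and feature names are cluster specific):

<code console>
# Show the features (CPU type tags) advertised by each node.
(baobab)-[user@login2 ~]$ sinfo --Node --format="%N %f"
# Or inspect a single node; "cpu001" is a placeholder node name.
(baobab)-[user@login2 ~]$ scontrol show node cpu001 | grep AvailableFeatures
</code>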

===== Single thread vs multi thread vs distributed jobs =====

There are three job categories, each with different needs:

^ Job type            ^ Number of CPUs used                         ^ Examples        ^
| **single threaded** | **one CPU**                                 | Python, plain R |
| **multi threaded**  | **several CPUs** on a single node           |                 |
| **distributed**     | **several CPUs**, possibly on several nodes |                 |

There are also **hybrid** jobs, where each task of such a job behaves like a multi-threaded job.
This is not very common and we won't cover this case.

In Slurm, you have two options for requesting CPU resources; a short sketch of both follows this list:

  * ''<nowiki>--ntasks=n</nowiki>'' : request ''n'' tasks (independent processes), which is what a distributed job needs
  * ''<nowiki>--cpus-per-task=n</nowiki>'' : request ''n'' CPUs for each task, which is what a multi-threaded job needs
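
A minimal sketch of both variants; the CPU counts and program names are placeholders, not recommendations:

<code bash>
#!/bin/sh
# Multi-threaded job: a single task that may use 8 CPUs on one node.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_threaded_program    # placeholder executable
</code>

<code bash>
#!/bin/sh
# Distributed job: 16 single-CPU tasks, possibly spread over several nodes.
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1

srun ./my_mpi_program         # placeholder executable
</code>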
====== Submitting jobs ======
===== GPGPU jobs =====
When we talk about [[https://
You can see on this table [[hpc:
  #SBATCH --partition=shared-gpu
  #SBATCH --gpus=1
  #SBATCH --constraint="..."
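
For completeness, here is a hedged sketch of a whole submission script built around the fragment above; the job name, time limit, constraint value and module name are illustrative, not values from the original page:

<code bash>
#!/bin/sh
#SBATCH --job-name=gpu-test                    # illustrative job name
#SBATCH --partition=shared-gpu
#SBATCH --time=00:15:00                        # illustrative time limit
#SBATCH --gpus=1
#SBATCH --constraint="DOUBLE_PRECISION_GPU"    # illustrative feature name, check the GPU table

module load CUDA                               # example module, adapt to your software
srun nvidia-smi                                # replace with your GPU application
</code>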
Use reservation via srun:

  (baobab)-[alberta@login2 ~]# srun --reservation ...

Use reservation via script sbatch:
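
Both truncated examples follow the same pattern; in this sketch ''my_resa'' stands for the reservation name you were given and the partition is only an example:

<code console>
(baobab)-[alberta@login2 ~]$ srun --reservation=my_resa --partition=shared-cpu --ntasks=1 hostname
</code>

<code bash>
#!/bin/sh
#SBATCH --reservation=my_resa     # placeholder reservation name
#SBATCH --partition=shared-cpu    # example partition
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

srun hostname                     # replace with your application
</code>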
If you want other information, please see the sacct manpage.

<note tip>By default the command displays a lot of fields.

<code console>
(yggdrasil)-[root@admin1 ~]$ sacct ...
4 39919765.ba+ 1298188K
</code>
</note>
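
The exact command is cut off in the snippet above; as a hedged illustration, the usual way to restrict the output is the ''<nowiki>--format</nowiki>'' option (the job id is a placeholder, the field names are standard sacct fields):

<code console>
sacct --jobs=39919765 --format=JobID,JobName,State,Elapsed,MaxRSS,NCPUS
</code>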

===== Energy usage =====
==== CPUs ====
You can see the energy consumption of your jobs on Yggdrasil (Baobab soon). The energy is shown in Joules using sacct.

<code console>
(yggdrasil)-[root@admin1 state]$ sacct ...
------------------- ---------- ------------ -------------- -----------------
2023-10-12T09:48:28  COMPLETED      28478878
2023-10-12T09:48:28  COMPLETED  28478878.ex+
2023-10-12T09:...
</code>
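
The command producing the output above is cut off; a hedged equivalent using the standard ''ConsumedEnergy'' field (the job id is a placeholder) would be:

<code console>
sacct --jobs=28478878 --format=Start,State,JobID,Elapsed,ConsumedEnergy
</code>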

==== GPUs ====
If you are interested in the power usage of the GPUs, you can check it directly on the compute node with ''nvidia-smi'':

<code console>
(baobab)-[root@gpu002 ~]$ nvidia-smi dmon --select p --id 0
# gpu    pwr  gtemp  mtemp
# Idx      W      C      C
</code>
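
To take such a reading while your own job is running, one hedged approach on recent Slurm versions is to step into the job's allocation; the job id is a placeholder:

<code console>
(baobab)-[user@login2 ~]$ srun --jobid=12345678 --overlap nvidia-smi dmon --select p -c 5
</code>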