hpc:slurm — Differences

This shows you the differences between two versions of the page.

Old revision: hpc:slurm [2022/09/29 12:58] – [GPGPU jobs] Yann Sagon
New revision: hpc:slurm [2025/07/25 10:18] (current) – [spart] Yann Sagon

Line 1 → Line 1:
| - | < | + | {{METATOC 1-5}} |
| + | ====== | ||
| < | < | ||
Line 67 → Line 68:
* Special public partitions:
* ''…
- * ''…
+ * ''…
* ''…
* ''…
Line 112 → Line 113:
^ Partition …
|debug-cpu …
- |debug-gpu |15 Minutes …
+ |public-interactive-gpu |4 hours …
|public-interactive-cpu |8 hours |10GB |
|public-longrun-cpu …
Line 125 → Line 126:
Minimum resource is one core.
- N.B.: no ''…
+ N.B.: no ''…
Line 136 → Line 137:
^ Partition …
| private-<…
-
- To see details about a given partition, go to the web page https://…
- If you belong to one of these groups, please contact us to request access to the correct partition, as we have to add you manually.
-
Line 205 → Line 202:
Example to request three titan cards: ''<…
+
You can find a detailed list of GPUs available on our clusters here:
Line 216 → Line 214:
* [[http://…
+ ===== CPU =====
+ <WRAP center round important 60%>
+ You can request all the CPUs of a compute node minus two that are reserved for the OS. See [[https://…
+ </WRAP>
===== CPU types =====
Line 246 → Line 247:
If you want a list of those specifications, …
| + | |||
| + | ===== Single thread vs multi thread vs distributed jobs ===== | ||
| + | |||
| + | There are three job categories each with different needs: | ||
| + | |||
| + | ^Job type ^ Number of cpu used ^ Examples | ||
| + | | **single threaded** | **one CPU** | Python, plain R | - | | ||
| + | | **multi threaded** | ||
| + | | **distributed** | ||
| + | |||
| + | |||
| + | There are also **hybrid** jobs, where each tasks of such a job behave like a multi-threaded job. | ||
| + | This is not very common and we won't cover this case. | ||
| + | |||
| + | In slurm, you have two options for asking CPU resources: | ||
| + | |||
| + | * ''< | ||
| + | * ''< | ||
| + | |||
| + | |||
====== Submitting jobs ======
Line 285 → Line 306:
#SBATCH --output jobname-out.o%j
#SBATCH --ntasks 1 # number of tasks in your job. One by default
- #SBATCH --cpus-per-task 1 # number of cpus in your job. One by default
+ #SBATCH --cpus-per-task 1 # number of cpus for each task. One by default
#SBATCH --partition debug-cpu
#SBATCH --time 15:00 # maximum run time.
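Put together, a complete minimal submission script along the lines of this excerpt could look as follows; the job name, output file and ''srun hostname'' payload are illustrative, not prescribed by this page:

<code bash>
#!/bin/sh
#SBATCH --job-name=myjob              # illustrative job name
#SBATCH --output=jobname-out.o%j      # %j is replaced by the job ID
#SBATCH --ntasks=1                    # one task
#SBATCH --cpus-per-task=1             # one CPU for that task
#SBATCH --partition=debug-cpu
#SBATCH --time=15:00                  # 15 minutes maximum run time

srun hostname                         # replace with your real program
</code>

The script would then be submitted with ''sbatch myscript.sh''.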
Line 439 → Line 460:
===== GPGPU jobs =====
- When we talk about [[https://…
+ When we talk about [[https://…
You can see on this table [[hpc:…
Line 477 → Line 498:
#SBATCH --partition=shared-gpu
#SBATCH --gpus=1
- #SBATCH --constraint="…
+ #SBATCH --constraint="…
</…
Line 656 → Line 677:
Use reservation via srun:
- (baobab)-[alberta@login2 ~]# srun --partition …
+ (baobab)-[alberta@login2 ~]# srun --reservation …

Use reservation via script sbatch:
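A hedged sketch of both ways of using a reservation; ''<reservation_name>'' is a placeholder for the name you were given:

<code bash>
# Interactive use via srun:
srun --reservation=<reservation_name> hostname

# Batch use: add the option to the #SBATCH header of your script:
#SBATCH --reservation=<reservation_name>
</code>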
Line 743 → Line 764:
If you want other information, please see the sacct manpage.
-
- ===== Job history =====
- You can see your job history using ''…
+ <note tip>by default the command displays a lot of fields. …
<…
- [sagon@master …
- …
- ------------ ---------- ---------- ---------- ---------- ---------- --------
- 45517641 …
- 45517641.ba+ …
- 45517641.ex+ extern …
- 45517641.0 …
- 45518119 …
- 45518119.ba+ …
- 45518119.ex+ …
+ (yggdrasil)-[root@admin1 …
+ …
+ 4 39919765.ba+ 1298188K …
+ …
</…
+ </note>
+ ===== Energy usage =====
+ ==== CPUs ====
+ You can see the energy consumption of your jobs on Yggdrasil (Baobab soon). The energy is shown in Joules using sacct.
- ===== Report and statistics with sreport =====
-
- To get reporting about your past jobs, you can use ''…
-
- Here are some examples that can give you a starting point:
-
- To get the number of jobs you ran (you <=> ''…
-
- <code console>
- [brero@login2 ~]$ sreport job sizesbyaccount user=$USER PrintJobCount start=2018-01-01 end=2019-01-01
-
- --------------------------------------------------------------------------------
- Job Sizes 2018-01-01T00:00:00 - 2018-12-31T23:59:59 (31536000 secs)
- Units are in number of jobs ran
- --------------------------------------------------------------------------------
- Cluster …
- --------- --------- ------------- ------------- ------------- ------------- ------------- ------------
- …
+ <…
+ (yggdrasil)-[root@admin1 state] …
+ ------------------- ---------- ------------ -------------- -----------------
+ 2023-10-12T09:48:28 COMPLETED 28478878 …
+ 2023-10-12T09:48:28 COMPLETED 28478878.ex+ …
+ 2023-10-12T09:…
</…
+ <note important>…
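The exact command of the example above is truncated in this diff; a hedged sketch of how such a query could look, using the ''ConsumedEnergy'' field of ''sacct'' (reported in Joules when energy accounting is enabled):

<code console>
$ sacct -j <jobid> --format=JobID,State,Start,Elapsed,ConsumedEnergy
</code>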
- You can see how many jobs were run (grouped …
- <…
- You can also check how much CPU time (in seconds) you have used on the cluster since 2019-09-01:
-
- <code console>
- [brero@login2 ~]$ sreport cluster AccountUtilizationByUser user=$USER start=2019-09-01 -t Seconds
- --------------------------------------------------------------------------------
- Cluster/…
- Usage reported in CPU Seconds
- --------------------------------------------------------------------------------
- …
- --------- --------------- --------- --------------- -------- --------
- …
+ ==== GPUs ====
+ If you are interested …
+ <…
+ (baobab)-[root@gpu002 ~]$ nvidia-smi dmon --select p --id 0
+ # gpu pwr gtemp mtemp …
+ # Idx W C C …
</…
- In this example, we added the time ''…
- Please note:
- * By default, the CPU time is in minutes
- * It takes up to an hour for Slurm to update this information in its database, so be patient
- * If you don't specify a start or an end date, yesterday'…
- * The CPU time is the time that was allocated to you. It doesn'…
-
- Tip: If you absolutely need a report including your job that ran on the same day, you can override the default end date by forcing tomorrow'…
-
- <…
- sreport cluster AccountUtilizationByUser user=$USER start=2019-09-01 end=$(date --date="…
- </…
Line 816 → Line 807:
==== spart ====
+
+ <note warning>…
''…