hpc:slurm
  * Special public partitions:
    * ''
    * ''
    * ''
    * ''
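If you are unsure which partitions you can submit to, the standard Slurm command ''sinfo'' lists them together with their time limits and node states. A minimal sketch (the prompt and partition name are only illustrative):

<code console>
# one summary line per partition
(baobab)-[alberta@login1 ~]$ sinfo --summarize

# details for a single partition, e.g. debug-cpu
(baobab)-[alberta@login1 ~]$ sinfo --partition=debug-cpu
</code>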
^ Partition
|debug-cpu
|public-interactive-gpu |4 hours
|public-interactive-cpu |8 hours |10GB |
|public-longrun-cpu
Minimum resource is one core.

N.B.: no ''
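For example, a minimal batch script requesting a single core (the minimum resource) on the ''debug-cpu'' partition could look like the sketch below; the job name, time limit and command are illustrative:

<code bash>
#!/bin/sh
#SBATCH --job-name=minimal_test    # illustrative job name
#SBATCH --partition=debug-cpu      # one of the public partitions listed above
#SBATCH --ntasks=1                 # a single task...
#SBATCH --cpus-per-task=1          # ...on a single core
#SBATCH --time=00:10:00            # must fit within the partition time limit

srun hostname                      # replace with your real command
</code>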
^ Partition
| private-<

To see details about a given partition, go to the web page https://

If you belong to one of these groups, please contact us to request access to the correct partition, as we have to add you manually.
Example to request three titan cards: ''<
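With recent Slurm versions, a request for three cards of a given GPU type is usually written with the ''--gpus'' option; a sketch, assuming the type name ''titan'' matches what the cluster advertises:

<code bash>
#SBATCH --partition=shared-gpu
#SBATCH --gpus=titan:3    # three GPUs of type "titan"
</code>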
| + | |||
| You can find a detailed list of GPUs available on our clusters here : | You can find a detailed list of GPUs available on our clusters here : | ||
  * [[http://
| + | ===== CPU ===== | ||
| + | <WRAP center round important 60%> | ||
| + | You can request all the CPUs of a compute node minus two that are reserved for the OS. See [[https:// | ||
| + | </ | ||
===== CPU types =====
===== GPGPU jobs =====

When we talk about [[https://

You can see on this table [[hpc:
#SBATCH --partition=shared-gpu
#SBATCH --gpus=1
#SBATCH --constraint="
</code>
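Putting it together, a minimal GPU batch script might look like the sketch below. The module name, executable and ''--constraint'' value are placeholders; the constraint must be one of the GPU features actually advertised by the cluster:

<code bash>
#!/bin/sh
#SBATCH --partition=shared-gpu
#SBATCH --time=00:15:00
#SBATCH --gpus=1                         # one GPU of any type
##SBATCH --constraint="SOME_GPU_FEATURE" # optional: uncomment and set a real feature name to restrict the GPU model

module load CUDA                         # illustrative module name
srun ./my_gpu_program                    # replace with your real executable
</code>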
Use reservation via srun:

(baobab)-[alberta@login2 ~]# srun --reservation

Use reservation via script sbatch:
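A sketch of the sbatch variant; ''my_resa'' stands for whatever reservation name was communicated to you, and the partition and command are illustrative:

<code bash>
#!/bin/sh
#SBATCH --reservation=my_resa    # hypothetical reservation name
#SBATCH --partition=shared-cpu   # illustrative partition
#SBATCH --time=01:00:00

srun ./my_program                # replace with your real command
</code>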
If you want other information, please see the sacct manpage.
| + | <note tip>by default the command displays a lot of fields. You can use this trick to display them correctly. Then you can move with left and right arrows to see the remaining fields | ||
| + | < | ||
| + | (yggdrasil)-[root@admin1 ~]$ sstat -j 39919765 --all | less -#2 -N -S | ||
| + | 1 JobID | ||
| + | 2 ------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- -------> | ||
| + | 3 39919765.ex+ | ||
| + | 4 39919765.ba+ | ||
| + | </ | ||
| + | |||
| + | </ | ||
===== Energy usage =====

==== CPUs ====

You can see the energy consumption of your jobs on Yggdrasil (Baobab soon). The energy is shown in Joules using sacct.
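A sketch of such a query; ''ConsumedEnergy'' is a standard sacct field and the job ID is only an example:

<code console>
(yggdrasil)-[alberta@login1 ~]$ sacct -j 12345678 --format=JobID,Elapsed,ConsumedEnergy
</code>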
<note important>

==== GPUs ====

If you are interested in the power usage of a GPU card your job is using, you can issue the following command while your job is running on a GPU node:

<code>
(baobab)-[root@gpu002
# gpu   pwr gtemp mtemp
# Idx     W     C     C
    0    62
</code>

===== Job history =====

You can see your job history using ''

<code>
[sagon@master ~] $ sacct -u $USER -S 2021-04-01
------------ ---------- ---------- ---------- ---------- ---------- --------
45517641
45517641.ba+
45517641.ex+
45517641.0
45518119
45518119.ba+
45518119.ex+
</code>

===== Report and statistics with sreport =====

To get reporting about your past jobs, you can use ''

Here are some examples that can give you a starting point:

To get the number of jobs you ran (you <=> ''

<code console>
[brero@login2
--------------------------------------------------------------------------------
Job Sizes 2018-01-01T00:
Units are in number of jobs ran
--------------------------------------------------------------------------------
--------- --------- ------------- ------------- ------------- ------------- ------------- ------------
</code>
| - | You can see how many jobs were run (grouped by allocated CPU). You can also see we specified an extra day for the //end date// '' | ||
| - | < | ||
| - | You can also check how much CPU time (seconds) you have used on the cluster between since 2019-09-01 : | ||
| - | |||
| - | <code console> | ||
| - | [brero@login2 ~]$ sreport cluster AccountUtilizationByUser user=$USER start=2019-09-01 -t Seconds | ||
| - | -------------------------------------------------------------------------------- | ||
| - | Cluster/ | ||
| - | Usage reported in CPU Seconds | ||
| - | -------------------------------------------------------------------------------- | ||
| - | Cluster | ||
| - | --------- --------------- --------- --------------- -------- -------- | ||
| - | | ||
| - | </ | ||
| - | |||
| - | In this example, we added the time '' | ||
| - | |||
| - | Please note : | ||
| - | * By default, the CPU time is in Minutes | ||
| - | * It takes up to an hour for Slurm to upate this information in its database, so be patient | ||
| - | * If you don't specify a start, nor an end date, yesterday' | ||
| - | * The CPU time is the time that was allocated to you. It doesn' | ||
| - | |||
| - | Tip : If you absolutely need a report including your job that ran on the same day, you can override the default end date by forcing tomorrow' | ||
| - | |||
| - | < | ||
| - | sreport cluster AccountUtilizationByUser user=$USER start=2019-09-01 end=$(date --date=" | ||
| - | </ | ||
==== spart ====

<note warning>

''