{{METATOC 1-5}}

====== Slurm ======
  * Special public partitions:
    * ''debug-cpu''
    * ''public-interactive-gpu''
    * ''public-interactive-cpu''
    * ''public-longrun-cpu''
^ Partition               ^ Max time ^ Max memory ^
| debug-cpu               |          |            |
| public-interactive-gpu  | 4 hours  |            |
| public-interactive-cpu  | 8 hours  | 10GB       |
| public-longrun-cpu      |          |            |
The minimum resource is one core.

N.B.: no ''
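As an illustration, here is a minimal job script that stays within the limits of ''public-interactive-cpu'' shown above (the job name and executable are placeholders):

<code bash>
#!/bin/sh
#SBATCH --job-name interactive-example
#SBATCH --partition public-interactive-cpu
#SBATCH --time 08:00:00        # at most 8 hours on this partition
#SBATCH --mem 10G              # at most 10GB on this partition
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1      # minimum resource: one core

srun ./my_program              # placeholder for your executable
</code>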
^ Partition ^
| private-<group> |
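For example, assuming your group's partition follows the ''private-<group>'' naming pattern shown above (the group name below is hypothetical):

<code bash>
#SBATCH --partition private-mygroup-cpu   # hypothetical: replace with your group's partition
</code>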
- | |||
- | To see details about a given partition, go to the web page https:// | ||
- | If you belong in one of these groups, please contact us to request to have access to the correct partition as we have to manually add you. | ||
- | |||
Example to request three titan cards: ''--gpus=titan:3''

You can find a detailed list of GPUs available on our clusters here:
  * [[http://
===== CPU =====

<WRAP center round important 60%>
You can request all the CPUs of a compute node minus two, which are reserved for the OS. See [[https://
</WRAP>
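For instance, on a hypothetical 64-core compute node (check the hardware documentation for the real core counts), the largest request you could make is:

<code bash>
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 62   # 64 cores minus the 2 reserved for the OS (node size is hypothetical)
</code>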
===== CPU types =====
If you want a list of those specifications,
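One way to list the feature tags advertised by the nodes (this relies on standard ''sinfo'' output formatting; the exact tags on our clusters may differ):

<code console>
(baobab)-[alberta@login2 ~]$ sinfo --format="%N %f"   # node names and their feature/constraint tags
</code>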
===== Single thread vs multi thread vs distributed jobs =====

There are three job categories, each with different needs:

^ Job type            ^ Number of CPUs used                         ^ Examples        ^
| **single threaded** | **one CPU**                                 | Python, plain R |
| **multi threaded**  | **several CPUs on a single node**           | OpenMP programs |
| **distributed**     | **several CPUs, possibly on several nodes** | MPI programs    |

There are also **hybrid** jobs, where each task of the job behaves like a multi-threaded job.
This is not very common and we won't cover this case.

In Slurm, you have two options for requesting CPU resources (see the sketch after this list):

  * ''--ntasks'': the number of tasks (processes); Slurm may spread them across several nodes
  * ''--cpus-per-task'': the number of CPUs allocated to each task; they always sit on the same node
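A minimal sketch of a multi-threaded request (the executable name is a placeholder):

<code bash>
#!/bin/sh
#SBATCH --ntasks 1           # one task (one process)
#SBATCH --cpus-per-task 8    # eight CPUs on the same node for that task

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}   # common pattern for OpenMP programs
srun ./my_threaded_program                       # placeholder executable
</code>

For a distributed (MPI) job you would instead increase ''--ntasks'' and keep ''--cpus-per-task'' at one.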
====== Submitting jobs ======
<code bash>
#SBATCH --output jobname-out.o%j
#SBATCH --ntasks 1           # number of tasks in your job. One by default
#SBATCH --cpus-per-task 1    # number of cpus for each task. One by default
#SBATCH --partition debug-cpu
#SBATCH --time 15:00         # maximum run time
</code>
===== GPGPU jobs =====

When we talk about [[https://
You can see on this table [[hpc:
</code>

Example to request two double precision GPUs:

<code bash>
#!/bin/sh
#SBATCH --partition=shared-gpu
#SBATCH --gpus=2
#SBATCH --constraint=DOUBLE_PRECISION_GPU

srun nvidia-smi
</code>
It's not possible to put two types in the GRES request, but you can ask for a specific compute capability, for example to request any GPU model with compute capability greater than or equal to 7.5:

<code bash>
#!/bin/sh
#SBATCH --partition=shared-gpu
#SBATCH --gpus=1
#SBATCH --constraint="
</code>
Example of script (see also https://
Use reservation via srun:

<code console>
(baobab)-[alberta@login2 ~]# srun --reservation
</code>

Use reservation via script sbatch:
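A minimal sketch of the sbatch variant (the reservation name is a placeholder; use the one communicated to you):

<code bash>
#!/bin/sh
#SBATCH --reservation=my_reservation   # hypothetical reservation name
#SBATCH --ntasks 1

srun ./my_program                      # placeholder executable
</code>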
If you want other information, please see the sacct manpage.

<note tip>By default the command displays a lot of fields.

<code console>
(yggdrasil)-[root@admin1
4 39919765.ba+ 1298188K
</code>
</note>
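To narrow the output down, you can select the fields yourself (these are standard ''sacct'' format fields; the prompt is illustrative):

<code console>
(yggdrasil)-[alberta@login1 ~]$ sacct --format=JobID,JobName,Partition,State,Elapsed,MaxRSS
</code>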
===== Energy usage =====

==== CPUs ====

You can see the energy consumption of your jobs on Yggdrasil (Baobab soon). The energy is shown in Joules using sacct.
<code console>
(yggdrasil)-[root@admin1 state]
------------------- ---------- ------------ -------------- -----------------
2023-10-12T09:48:28 COMPLETED 28478878
2023-10-12T09:48:28 COMPLETED 28478878.ex+
2023-10-12T09:
</code>
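A sketch of how to query the energy field explicitly (''ConsumedEnergy'' is a standard ''sacct'' field; the job ID is taken from the example above):

<code console>
(yggdrasil)-[alberta@login1 ~]$ sacct -j 28478878 --format=JobID,State,Elapsed,ConsumedEnergy
</code>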
==== GPUs ====

If you are interested in the power draw of a GPU, you can monitor it with ''nvidia-smi dmon'' on the GPU node:
+ | < | ||
+ | (baobab)-[root@gpu002 ~]$ nvidia-smi dmon --select p --id 0 | ||
- | You can also check how much CPU time (seconds) you have used on the cluster between since 2019-09-01 : | + | # gpu pwr gtemp mtemp |
- | + | # Idx W C C | |
- | <code console> | + | |
- | [brero@login2 ~]$ sreport cluster AccountUtilizationByUser user=$USER start=2019-09-01 -t Seconds | + | |
- | -------------------------------------------------------------------------------- | + | |
- | Cluster/ | + | |
- | Usage reported in CPU Seconds | + | |
- | -------------------------------------------------------------------------------- | + | |
- | | + | |
- | --------- --------------- --------- --------------- -------- -------- | + | |
- | | + | |
</ | </ | ||
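Since ''dmon'' prints one sample per second by default, you can approximate the energy in Joules by summing the power column over a sampling window (a minimal sketch; the window length and GPU id are placeholders):

<code bash>
# Sample GPU 0 power for 60 seconds, then sum watts x 1s to approximate Joules
nvidia-smi dmon --select p --id 0 --count 60 \
  | awk '$1 ~ /^[0-9]/ {joules += $2} END {print joules " J (approx.)"}'
</code>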
==== spart ====

''spart'' is a user-oriented tool that shows a summary of the partitions and their current usage.