{{METATOC 1-5}}
====== Slurm ======
  * Special public partitions:
    * ''debug-cpu''
    * ''public-interactive-gpu''
    * ''public-interactive-cpu''
    * ''public-longrun-cpu''
^ Partition               ^ Max time   ^ Max memory ^
| debug-cpu               |            |            |
| public-interactive-gpu  | 4 hours    |            |
| public-interactive-cpu  | 8 hours    | 10GB       |
| public-longrun-cpu      |            |            |
The minimum resource you can request is one core.

N.B.: no ''...''
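As a minimal illustration of such a request (the partition name is the ''debug-cpu'' entry from the table above; the command itself is only an example):

<code console>
# request the minimum: one task with one core on the debug-cpu partition
srun --partition=debug-cpu --ntasks=1 --cpus-per-task=1 hostname
</code>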
Example to request three titan cards: ''--gres=gpu:titan:3''

You can find a detailed list of the GPUs available on our clusters here:
  * [[http://...]]
===== CPU =====

<WRAP center round important 60%>
You can request all the CPUs of a compute node minus two, which are reserved for the OS. See [[https://...]].
</WRAP>
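As a sketch only: on a hypothetical 128-core node this means you can ask for at most 126 cores (the core count and program name below are illustrative, not taken from this page):

<code bash>
#!/bin/sh
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=126   # hypothetical 128-core node minus the 2 cores reserved for the OS

srun ./my_threaded_app        # placeholder for your own multi-threaded program
</code>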
===== CPU types ===== | ===== CPU types ===== | ||
If you want a list of those specifications, ...

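One generic way to list the features advertised by each node (a standard Slurm command, not specific to this page) is:

<code console>
# print node names together with their available features (CPU type, GPU model, ...)
sinfo -o "%30N %f"
</code>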
===== Single thread vs multi thread vs distributed jobs =====

There are three job categories, each with different needs:

^ Job type            ^ Number of CPUs used                       ^ Examples            ^
| **single threaded** | **one CPU**                               | Python, plain R     |
| **multi threaded**  | **several CPUs on a single node**         | OpenMP applications |
| **distributed**     | **CPUs spread over one or several nodes** | MPI applications    |

There are also **hybrid** jobs, where each task behaves like a multi-threaded job.
This is not very common and we won't cover this case here.

In Slurm, you have two options for requesting CPU resources, as shown in the sketch below:

  * ''--ntasks <n>'' to request ''n'' tasks (one CPU each by default), which can be spread over one or several nodes: this is what distributed jobs need
  * ''--cpus-per-task <n>'' to request ''n'' CPUs on the same node for each task: this is what multi-threaded jobs need

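A minimal sketch contrasting the two cases (the program names are placeholders, not part of this page):

<code bash>
#!/bin/sh
# multi-threaded job: one task using 8 CPUs on a single node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

srun ./my_openmp_app   # placeholder for a multi-threaded program
</code>

<code bash>
#!/bin/sh
# distributed job: 16 tasks of one CPU each, possibly spread over several nodes
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1

srun ./my_mpi_app      # placeholder for an MPI program
</code>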
====== Submitting jobs ======
#SBATCH --output jobname-out.o%j
#SBATCH --ntasks 1 # number of tasks in your job. One by default
#SBATCH --cpus-per-task 1 # number of cpus for each task. One by default
#SBATCH --partition debug-cpu
#SBATCH --time 15:00 # maximum run time.
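Once saved in a file (the name ''myjob.sh'' here is just an example), the script is submitted and monitored with the usual Slurm commands:

<code console>
# submit the batch script; sbatch prints the job id on success
sbatch myjob.sh

# list your own pending and running jobs
squeue -u $USER
</code>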
===== GPGPU jobs =====
When we talk about [[https://...|GPGPU]], ...

You can see in this table [[hpc:...]] the characteristics of each GPU model, in particular:

  * on-board memory in GB
  * single precision vs double precision floating-point performance
  * compute capability

Specify the GPU memory you need. For example, to request one GPU with at least 10 GB of memory:

<code console>
srun --gres=gpu:...
</code>

If you just need a GPU and you don't care about the type, don't specify it: you'll get the lowest model available.

<code bash>
#SBATCH --gpus=1
</code>
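The same kind of request can also be made interactively; a minimal sketch (using the ''shared-gpu'' partition that appears in the examples below):

<code console>
# ask for any single GPU and show which one was allocated
srun --partition=shared-gpu --gpus=1 nvidia-smi
</code>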

Example to request two GPUs of a double precision model:

<code bash>
#!/bin/sh
#SBATCH --partition=shared-gpu
#SBATCH --gpus=2
#SBATCH --constraint=DOUBLE_PRECISION_GPU

srun nvidia-smi
</code>

It's not possible to put two GPU types in the GRES request, but you can ask for a specific compute capability. For example, to request any GPU model with a compute capability greater than or equal to 7.5:

<code bash>
#!/bin/sh
#SBATCH --partition=shared-gpu
#SBATCH --gpus=1
#SBATCH --constraint="..."
</code>
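The exact constraint string is truncated above. As an illustration only, Slurm lets you OR several node features with ''|'', so a "compute capability 7.5 or newer" request could be expressed along these lines (the feature names are hypothetical, check the cluster's real node features):

<code bash>
# hypothetical feature names, shown only to illustrate the OR syntax of --constraint
#SBATCH --constraint="COMPUTE_CAPABILITY_7_5|COMPUTE_CAPABILITY_8_0"
</code>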
Example of script (see also https://...):
In this case, this means that node gpu002 has three Titan cards, and all of them are allocated.
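If you want to check this kind of allocation yourself, a generic way (standard Slurm, not specific to this page) is to inspect the node record:

<code console>
# show the GPUs configured on gpu002 and the resources currently allocated on it
scontrol show node gpu002 | grep -Ei "gres|alloctres"
</code>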
===== Interactive jobs =====
  * [[https://...]]

====== Reservation ======

To request a reservation, ...

List the existing reservations:

  (baobab)-[alberta@login2 ~]# scontrol show res

Use a reservation with ''srun'':

  (baobab)-[alberta@login2 ~]# srun --reservation <reservation name> ...

Use a reservation in an ''sbatch'' script with ''#SBATCH --reservation <reservation name>'':

  #!/bin/bash
  #SBATCH --job-name=test_unitaire
  #SBATCH --reservation test

  srun hostname
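To check which jobs are attached to a reservation (a standard Slurm command; the reservation name ''test'' is the one used in the script above):

<code console>
# list only the jobs running or pending in the reservation named "test"
squeue --reservation=test
</code>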
====== Job monitoring ======
If you want other information, please see the sacct manpage.

<note tip>By default the command displays a lot of fields. ...

<code console>
(yggdrasil)-[root@admin1 ...]$ sacct ...
...
 4 39919765.ba+   1298188K ...
</code>
</note>
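Since the exact ''sacct'' command is truncated above, here is a generic sketch showing how to restrict the output to a few fields (the job id is the one from the output fragment):

<code console>
# compact summary of one past job: id, state, elapsed time and peak memory
sacct -j 39919765 --format=JobID,State,Elapsed,MaxRSS
</code>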

===== Energy usage =====

==== CPUs ====

You can see the energy consumption of your jobs on Yggdrasil (and soon on Baobab). The energy is reported in Joules by ''sacct''.

<code console>
(yggdrasil)-[root@admin1 state]$ sacct ...
------------------- ---------- ------------ -------------- -----------------
2023-10-12T09:48:28  COMPLETED  28478878     ...
2023-10-12T09:48:28  COMPLETED  28478878.ex+ ...
2023-10-12T09:48:28  ...
</code>
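The command producing the output above is truncated in the page source. A generic sketch that queries the energy accounting field for a finished job (the job id is taken from the output above) could be:

<code console>
# ConsumedEnergy is reported in Joules when energy accounting is enabled
sacct -j 28478878 --format=End,State,JobID,ConsumedEnergy
</code>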
<note important>...</note>
==== GPUs ====

If you are interested in the power drawn by a GPU, you can monitor it with ''nvidia-smi dmon'' on the node where the GPU is located:

<code console>
(baobab)-[root@gpu002 ~]$ nvidia-smi dmon --select p --id 0
# gpu    pwr  gtemp  mtemp
# Idx      W      C      C
</code>