hpc:slurm
  * Special public partitions:
    * ''
    * ''
    * ''
    * ''
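If you are unsure which partitions you can submit to, the standard Slurm command ''sinfo'' lists them together with their time limits and node states. A minimal sketch (the prompt and partition name are only illustrative):

<code console>
# one summary line per partition
(baobab)-[alberta@login1 ~]$ sinfo --summarize

# details for a single partition, e.g. debug-cpu
(baobab)-[alberta@login1 ~]$ sinfo --partition=debug-cpu
</code>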
^ Partition
|debug-cpu
|public-interactive-gpu |4 hours
|public-interactive-cpu |8 hours |10GB |
|public-longrun-cpu
Minimum resource is one core.

N.B.: no ''
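For example, a minimal batch script requesting a single core (the minimum resource) on the ''debug-cpu'' partition could look like the sketch below; the job name, time limit and command are illustrative:

<code bash>
#!/bin/sh
#SBATCH --job-name=minimal_test    # illustrative job name
#SBATCH --partition=debug-cpu      # one of the public partitions listed above
#SBATCH --ntasks=1                 # a single task...
#SBATCH --cpus-per-task=1          # ...on a single core
#SBATCH --time=00:10:00            # must fit within the partition time limit

srun hostname                      # replace with your real command
</code>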
^ Partition
| private-<

To see details about a given partition, go to the web page https://

If you belong to one of these groups, please contact us to request access to the correct partition, as we have to add you manually.
Example to request three titan cards: ''<
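With recent Slurm versions, a request for three cards of a given GPU type is usually written with the ''--gpus'' option; a sketch, assuming the type name ''titan'' matches what the cluster advertises:

<code bash>
#SBATCH --partition=shared-gpu
#SBATCH --gpus=titan:3    # three GPUs of type "titan"
</code>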
| + | |||
| You can find a detailed list of GPUs available on our clusters here : | You can find a detailed list of GPUs available on our clusters here : | ||
  * [[http://
| + | ===== CPU ===== | ||
| + | <WRAP center round important 60%> | ||
| + | You can request all the CPUs of a compute node minus two that are reserved for the OS. See [[https:// | ||
| + | </ | ||
===== CPU types =====
===== GPGPU jobs =====

When we talk about [[https://

You can see on this table [[hpc:
#SBATCH --partition=shared-gpu
#SBATCH --gpus=1
#SBATCH --constraint="
</code>
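Putting it together, a minimal GPU batch script might look like the sketch below. The module name, executable and ''--constraint'' value are placeholders; the constraint must be one of the GPU features actually advertised by the cluster:

<code bash>
#!/bin/sh
#SBATCH --partition=shared-gpu
#SBATCH --time=00:15:00
#SBATCH --gpus=1                         # one GPU of any type
##SBATCH --constraint="SOME_GPU_FEATURE" # optional: uncomment and set a real feature name to restrict the GPU model

module load CUDA                         # illustrative module name
srun ./my_gpu_program                    # replace with your real executable
</code>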
Use reservation via srun:

(baobab)-[alberta@login2 ~]# srun --reservation

Use reservation via script sbatch:
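A sketch of the sbatch variant; ''my_resa'' stands for whatever reservation name was communicated to you, and the partition and command are illustrative:

<code bash>
#!/bin/sh
#SBATCH --reservation=my_resa    # hypothetical reservation name
#SBATCH --partition=shared-cpu   # illustrative partition
#SBATCH --time=01:00:00

srun ./my_program                # replace with your real command
</code>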
If you want other information, please see the sacct manpage.
| + | <note tip>by default the command displays a lot of fields. You can use this trick to display them correctly. Then you can move with left and right arrows to see the remaining fields | ||
| + | < | ||
| + | (yggdrasil)-[root@admin1 ~]$ sstat -j 39919765 --all | less -#2 -N -S | ||
| + | 1 JobID | ||
| + | 2 ------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- -------> | ||
| + | 3 39919765.ex+ | ||
| + | 4 39919765.ba+ | ||
| + | </ | ||
| + | |||
| + | </ | ||
===== Energy usage =====

==== CPUs ====

You can see the energy consumption of your jobs on Yggdrasil (Baobab soon). The energy is shown in Joules using sacct.
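A sketch of such a query; ''ConsumedEnergy'' is a standard sacct field and the job ID is only an example:

<code console>
(yggdrasil)-[alberta@login1 ~]$ sacct -j 12345678 --format=JobID,Elapsed,ConsumedEnergy
</code>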
<note important>

==== GPUs ====

If you are interested in the power usage of a GPU card your job is using, you can issue the following command while your job is running on a GPU node:

<code>
(baobab)-[root@gpu002
# gpu   pwr gtemp mtemp
# Idx     W     C     C
    0    62
</code>

===== Job history =====

You can see your job history using ''

<code>
[sagon@master ~] $ sacct -u $USER -S 2021-04-01
------------ ---------- ---------- ---------- ---------- ---------- --------
45517641
45517641.ba+
45517641.ex+
45517641.0
45518119
45518119.ba+
45518119.ex+
</code>

===== Report and statistics with sreport =====

To get reporting about your past jobs, you can use ''

Here are some examples that can give you a starting point:

To get the number of jobs you ran (you <=> ''

<code console>
[brero@login2
--------------------------------------------------------------------------------
Job Sizes 2018-01-01T00:
Units are in number of jobs ran
--------------------------------------------------------------------------------
--------- --------- ------------- ------------- ------------- ------------- ------------- ------------
</code>
| - | You can see how many jobs were run (grouped by allocated CPU). You can also see we specified an extra day for the //end date// '' | ||
| - | < | ||
| - | You can also check how much CPU time (seconds) you have used on the cluster between since 2019-09-01 : | ||
| - | |||
| - | <code console> | ||
| - | [brero@login2 ~]$ sreport cluster AccountUtilizationByUser user=$USER start=2019-09-01 -t Seconds | ||
| - | -------------------------------------------------------------------------------- | ||
| - | Cluster/ | ||
| - | Usage reported in CPU Seconds | ||
| - | -------------------------------------------------------------------------------- | ||
| - | Cluster | ||
| - | --------- --------------- --------- --------------- -------- -------- | ||
| - | | ||
| - | </ | ||
| - | |||
| - | In this example, we added the time '' | ||
| - | |||
| - | Please note : | ||
| - | * By default, the CPU time is in Minutes | ||
| - | * It takes up to an hour for Slurm to upate this information in its database, so be patient | ||
| - | * If you don't specify a start, nor an end date, yesterday' | ||
| - | * The CPU time is the time that was allocated to you. It doesn' | ||
| - | |||
| - | Tip : If you absolutely need a report including your job that ran on the same day, you can override the default end date by forcing tomorrow' | ||
| - | |||
| - | < | ||
| - | sreport cluster AccountUtilizationByUser user=$USER start=2019-09-01 end=$(date --date=" | ||
| - | </ | ||
==== spart ====

<note warning>

''