hpc:slurm
hpc:slurm [2024/06/14 14:17] – [CPU] Yann Sagon → hpc:slurm [2025/04/08 17:05] (current) – [Clusters partitions] Adrien Albert
  * Special public partitions:
    * ''
    * ''
    * ''
    * ''
^ Partition ^ ^ ^
|debug-cpu | | |
|public-interactive-gpu |4 hours | |
|public-interactive-cpu |8 hours |10GB |
|public-longrun-cpu | | |
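If you just need a quick interactive shell on one of these partitions, a minimal sketch looks like this (the partition name comes from the table above; the core, memory, and time values are illustrative and must stay within the listed limits):

<code console>
# request 1 core, 4GB and 1 hour on the public-interactive-cpu partition
[user@login2 ~]$ salloc --partition=public-interactive-cpu --cpus-per-task=1 --mem=4G --time=01:00:00
</code>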
Minimum resource is one core.

N.B.: no ''
===== GPGPU jobs =====
When we talk about [[https://
You can see on this table [[hpc:
#SBATCH --partition=shared-gpu
#SBATCH --gpus=1
#SBATCH --constraint="
</code>
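Putting these options together, a complete minimal GPU job script could look like the following sketch (the job name and time limit are illustrative; if you need a specific GPU model, the ''--constraint'' value must come from the GPU models table mentioned above):

<code bash>
#!/bin/sh
#SBATCH --job-name=gpu-test       # illustrative name
#SBATCH --partition=shared-gpu
#SBATCH --gpus=1
#SBATCH --time=00:10:00           # illustrative time limit

# show the GPU(s) that were allocated to the job
srun nvidia-smi
</code>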
If you want other information, please see the sacct manpage.

<note tip>By default the command displays a lot of fields. You can use this trick to display them correctly; then you can move with the left and right arrow keys to see the remaining fields:
<code console>
(yggdrasil)-[root@admin1 ~]$ sstat -j 39919765 --all | less -#2 -N -S
      1 JobID
      2 ------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------->
      3 39919765.ex+
      4 39919765.ba+
</code>
</note>
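Alternatively, instead of paging through every column, you can restrict ''sstat'' to just the fields you need with ''--format'' (the job id is the same illustrative one as above; the field names are standard sstat fields):

<code console>
(yggdrasil)-[root@admin1 ~]$ sstat -j 39919765 --format=JobID,MaxRSS,AveCPU,AveRSS
</code>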
===== Energy usage =====
==== CPUs ====
</code>
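As a sketch (assuming the compute nodes report energy counters to Slurm; the job id is illustrative), the energy accounted for a finished job can be queried through the standard ''ConsumedEnergy'' field of ''sacct'':

<code console>
[user@login2 ~]$ sacct -j 45517641 --format=JobID,Elapsed,ConsumedEnergy
</code>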
===== Job history =====

You can see your job history using ''sacct'':

<code console>
[sagon@master ~] $ sacct -u $USER -S 2021-04-01
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
45517641
45517641.ba+
45517641.ex+
45517641.0
45518119
45518119.ba+
45518119.ex+
</code>
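''sacct'' can also filter this history further, for instance by job state and with an explicit end date (the dates are illustrative; ''-E'' and ''--state'' are standard sacct options):

<code console>
[sagon@master ~] $ sacct -u $USER -S 2021-04-01 -E 2021-05-01 --state=FAILED
</code>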
===== Report and statistics with sreport =====

To get reporting about your past jobs, you can use ''sreport''.

Here are some examples that can give you a starting point:

To get the number of jobs you ran (you <=> ''$USER'') during 2018:

<code console>
[brero@login2 ~]$ sreport job sizesbyaccount user=$USER PrintJobCount start=2018-01-01 end=2019-01-01

--------------------------------------------------------------------------------
Job Sizes 2018-01-01T00:00:00 - 2018-12-31T23:59:59
Units are in number of jobs ran
--------------------------------------------------------------------------------
  Cluster
--------- --------- ------------- ------------- ------------- ------------- ------------- ------------
</code>
You can see how many jobs were run (grouped by allocated CPU). You can also see we specified an extra day for the //end date// (''end=2019-01-01''): the reporting period stops at the end of the day before the end date, so this covers the whole of 2018.

You can also check how much CPU time (in seconds) you have used on the cluster since 2019-09-01:

<code console>
[brero@login2 ~]$ sreport cluster AccountUtilizationByUser user=$USER start=2019-09-01 -t Seconds
--------------------------------------------------------------------------------
Cluster/Account/User Utilization
Usage reported in CPU Seconds
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name     Used   Energy
--------- --------------- --------- --------------- -------- --------
</code>
In this example, we added the ''-t Seconds'' option to report the usage in seconds.

Please note:
  * By default, the CPU time is in minutes
  * It takes up to an hour for Slurm to update this information in its database, so be patient
  * If you don't specify a start nor an end date, yesterday's usage is reported
  * The CPU time is the time that was allocated to you. It doesn't necessarily mean it was all effectively used for computation

Tip: If you absolutely need a report including your jobs that ran on the same day, you can override the default end date by forcing tomorrow's date:

<code console>
sreport cluster AccountUtilizationByUser user=$USER start=2019-09-01 end=$(date --date="tomorrow" +%F) -t Seconds
</code>