===== Comparison of sreport, sacct, and sshare =====

We use **sreport** as our primary accounting reference. However, you may find other tools useful for specific purposes. Here's a comparison:

  * **sacct**: Displays only account jobs, excluding time allocated via reservations. If duplicate jobs exist, only one is shown.
  * **sreport**: By default, jobs with wall times overlapping the report's time range are truncated. For reservation-based jobs, the requested idle time is distributed among all users with access to the reservation.
  * **sshare**: Not recommended for accounting purposes; displayed values are adjusted based on fairshare calculations.
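
As an illustration, here is a minimal sketch of how you might query the same period with each tool and compare the numbers; the dates and the account name are placeholders:

<code>
# sreport: per-user utilization of one account, reported in hours
sreport cluster AccountUtilizationByUser account=<account> start=2025-01-01 end=2025-02-01 -t hours

# sacct: the raw job records behind those numbers (allocations only)
sacct --accounts=<account> --starttime=2025-01-01 --endtime=2025-02-01 -X --format=JobID,User,Elapsed,AllocTRES%60

# sshare: fairshare view only; not an accounting reference
sshare --accounts=<account> --all
</code>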
  
===== Resource accounting uniformization =====
For example, a job using a GPU with a weight of 10 for 2 hours and memory equivalent to 5 CPU hours would be billed as 25 CPU hours. This approach ensures consistent, transparent, and fair resource accounting across all heterogeneous components of the cluster.
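
To make the arithmetic of that example explicit (the weights are the illustrative ones from the sentence above, not those of a specific partition):

<code>
GPU   : 1 GPU x weight 10 x 2 h = 20 CPU hours
Memory: equivalent of              5 CPU hours
Billed: 20 + 5                  = 25 CPU hours
</code>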
  
You can see the details of the conversion by looking at the parameters of any partition on any of the clusters: the same conversion table is used everywhere.
  
<code>
(bamboo)-[root@slurm1 ~]$ scontrol show partition debug-cpu | grep TRESBillingWeights | tr "," "\n"
   TRESBillingWeights=CPU=1.0
Mem=0.25G
GRES/gpu=1
GRES/gpu:nvidia_a100-pcie-40gb=5
GRES/gpu:nvidia_a100_80gb_pcie=8
GRES/gpu:nvidia_geforce_rtx_2080_ti=2
GRES/gpu:nvidia_geforce_rtx_3080=3
GRES/gpu:nvidia_geforce_rtx_3090=5
GRES/gpu:nvidia_geforce_rtx_4090=8
GRES/gpu:nvidia_rtx_a5000=5
GRES/gpu:nvidia_rtx_a5500=5
GRES/gpu:nvidia_rtx_a6000=8
GRES/gpu:nvidia_titan_x=1
GRES/gpu:tesla_p100-pcie-12gb=1
</code>

Here you can see, for example, that using an nvidia_a100-pcie-40gb GPU for 1 hour is equivalent in cost to using 5 CPU hours.
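
If you want to check what one of your own jobs was actually billed, the ''AllocTRES'' field shown by ''sacct'' includes a ''billing='' entry when billing weights are configured; the job ID below is a placeholder:

<code>
(bamboo)-[user@login1 ~]$ sacct -j <jobid> -X --format=JobID,Elapsed,AllocTRES%60
</code>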

===== Resources available for research groups =====

Research groups that have invested in the HPC cluster by purchasing private CPU or GPU nodes benefit from high-priority access to these resources.

While these nodes remain available to all users, owners receive priority scheduling and a designated number of included compute hours per year.

To check the details of their owned resources, users can run the script ''ug_getNodeCharacteristicsSummary.sh'', which provides a summary of the node characteristics within the cluster.
 +
Example:
<code>
ug_getNodeCharacteristicsSummary.sh --partitions private-<group>-gpu private-<group>-cpu --cluster <cluster> --summary
host    sn             cpu    mem    gpunumber    gpudeleted  gpumodel                      gpumemory  purchasedate      months remaining in prod. (Jan 2025)    billing
------  -----------  -----  -----  -----------  ------------  --------------------------  -----------  --------------  --------------------------------------  ---------
cpu084  N-20.02.151     36    187            0                                                    0  2020-02-01                                                   79
cpu085  N-20.02.152     36    187            0                                                    0  2020-02-01                                                   79
cpu086  N-20.02.153     36    187            0                                                    0  2020-02-01                                                   79
cpu087  N-20.02.154     36    187            0                                                    0  2020-02-01                                                   79
cpu088  N-20.02.155     36    187            0                                                    0  2020-02-01                                                   79
cpu089  N-20.02.156     36    187            0                                                    0  2020-02-01                                                   79
cpu090  N-20.02.157     36    187            0                                                    0  2020-02-01                                                   79
cpu209  N-17.12.104     20     94            0                                                    0  2017-12-01                                                   41
cpu210  N-17.12.105     20     94            0                                                    0  2017-12-01                                                   41
cpu211  N-17.12.106     20     94            0                                                    0  2017-12-01                                                   41
cpu212  N-17.12.107     20     94            0                                                    0  2017-12-01                                                   41
cpu213  N-17.12.108     20     94            0                                                    0  2017-12-01                                                   41
cpu226  N-19.01.161     20     94            0                                                    0  2019-01-01                                                   41
cpu227  N-19.01.162     20     94            0                                                    0  2019-01-01                                                   41
cpu228  N-19.01.163     20     94            0                                                    0  2019-01-01                                                   41
cpu229  N-19.01.164     20     94            0                                                    0  2019-01-01                                                   41
cpu277  N-20.11.131    128    503            0                                                    0  2020-11-01                                          10        251
gpu002  S-16.12.215     12    251            5              NVIDIA TITAN X (Pascal)           12288  2016-12-01                                                   84
gpu012  S-16.12.216     24    251            8              NVIDIA GeForce RTX 2080 Ti        11264  2016-12-01                                                  108
gpu017  S-20.11.146    128    503            8              NVIDIA GeForce RTX 3090           24576  2020-11-01                                          10        299
gpu023  S-21.09.121    128    503            8              NVIDIA GeForce RTX 3080           10240  2021-09-01                                          20        283
gpu024  S-21.09.122    128    503            8              NVIDIA GeForce RTX 3080           10240  2021-09-01                                          20        283
gpu044  S-23.01.148    128    503            8              NVIDIA RTX A5000                  24564  2023-01-01                                          36        299
gpu047  S-23.12.113    128    503            8              NVIDIA RTX A5000                  24564  2023-12-01                                          47        299
gpu049  S-24.10.140    128    384            8              NVIDIA GeForce RTX 4090           24564  2024-10-01                                          57        291

============================================================ Summary ============================================================
Total CPUs: 1364 Total CPUs memory[GB]: 6059 Total GPUs: 61 Total GPUs memory[MB]: 142300 Billing: 1959 CPUhours per year: 10.30M
</code>

How to read the output:
  * **host**: the hostname of the compute node
  * **sn**: the serial number of the node
  * **cpu**: the number of CPUs available in the node
  * **mem**: the quantity of memory on the node in GB
  * **gpunumber**: the number of GPU cards on the node
  * **gpudeleted**: the number of GPU cards out of order
  * **gpumodel**: the GPU model
  * **gpumemory**: the GPU memory in MB per GPU card
  * **purchasedate**: the purchase date of the node
  * **months remaining in prod. (Jan 2025)**: the number of months the node remains the property of the research group; the reference date is indicated in parentheses. In this example it is January 2025.
  * **billing**: the [[hpc:accounting#resource_accounting_uniformization|billing]] value of the compute node

You can modify the reference year if you want to "simulate" the hardware you'll have in your private partition in a given year. To do so, use the script's ''<nowiki>--reference-year</nowiki>'' argument.
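
For example, to project what your private partition will contain in 2027 (the partition and cluster names are placeholders, as in the example above):

<code>
ug_getNodeCharacteristicsSummary.sh --partitions private-<group>-cpu --cluster <cluster> --summary --reference-year 2027
</code>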
  
===== Job accounting =====

==== OpenXDMoD ====
  
  
  
We wrote a helper that you can use to get your past resource usage on the cluster. This script can display the resource utilization:
  * for each user of a given account (PI)
  * the total usage of a given account (PI)
  
<code>
(baobab)-[sagon@login1 ~]$ ug_slurm_usage_per_user.py -h
usage: ug_slurm_usage_per_user.py [-h] [--user USER] [--start START] [--end END] [--pi PI] [--cluster CLUSTER] [--all_users] [--report_type {user,account}] [--time_format TIME_FORMAT] [--verbose]

Retrieve HPC utilization statistics for a user within a specified time range.

options:
  -h, --help            show this help message and exit
  --user USER           The username to retrieve utilization for.
  --start START         Start date (default: first day of current month).
  --end END             End date (default: current time).
  --pi PI               Specify the PI (account) manually (optional). If not provided, it will be auto-detected.
  --cluster CLUSTER     Specify the cluster manually (optional). If not provided, all the clusters will be selected.
  --all_users           If you want to see utilization of all users of a given account (PI)
  --report_type {user,account}
                        Report type: UserUtilizationByAccount or AccountUtilizationByUser
  --time_format TIME_FORMAT
                        Specify the time formt for the reporting. Default is by hours. You can use Minutes or Seconds
  --verbose             Print verbose msgs
</code>
  
By default, when you run this script, it prints your past usage for the current month, for all the accounts you are a member of.
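
For example, a PI could review the per-user usage of their whole account over a given period like this (the dates are placeholders; the flags come from the help output above):

<code>
ug_slurm_usage_per_user.py --start 2025-01-01 --end 2025-02-01 --all_users --report_type account
</code>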
  
  