{{METATOC 1-8}}
====== Utilization and accounting ======
When you submit jobs, they use physical resources such as CPUs, memory, network, GPUs, energy, etc. We keep track of the usage of some of those resources. This page explains how to consult your resource usage with the several tools available for this purpose: sacct, sreport and OpenXDMoD.

===== Resource accounting uniformization =====
  
We apply uniform resource accounting by converting GPU hours and memory usage into CPU-hour equivalents, using the [[https://slurm.schedmd.com/tres.html|TRESBillingWeights]] feature provided by SLURM.
A CPU hour represents one hour of processing time on a single CPU core.
  
We use this model because our cluster is heterogeneous, and both the computational power and the cost of GPUs vary significantly depending on the model. To ensure fairness and transparency, each GPU type is assigned a weight that reflects its relative performance compared to a CPU core. Similarly, memory usage is converted into CPU-hour equivalents based on predefined weights.
  
We also bill memory usage because some jobs consume very little CPU but require large amounts of memory, which means an entire compute node is occupied. This ensures that jobs using significant memory resources are accounted for fairly.
  
Example: a job using a GPU with a weight of 10 for 2 hours, plus memory equivalent to 5 CPU hours, would be billed as 10 × 2 + 5 = 25 CPU hours. This approach guarantees consistent, transparent, and fair resource accounting across all heterogeneous components of the cluster.
  
You can check the up-to-date conversion details by inspecting the parameters of any partition on the clusters. The same conversion table is applied everywhere.
  
<code>
[...]
</code>
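For example, you can display the configured weights with ''scontrol'' (a minimal sketch; the partition name is a placeholder and the weights shown in the output are illustrative, not the actual values):

<code>
# Show the billing weights configured on a partition (output values are illustrative)
scontrol show partition <partition> | grep -o 'TRESBillingWeights=[^ ]*'
TRESBillingWeights=CPU=1.0,Mem=0.25G,GRES/gpu:rtx3090=6.0
</code>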

===== Resources available for research groups =====
  
Research groups that have invested in the HPC cluster by purchasing private CPU or GPU nodes benefit from **high-priority access** to these resources.
  
Although these nodes remain available to all users, owners receive **priority scheduling** and a predefined annual allocation of compute hours, referred to as [[accounting#resource_accounting_uniformization|billings]].
The advantage of this approach is flexibility: you are free to use any resource on any cluster, rather than being restricted to your own nodes. When doing so, your billings will be consumed.
  
To view details of owned resources, users can run the script ''ug_getNodeCharacteristicsSummary.py'', which provides a summary of the node characteristics within the cluster.
  
**Note:** This model ensures **fairness** across all users. Even if some groups own nodes, resources remain shared. Usage beyond the included billings will be **charged according to the standard accounting model**, ensuring equitable access for everyone.
  
Output example of the script:
<code>
ug_getNodeCharacteristicsSummary.py --partitions private-<group>-gpu private-<group>-cpu --cluster <cluster> --summary
host    sn             cpu    mem    gpunumber    gpudeleted  gpumodel                      gpumemory  purchasedate      months remaining in prod. (Jan 2025)    billing
------  -----------  -----  -----  -----------  ------------  --------------------------  -----------  --------------  --------------------------------------  ---------
cpu084  N-20.02.151     36    187            0                                                    0  2020-02-01                                                   79
[...]
cpu088  N-20.02.155     36    187            0                                                    0  2020-02-01                                                   79
[...]
cpu226  N-19.01.161     20     94            0                                                    0  2019-01-01                                                   41
[...]
cpu229  N-19.01.164     20     94            0                                                    0  2019-01-01                                                   41
cpu277  N-20.11.131    128    503            0                                                    0  2020-11-01                                          10        251
[...]
</code>

  * **purchasedate**: the purchase date of the node
  * **months remaining in prod. (Jan 2025)**: the number of months the node remains the property of the research group; the reference date is indicated in parentheses, here January 2025.
  * **billing**: the [[hpc:accounting#resource_accounting_uniformization|billing]] value of the compute node
  
You can modify the reference year if you want to "simulate" the hardware you'll have in your private partition in a given year. To do so, use the ''<nowiki>--reference-year</nowiki>'' argument of the script.
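For example, to preview the nodes that would still be in a private partition in 2027 (a sketch based on the invocation shown above; the year, group and cluster are placeholders, and the exact option syntax may differ, see the script's ''<nowiki>--help</nowiki>''):

<code>
ug_getNodeCharacteristicsSummary.py --partitions private-<group>-cpu --cluster <cluster> --summary --reference-year 2027
</code>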
  
===== Job accounting =====
OpenXDMoD is integrated into our information system. When you connect to it, you get the "user" profile and the data are filtered by your user by default. If you are a PI, you can ask us to change your profile to PI.
  
<note important>OpenXDMoD currently supports only CPUh and GPUh metrics, not the [[accounting#resource_accounting_uniformization|billing]] metrics (yet?). For this reason, you need to use [[accounting#report_and_statistics_with_sreport|sreport or our script]] if you want to view the billed metrics.</note>
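If you want to query the billed usage directly with ''sreport'', an invocation along these lines should work (a sketch; the account name and start date are placeholders):

<code>
# Billing TRES (CPU-hour equivalents) consumed per user under an account
sreport -t Hours --tres=billing cluster AccountUtilizationByUser account=<account> start=2025-01-01
</code>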

==== sacct ====
You can see your job history using ''sacct''.
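For example (a minimal sketch; the format fields are one reasonable choice, see ''man sacct'' for the full list):

<code>
# Job history since the start of the current month, including the billing TRES
sacct --start $(date +%Y-%m-01) --format=JobID,JobName,Elapsed,State,AllocTRES%80
</code>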
  
  
We wrote a helper that you can use to get your past resource usage on the cluster. This script can display the resource utilization:
  * for each user of a given account (PI)
  * as a total for a given account (PI)
  
<code>
(baobab)-[sagon@login1] $ ug_slurm_usage_per_user.py --help
usage: ug_slurm_usage_per_user.py [-h] [--user USER] [--start START] [--end END] [--pi PI] [--group GROUP] [--cluster {baobab,yggdrasil,bamboo}] [--all-users] [--aggregate] [--report-type {user,account}]
                                  [--time-format {Hours,Minutes,Seconds}] [--verbose]

Retrieve HPC utilization statistics for a user or group of users.

options:
  -h, --help            show this help message and exit
  --user USER           Username to retrieve usage for.
  --start START         Start date (default: first of month).
  --end END             End date (default: now).
  --pi PI               Specify PI manually.
  --group GROUP         Specify a group name to get all PIs belonging to it.
  --cluster {baobab,yggdrasil,bamboo}
                        Cluster name (default: all clusters).
  --all-users           Include all users under the PI account.
  --aggregate           Aggregate the usage per user.
  --report-type {user,account}
                        Type of report: user (default) or account.
  --time-format {Hours,Minutes,Seconds}
                        Time format: Hours (default), Minutes, or Seconds.
  --verbose             Verbose output.
</code>
  
By default, when you run this script, it prints your past usage for the current month, for all the accounts you are a member of.

=== Usage details of a given PI ===
<code>
(baobab)-[sagon@login1] $ ug_slurm_usage_per_user.py --pi **** --report-type account --start 2025-01-01
--------------------------------------------------------------------------------

Cluster/Account/User Utilization 2025-01-01T00:00:00 - 2025-12-08T13:59:59 (29512800 secs)

Usage reported in TRES Hours

--------------------------------------------------------------------------------

Cluster    Login    Proper Name    Account    TRES Name      Used
---------  -------  -------------  ---------  -----------  ------
bamboo                             krusek     billing      176681
baobab                             krusek     billing      961209
yggdrasil                          krusek     billing           0
Total usage: 1.14M
</code>

=== Usage details of all PIs associated with a private group ===

Usage example to see the resource usage from the beginning of 2025 for all the PIs and associated users of the group private_xxx, which owns several compute nodes:
<code>
(baobab)-[sagon@login1 ~]$ ug_slurm_usage_per_user.py --group private_xxx --start=2025-01-01 --report-type=account
--------------------------------------------------------------------------------

Cluster/Account/User Utilization 2025-01-01T00:00:00 - 2025-08-21T14:59:59 (20095200 secs)

Usage reported in TRES Hours

--------------------------------------------------------------------------------

Cluster    Login    Proper Name    Account    TRES Name       Used
---------  -------  -------------  ---------  -----------  -------
baobab                             PI1        billing        56134
yggdrasil                          PI1        billing       105817
bamboo                             PI2        billing         5416
baobab                             PI2        billing      1517001
yggdrasil                          PI2        billing        23775
bamboo                             PI3        billing            0
baobab                             PI3        billing      1687963
yggdrasil                          PI3        billing      1344599
[...]
Total usage: 7.36M
</code>

=== Aggregate usage by all users of a given PI ===
<code>
$ ug_slurm_usage_per_user.py --pi ***** --report-type account --start 2025-01-01 --all-users --aggregate
--------------------------------------------------------------------------------

Cluster/Account/User Utilization 2025-01-01T00:00:00 - 2025-12-08T13:59:59 (29512800 secs)

Usage reported in TRES Hours

--------------------------------------------------------------------------------

Login       Used
--------  ------
a***u    547746
d***i    272634
d***on    91178
d***l     86860
e***j     60649
v***d0    37962
w***r     29886
s***o      9120
k***k      1853
m***l         1
Total usage: 1.14M
</code>
  
  