hpc:accounting
Differences
This shows you the differences between two versions of the page.
hpc:accounting [2023/11/10 14:29] – created Yann Sagon | hpc:accounting [2025/03/13 09:57] (current) – [Report and statistics with sreport] Yann Sagon | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | {{METATOC 1-5}} | ||
====== Utilization and accounting ====== | ====== Utilization and accounting ====== | ||
When you submit jobs, they use physical resources such as CPUs, memory, network, GPUs, energy, etc. We keep track of the usage of some of those resources. This page explains how to consult your usage of these resources. We have several tools that you can use to consult your utilization: | When you submit jobs, they use physical resources such as CPUs, memory, network, GPUs, energy, etc. We keep track of the usage of some of those resources. This page explains how to consult your usage of these resources. We have several tools that you can use to consult your utilization: | ||
+ | |||
+ | |||
+ | ===== Comparison of sreport, sacct, and sshare ===== | ||
+ | We use **sreport** as our primary accounting reference. However, you may find other tools useful for specific purposes. Here's a comparison: | ||
+ | |||
+ | * **sacct**: Displays only account jobs, excluding time allocated via reservations. If duplicate jobs exist, only one is shown. | ||
+ | * **sreport**: | ||
+ | * **sshare**: Not recommended for accounting purposes; displayed values are adjusted based on fairshare calculations. | ||
+ | |||
+ | ===== Resource accounting uniformization ===== | ||
+ | |||
+ | We charge usage uniformly by converting GPU hours and memory usage into CPU hour equivalents, | ||
+ | |||
+ | A CPU hour represents one hour of processing time by a single CPU core. | ||
+ | |||
+ | For GPUs, Slurm assigns a conversion factor to each GPU model through TRESBillingWeights (see the conversion table below), reflecting its computational performance relative to a CPU. Similarly, memory usage is converted into CPU hour equivalents based on predefined weights, so that jobs consuming significant memory resources are accounted for fairly. | ||
+ | |||
+ | For example, a job using a GPU with a weight of 10 for 2 hours and memory equivalent to 5 CPU hours would be billed as 25 CPU hours. This approach ensures consistent, transparent, | ||
+ | |||
+ | You can see the details of the conversion by looking at the parameters of any partition on any of the clusters. We use the same conversion table everywhere. | ||
+ | |||
+ | < | ||
+ | (bamboo)-[root@slurm1 ~]$ scontrol show partition debug-cpu | grep TRESBillingWeights | tr "," | ||
+ | | ||
+ | Mem=0.25G | ||
+ | GRES/gpu=1 | ||
+ | GRES/ | ||
+ | GRES/ | ||
+ | GRES/ | ||
+ | GRES/ | ||
+ | GRES/ | ||
+ | GRES/ | ||
+ | GRES/ | ||
+ | GRES/ | ||
+ | GRES/ | ||
+ | GRES/ | ||
+ | GRES/ | ||
+ | </ | ||
+ | |||
+ | Here you can see, for example, that using an nvidia_a100-pcie-40gb GPU for 1 hour costs the equivalent of 5 CPU hours. | ||
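The conversion described in this section can be sketched in Python. This is a minimal illustration, not the scheduler's actual code: it assumes the weighted terms are simply summed, as in the worked example of this section. The function name `cpu_hour_equivalent`, its parameters, and the CPU weight of 1.0 are assumptions made for illustration; the memory weight (0.25 per GB) and the A100 weight (5) are the values quoted on this page.

```python
def cpu_hour_equivalent(hours, cpus=0, mem_gb=0.0, gpus=0,
                        gpu_weight=1.0, mem_weight_per_gb=0.25,
                        cpu_weight=1.0):
    """Return the usage of a job expressed in CPU hour equivalents.

    Assumption: the billing is the sum of each resource multiplied by
    its weight, multiplied by the job duration in hours.
    cpu_weight=1.0 is an assumed value, not read from the cluster.
    """
    per_hour = (cpus * cpu_weight
                + mem_gb * mem_weight_per_gb
                + gpus * gpu_weight)
    return hours * per_hour


# One GPU with a weight of 10 for 2 hours, plus 10 GB of memory
# (10 GB * 0.25 * 2 h = 5 CPU hour equivalents).
example = cpu_hour_equivalent(hours=2, gpus=1, gpu_weight=10, mem_gb=10)

# One nvidia_a100-pcie-40gb (weight 5) for 1 hour, ignoring CPU and memory.
a100 = cpu_hour_equivalent(hours=1, gpus=1, gpu_weight=5)
```

With these inputs, `example` reproduces the 25 CPU hours of the worked example and `a100` the 5 CPU hours quoted for the A100.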
+ | |||
+ | ===== Resources available for research group ===== | ||
+ | |||
+ | |||
+ | |||
+ | Research groups that have invested in the HPC cluster by purchasing private CPU or GPU nodes benefit from high priority access to these resources. | ||
+ | |||
+ | While these nodes remain available to all users, owners receive priority scheduling and a designated number of included compute hours per year. | ||
+ | |||
+ | To check the details of their owned resources, users can run the script '' | ||
+ | |||
+ | Example: | ||
+ | < | ||
+ | ug_getNodeCharacteristicsSummary.sh --partitions private-< | ||
+ | host sn | ||
+ | ------ | ||
+ | cpu084 | ||
+ | cpu085 | ||
+ | cpu086 | ||
+ | cpu087 | ||
+ | cpu088 | ||
+ | cpu089 | ||
+ | cpu090 | ||
+ | cpu209 | ||
+ | cpu210 | ||
+ | cpu211 | ||
+ | cpu212 | ||
+ | cpu213 | ||
+ | cpu226 | ||
+ | cpu227 | ||
+ | cpu228 | ||
+ | cpu229 | ||
+ | cpu277 | ||
+ | gpu002 | ||
+ | gpu012 | ||
+ | gpu017 | ||
+ | gpu023 | ||
+ | gpu024 | ||
+ | gpu044 | ||
+ | gpu047 | ||
+ | gpu049 | ||
+ | |||
+ | ============================================================ Summary ============================================================ | ||
+ | Total CPUs: 1364 Total CPUs memory[GB]: 6059 Total GPUs: 61 Total GPUs memory[MB]: 142300 Billing: 1959 CPUhours per year: 10.30M | ||
+ | </ | ||
+ | |||
+ | How to read the output: | ||
+ | * **host**: the hostname of the compute node | ||
+ | * **sn**: the serial number of the node | ||
+ | * **cpu**: the number of CPUs available in the node | ||
+ | * **mem**: the quantity of memory on the node in GB | ||
+ | * **gpunumber**: | ||
+ | * **gpudeleted**: | ||
+ | * **gpumodel**: | ||
+ | * **gpumemory**: | ||
+ | * **purchasedate**: | ||
+ | * **months remaining in prod. (Jan 2025)**: the number of months the node remains the property of the research group. The reference date is given in parentheses; in this example it is January 2025. | ||
+ | * **billing**: | ||
+ | |||
+ | You can modify the reference year if you want to " | ||
+ | |||
+ | ===== Job accounting ===== | ||
+ | ==== OpenXDMoD ==== | ||
+ | |||
+ | We track the job usage of our clusters here: https:// | ||
+ | |||
+ | We have a tutorial explaining some of the features: [[https:// | ||
+ | |here]] | ||
+ | |||
+ | Openxdmod is integrated into our SI. When you connect to it, you'll get the profile " | ||
+ | |||
+ | ==== sacct ==== | ||
+ | You can see your job history using '' | ||
+ | |||
+ | < | ||
+ | [sagon@master ~] $ sacct -u $USER -S 2021-04-01 | ||
+ | | ||
+ | ------------ ---------- ---------- ---------- ---------- ---------- -------- | ||
+ | 45517641 | ||
+ | 45517641.ba+ | ||
+ | 45517641.ex+ | ||
+ | 45517641.0 | ||
+ | 45518119 | ||
+ | 45518119.ba+ | ||
+ | 45518119.ex+ | ||
+ | </ | ||
+ | |||
+ | |||
+ | ==== Report and statistics with sreport ==== | ||
+ | |||
+ | To get reporting about your past jobs, you can use '' | ||
+ | |||
+ | |||
+ | We wrote a helper script that reports your past resource usage on the cluster. It can display the resource utilization: | ||
+ | * for each user of a given account (PI) | ||
+ | * as a total for a given account (PI) | ||
+ | |||
+ | < | ||
+ | (baobab)-[sagon@login1 ~]$ ug_slurm_usage_per_user.py -h | ||
+ | usage: ug_slurm_usage_per_user.py [-h] [--user USER] [--start START] [--end END] [--pi PI] [--cluster CLUSTER] [--all_users] [--report_type {user, | ||
+ | |||
+ | Retrieve HPC utilization statistics for a user within a specified time range. | ||
+ | |||
+ | options: | ||
+ | -h, --help | ||
+ | --user USER The username to retrieve utilization for. | ||
+ | --start START Start date (default: first day of current month). | ||
+ | --end END End date (default: current time). | ||
+ | --pi PI | ||
+ | --cluster CLUSTER | ||
+ | --all_users | ||
+ | --report_type {user, | ||
+ | Report type: UserUtilizationByAccount or AccountUtilizationByUser | ||
+ | --time_format TIME_FORMAT | ||
+ | Specify the time formt for the reporting. Default is by hours. You can use Minutes or Seconds | ||
+ | --verbose | ||
+ | </ | ||
+ | |||
+ | By default, the script prints your usage for the current month, for all the accounts you are a member of. | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | === sreport examples === | ||
+ | |||
+ | Here are some examples to give you a starting point: | ||
+ | |||
+ | To get the number of jobs you ran (you <=> '' | ||
+ | |||
+ | <code console> | ||
+ | [brero@login2 ~]$ sreport job sizesbyaccount user=$USER PrintJobCount start=2018-01-01 end=2019-01-01 | ||
+ | |||
+ | -------------------------------------------------------------------------------- | ||
+ | Job Sizes 2018-01-01T00: | ||
+ | Units are in number of jobs ran | ||
+ | -------------------------------------------------------------------------------- | ||
+ | Cluster | ||
+ | --------- --------- ------------- ------------- ------------- ------------- ------------- ------------ | ||
+ | | ||
+ | </ | ||
+ | |||
+ | You can see how many jobs were run (grouped by allocated CPU). You can also see we specified an extra day for the //end date// '' | ||
+ | < | ||
+ | |||
+ | You can also check how much CPU time (in seconds) you have used on the cluster since 2019-09-01: | ||
+ | |||
+ | <code console> | ||
+ | [brero@login2 ~]$ sreport cluster AccountUtilizationByUser user=$USER start=2019-09-01 -t Seconds | ||
+ | -------------------------------------------------------------------------------- | ||
+ | Cluster/ | ||
+ | Usage reported in CPU Seconds | ||
+ | -------------------------------------------------------------------------------- | ||
+ | Cluster | ||
+ | --------- --------------- --------- --------------- -------- -------- | ||
+ | | ||
+ | </ | ||
+ | |||
+ | In this example, we added the time '' | ||
+ | |||
+ | Please note: | ||
+ | * By default, the CPU time is reported in minutes | ||
+ | * It takes up to an hour for Slurm to update this information in its database, so be patient | ||
+ | * If you don't specify a start, nor an end date, yesterday' | ||
+ | * The CPU time is the time that was allocated to you. It doesn' | ||
+ | |||
+ | Tip: If you absolutely need a report that includes jobs that ran today, you can override the default end date by forcing tomorrow' | ||
+ | |||
+ | < | ||
+ | sreport cluster AccountUtilizationByUser user=$USER start=2019-09-01 end=$(date --date=" | ||
+ | </ | ||
+ | |||
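The tip above amounts to computing tomorrow's date and passing it as `end=`. A minimal Python sketch, assuming you only want to build the command line: the helper name `sreport_usage_command` and its parameters are hypothetical, while the `sreport` arguments are the ones already used in the examples on this page.

```python
from datetime import date, timedelta


def sreport_usage_command(user, start, today=None):
    """Build an sreport command whose end date is tomorrow.

    `today` can be injected for testing; it defaults to the current date.
    The command is returned as a list of arguments and is not executed.
    """
    if today is None:
        today = date.today()
    # Use tomorrow as the end date so that today's jobs are included.
    tomorrow = today + timedelta(days=1)
    return [
        "sreport", "cluster", "AccountUtilizationByUser",
        f"user={user}",
        f"start={start}",
        f"end={tomorrow.isoformat()}",  # YYYY-MM-DD
    ]


# Example with a fixed reference date and a placeholder user name.
cmd = sreport_usage_command("alice", "2019-09-01", today=date(2019, 12, 31))
```

The resulting list can be passed to `subprocess.run` on a login node where `sreport` is available; keep in mind the delay of up to an hour before Slurm updates its accounting database.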
hpc/accounting.1699626583.txt.gz · Last modified: 2023/11/10 14:29 by Yann Sagon