You can find the whole table that you can send to the FNS {{:hpc:hpc:acrobat_2024-04-09_15-58-28.png?linkonly|here}}.

Users of a given PI are entitled to 100k CPU hours per year free of charge (per PI, not per user). See [[hpc:accounting|how to check PI and user past usage]].
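To see how much of this allowance a group has already consumed, you can also query Slurm's accounting database directly. The sketch below is a minimal example (the account name, cluster name, and date range are placeholders; the [[hpc:accounting|accounting page]] remains the reference):

<code bash>
# Aggregated CPU usage per user for a PI's account (placeholders: account, cluster, dates)
sreport -t hours cluster AccountUtilizationByUser Clusters=baobab \
    Accounts=<PI_NAME> Start=2025-01-01 End=2025-12-31
</code>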
==== Cost of Renting a Compute Node ====
  
The cost of renting a compute node is calculated based on the vendor price of the node, adjusted to account for operational and infrastructure expenses. Specifically, we add 15% to the vendor price to cover additional costs, such as maintenance and administrative overhead. The total cost is then amortized over an estimated 5-year lifespan of the compute node to determine the monthly rental rate.
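As an illustration with a hypothetical vendor price (not an actual quote), the monthly rate works out as follows:

<code bash>
# Hypothetical example: vendor price of 15'000 CHF (not a real quote)
vendor_price=15000

# Add the 15% operational surcharge, then amortize over 5 years (60 months)
total_cost=$(echo "$vendor_price * 1.15" | bc -l)
monthly_rate=$(echo "$total_cost / 60" | bc -l)

printf 'Total cost: %.2f CHF, monthly rental rate: %.2f CHF\n' "$total_cost" "$monthly_rate"
# -> Total cost: 17250.00 CHF, monthly rental rate: 287.50 CHF
</code>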
Users are entitled to utilize up to 60% of the computational resources they own or rent within the cluster. For example, if you rent a compute node with 128 CPU cores for one year, you will receive a total credit of **128 (cores) × 24 (hours) × 365 (days) × 0.6 (max usage rate) = 672,768 core-hours**. This credit can be used across any of our three clusters -- Bamboo, Baobab, and Yggdrasil -- regardless of where the compute node was rented or purchased.
  
When using your own resources, the key advantage is a higher scheduling priority, ensuring quicker access to computational resources. In addition, you are not restricted to your private nodes: you can use any of the three clusters, including the GPUs.

We are developing scripts that will let you check your usage and the number of hours your group is entitled to use, based on the hardware it owns.

For more details, please contact the HPC support team.

===== Purchasing or Renting Private Compute Nodes =====

Research groups have the option to purchase or rent "private" compute nodes to expand the resources available in our clusters. This arrangement provides the group with a **private partition**, granting higher priority access to the specified nodes (resulting in reduced wait times) and extended job runtimes of up to **7 days** (compared to 4 days for public compute nodes).
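As an illustration, once the HPC team has set up your group's private partition, a submission script could look like the sketch below (the partition, account, and program names are placeholders, to be replaced by the ones communicated for your group):

<code bash>
#!/bin/bash
#SBATCH --job-name=private-node-job
#SBATCH --partition=private-<GROUP>-cpu   # placeholder: your group's private partition
#SBATCH --account=<PI_NAME>               # placeholder: your PI's account
#SBATCH --time=7-00:00:00                 # up to 7 days on a private partition (4 days on public ones)
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G

srun ./my_simulation                      # placeholder program
</code>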

==== Key Rules and Details ====

  * **Shared Integration**: The compute node is added to the corresponding shared partition. Other users may utilize it when the owning group is not using it. For details, refer to the [[hpc/slurm#partitions|partitions]] section.
  * **Maximum Usage**: Research groups can utilize up to **60% of the node's maximum theoretical computational capacity**. This ensures fair access to shared resources. See [[hpc:hpc_clusters#usage_limit|Usage limit]].
  * **Cost**: In addition to the base cost of the compute node, a **15% surcharge** is applied to cover operational expenses such as cables, racks, switches, and storage.
  * **Ownership Period**: The compute node remains the property of the research group for **5 years**. After this period, the node may remain in production but will only be accessible via public and shared partitions.
  * **Warranty and Repairs**: Nodes come with a **3-year warranty**. If a node fails after this period, the research group is responsible for **100% of the repair costs**. Repairing the node involves sending it to the vendor for diagnostics and a quote, with a maximum diagnostic fee of **420 CHF**, even if the node turns out to be irreparable.
  * **Administrative Access**: The research group does not have administrative rights over the node.
  * **Maintenance**: The HPC team handles the installation and maintenance of the compute node, ensuring it operates consistently with the other nodes in the cluster.
  * **Decommissioning**: The HPC team may decommission the node when it becomes obsolete, but it will remain in production for at least **5 years**.

==== CPU and GPU server example pricing ====
  
  
See below for the current price of a compute node (without the extra 15% and without VAT).
  * ~ 14'442.55 CHF TTC
  
  * 2 x 96 Core AMD EPYC 9754 2.4GHz Processor
  * 768GB DDR5 4800MHz Memory (24x32GB)
  * 100G IB EDR card
  * 960GB SSD
  * ~ 16'464 CHF TTC

Key differences:
  * + 9754 has higher memory performance of up to 460.8 GB/s vs 190.73 GB/s for the 7763 (see the quick check after this list)
  * + 9754 has a bigger cache
  * - 9754 is more expensive
  * - power consumption is 400W for the 9754 vs 240W for the 7763
  * - 9754 is more difficult to cool, as the inlet temperature for air cooling must be 22°C max
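The bandwidth figures above can be sanity-checked with a back-of-the-envelope calculation, assuming 12 DDR5-4800 memory channels for the EPYC 9754, 8 DDR4-3200 channels for the EPYC 7763, and 8 bytes per transfer per channel:

<code bash>
# Peak theoretical memory bandwidth = channels x transfer rate (MT/s) x 8 bytes
echo "EPYC 9754: $(echo "scale=1; 12 * 4800 * 8 / 1000" | bc) GB/s"   # 460.8 GB/s
echo "EPYC 7763: $(echo "scale=1;  8 * 3200 * 8 / 1000" | bc) GB/s"   # 204.8 GB/s, i.e. ~190.7 GiB/s as quoted above
</code>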
=== GPU H100 with AMD ===
  
  
If you want to request a financial contribution from UNIGE, you must complete a COINF application: https://www.unige.ch/rectorat/commissions/coinf/appel-a-projets

====== Use Baobab for teaching ======

Baobab, our HPC infrastructure, supports educators in providing students with hands-on HPC experience.

Teachers can request access via [dw.unige.ch](final link to be added later, use hpc@unige.ch in the meantime). Once the request is fulfilled, a special account named <PI_NAME>_teach will be created for the instructor. Students must specify this account when submitting jobs for course-related work (e.g., <nowiki>--account=<PI_NAME>_teach</nowiki>).
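For example, a student's batch script for course work could look like the following sketch (the partition and program names are only illustrative; the ''--account'' line is the part required by this policy):

<code bash>
#!/bin/bash
#SBATCH --job-name=course-exercise
#SBATCH --account=<PI_NAME>_teach   # teaching account created for the instructor
#SBATCH --partition=shared-cpu      # illustrative partition, adapt to the course needs
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=1

srun python exercise.py             # illustrative course script
</code>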

Optionally, a shared storage space can also be created, accessible at ''/home/share/<PI_NAME>_teach'' and/or ''/srv/beegfs/scratch/shares/<PI_NAME>_teach''.

**All student usage is free of charge if they submit their jobs to the correct account**.

We strongly recommend that teachers use and promote our user-friendly web portal, [[hpc:how_to_use_openondemand|OpenOnDemand]], which supports tools like MATLAB, JupyterLab, and more. Baobab helps integrate real-world computational tools into curricula, fostering deeper learning in HPC technologies.
  
====== How do I use your clusters? ======
| V8         | EPYC-7742  | 2.25GHz | 128 cores | "Rome" (7 nm)  | cpu[001-043],gpu[001-002] | 512GB |         | on prod |
| V10        | EPYC-72F3  | 3.7GHz  | 16 cores  | "Milan" (7 nm) | cpu[044-045]              | 1TB   | BIG_MEM | on prod |
| V10        | EPYC-7763  | 2.45GHz | 128 cores | "Milan" (7 nm) | cpu[046-048]              | 512GB |         | on prod |
| V8         | EPYC-7302P | 3.0GHz  | 16 cores  | "Rome" (7 nm)  | gpu003                    | 512GB |         | on prod |
=== GPUs on Bamboo ===
| V3         | E5-2660V0 | 2.20GHz | 16 cores | "Sandy Bridge-EP" (32 nm) | cpu[009-010,012-018,020-025,029-044]                      |                | decommissioned in 2023 |
| V3         | E5-2660V0 | 2.20GHz | 16 cores | "Sandy Bridge-EP" (32 nm) | cpu[011,019,026-028,042]                                  |                | decommissioned in 2024 |
| V3         | E5-2660V0 | 2.20GHz | 16 cores | "Sandy Bridge-EP" (32 nm) | cpu[001-005,007-008,045-056,058]                          |                | decommissioned in 2024 |
| V3         | E5-2670V0 | 2.60GHz | 16 cores | "Sandy Bridge-EP" (32 nm) | cpu[059,061-062]                                          |                | decommissioned in 2024 |
| V3         | E5-4640V0 | 2.40GHz | 32 cores | "Sandy Bridge-EP" (32 nm) | cpu[186]                                                  |                | decommissioned in 2024 |
| V4         | E5-2650V2 | 2.60GHz | 16 cores | "Ivy Bridge-EP" (22 nm)   | cpu[063-066,154-172]                                      |                | decommissioned in 2025 |
| V5         | E5-2643V3 | 3.40GHz | 12 cores | "Haswell-EP" (22 nm)      | gpu[002]                                                  |                | on prod                |
| V6         | E5-2630V4 | 2.20GHz | 20 cores | "Broadwell-EP" (14 nm)    | cpu[173-185,187-201,205-213,220-229,237-264],gpu[004-010] |                | on prod                |
| V6         | E5-2637V4 | 3.50GHz | 8 cores  | "Broadwell-EP" (14 nm)    | cpu[218-219]                                              | HIGH_FREQUENCY | on prod                |
| V6         | E5-2643V4 | 3.40GHz | 12 cores | "Broadwell-EP" (14 nm)    | cpu[202,204,216-217]                                      | HIGH_FREQUENCY | on prod                |
| V7         | EPYC-7601 | 2.20GHz | 64 cores  | "Naples" (14 nm)       | gpu[011]                                                  |                | on prod                |
| V8         | EPYC-7742 | 2.25GHz | 128 cores | "Rome" (7 nm)          | cpu[273-277,285-307,312-335],gpu[013-046]                 |                | on prod                |
| V9         | GOLD-6240 | 2.60GHz | 36 cores  | "Cascade Lake" (14 nm) | cpu[084-090,265-272,278-284,308-311]                      |                | on prod                |
| V10        | EPYC-7763 | 2.45GHz | 128 cores | "Milan" (7 nm)         | gpu[047,048]                                              |                | on prod                |
| V11        | EPYC-9554 | 3.10GHz | 128 cores | "Genoa" (5 nm)         | gpu[049]                                                  |                | on prod                |
  
  