You can find the whole table that you can send to the FNS {{:hpc:hpc:acrobat_2024-04-09_15-58-28.png?linkonly|here}}.

Users of a given PI are entitled to 100k CPU hours per year free of charge (per PI, not per user). See [[hpc:accounting|how to check PI and user past usage]].
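To see how much of this allowance a group has already consumed, you can also query Slurm's accounting database directly. The sketch below is a minimal example (the account name, cluster name, and date range are placeholders; the [[hpc:accounting|accounting page]] remains the reference):

<code bash>
# Aggregated CPU usage per user for a PI's account (placeholders: account, cluster, dates)
sreport -t hours cluster AccountUtilizationByUser Clusters=baobab \
    Accounts=<PI_NAME> Start=2025-01-01 End=2025-12-31
</code>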
==== Cost of Renting a Compute Node ====
  
The cost of renting a compute node is calculated based on the vendor price of the node, adjusted to account for operational and infrastructure expenses. Specifically, we add 15% to the vendor price to cover additional costs, such as maintenance and administrative overhead. The total cost is then amortized over an estimated 5-year lifespan of the compute node to determine the monthly rental rate.
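As an illustration with a hypothetical vendor price (not an actual quote), the monthly rate works out as follows:

<code bash>
# Hypothetical example: vendor price of 15'000 CHF (not a real quote)
vendor_price=15000

# Add the 15% operational surcharge, then amortize over 5 years (60 months)
total_cost=$(echo "$vendor_price * 1.15" | bc -l)
monthly_rate=$(echo "$total_cost / 60" | bc -l)

printf 'Total cost: %.2f CHF, monthly rental rate: %.2f CHF\n' "$total_cost" "$monthly_rate"
# -> Total cost: 17250.00 CHF, monthly rental rate: 287.50 CHF
</code>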
Users are entitled to utilize up to 60% of the computational resources they own or rent within the cluster. For example, if you rent a compute node with 128 CPU cores for one year, you will receive a total credit of **128 (cores) × 24 (hours) × 365 (days) × 0.6 (max usage rate) = 672,768 core-hours**. This credit can be used across any of our three clusters -- Bamboo, Baobab, and Yggdrasil -- regardless of where the compute node was rented or purchased.
  
When using your own resources, the key advantage is a higher scheduling priority, ensuring quicker access to computational resources. In addition, you are not restricted to your private nodes: you can use any of the three clusters, including the GPUs.

We are developing scripts that will let you check your usage and the number of hours your group is entitled to use, based on the hardware it owns.

For more details, please contact the HPC support team.

===== Purchasing or Renting Private Compute Nodes =====

Research groups have the option to purchase or rent "private" compute nodes to expand the resources available in our clusters. This arrangement provides the group with a **private partition**, granting higher priority access to the specified nodes (resulting in reduced wait times) and extended job runtimes of up to **7 days** (compared to 4 days for public compute nodes).
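As an illustration, once the HPC team has set up your group's private partition, a submission script could look like the sketch below (the partition, account, and program names are placeholders, to be replaced by the ones communicated for your group):

<code bash>
#!/bin/bash
#SBATCH --job-name=private-node-job
#SBATCH --partition=private-<GROUP>-cpu   # placeholder: your group's private partition
#SBATCH --account=<PI_NAME>               # placeholder: your PI's account
#SBATCH --time=7-00:00:00                 # up to 7 days on a private partition (4 days on public ones)
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G

srun ./my_simulation                      # placeholder program
</code>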

==== Key Rules and Details ====

  * **Shared Integration**: The compute node is added to the corresponding shared partition. Other users may utilize it when the owning group is not using it. For details, refer to the [[hpc/slurm#partitions|partitions]] section.
  * **Maximum Usage**: Research groups can utilize up to **60% of the node's maximum theoretical computational capacity**. This ensures fair access to shared resources. See [[hpc:hpc_clusters#usage_limit|Usage limit]].
  * **Cost**: In addition to the base cost of the compute node, a **15% surcharge** is applied to cover operational expenses such as cables, racks, switches, and storage.
  * **Ownership Period**: The compute node remains the property of the research group for **5 years**. After this period, the node may remain in production but will only be accessible via public and shared partitions.
  * **Warranty and Repairs**: Nodes come with a **3-year warranty**. If a node fails after this period, the research group is responsible for **100% of the repair costs**. Repairing the node involves sending it to the vendor for diagnostics and a quote, with a maximum diagnostic fee of **420 CHF**, even if the node turns out to be irreparable.
  * **Administrative Access**: The research group does not have administrative rights over the node.
  * **Maintenance**: The HPC team handles the installation and maintenance of the compute node, ensuring it operates consistently with the other nodes in the cluster.
  * **Decommissioning**: The HPC team may decommission the node when it becomes obsolete, but it will remain in production for at least **5 years**.

==== CPU and GPU server example pricing ====
  
  
See below for the current price of a compute node (without the extra 15% and without VAT).
  * ~ 14'442.55 CHF TTC
  
  * 2 x 96 Core AMD EPYC 9754 2.4GHz Processor
  * 768GB DDR5 4800MHz Memory (24x32GB)
  * 100G IB EDR card
  * 960GB SSD
  * ~ 16'464 CHF TTC

Key differences:
  * + 9754 has higher memory performance of up to 460.8 GB/s vs 190.73 GB/s for the 7763 (see the quick check after this list)
  * + 9754 has a bigger cache
  * - 9754 is more expensive
  * - power consumption is 400W for the 9754 vs 240W for the 7763
  * - 9754 is more difficult to cool, as the inlet temperature for air cooling must be 22°C max
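The bandwidth figures above can be sanity-checked with a back-of-the-envelope calculation, assuming 12 DDR5-4800 memory channels for the EPYC 9754, 8 DDR4-3200 channels for the EPYC 7763, and 8 bytes per transfer per channel:

<code bash>
# Peak theoretical memory bandwidth = channels x transfer rate (MT/s) x 8 bytes
echo "EPYC 9754: $(echo "scale=1; 12 * 4800 * 8 / 1000" | bc) GB/s"   # 460.8 GB/s
echo "EPYC 7763: $(echo "scale=1;  8 * 3200 * 8 / 1000" | bc) GB/s"   # 204.8 GB/s, i.e. ~190.7 GiB/s as quoted above
</code>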
=== GPU H100 with AMD ===
  
  
If you want to request a financial contribution from UNIGE, you must complete a COINF application: https://www.unige.ch/rectorat/commissions/coinf/appel-a-projets

====== Use Baobab for teaching ======

Baobab, our HPC infrastructure, supports educators in providing students with hands-on HPC experience.

Teachers can request access via [dw.unige.ch](final link to be added later, use hpc@unige.ch in the meantime). Once the request is fulfilled, a special account named <PI_NAME>_teach will be created for the instructor. Students must specify this account when submitting jobs for course-related work (e.g., <nowiki>--account=<PI_NAME>_teach</nowiki>).
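For example, a student's batch script for course work could look like the following sketch (the partition and program names are only illustrative; the ''--account'' line is the part required by this policy):

<code bash>
#!/bin/bash
#SBATCH --job-name=course-exercise
#SBATCH --account=<PI_NAME>_teach   # teaching account created for the instructor
#SBATCH --partition=shared-cpu      # illustrative partition, adapt to the course needs
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=1

srun python exercise.py             # illustrative course script
</code>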

Optionally, a shared storage space can also be created, accessible at ''/home/share/<PI_NAME>_teach'' and/or ''/srv/beegfs/scratch/shares/<PI_NAME>_teach''.

**All student usage is free of charge if they submit their jobs to the correct account**.

We strongly recommend that teachers use and promote our user-friendly web portal, [[hpc:how_to_use_openondemand|OpenOnDemand]], which supports tools like MATLAB, JupyterLab, and more. Baobab helps integrate real-world computational tools into curricula, fostering deeper learning in HPC technologies.
  
====== How do I use your clusters? ======
| V8         | EPYC-7742  | 2.25GHz | 128 cores | "Rome" (7 nm)  | cpu[001-043],gpu[001-002] | 512GB |         | on prod |
| V10        | EPYC-72F3  | 3.7GHz  | 16 cores  | "Milan" (7 nm) | cpu[044-045]              | 1TB   | BIG_MEM | on prod |
| V10        | EPYC-7763  | 2.45GHz | 128 cores | "Milan" (7 nm) | cpu[046-048]              | 512GB |         | on prod |
| V8         | EPYC-7302P | 3.0GHz  | 16 cores  | "Rome" (7 nm)  | gpu003                    | 512GB |         | on prod |
=== GPUs on Bamboo ===
| V3         | E5-2660V0 | 2.20GHz | 16 cores | "Sandy Bridge-EP" (32 nm) | cpu[009-010,012-018,020-025,029-044]                      |                | decommissioned in 2023 |
| V3         | E5-2660V0 | 2.20GHz | 16 cores | "Sandy Bridge-EP" (32 nm) | cpu[011,019,026-028,042]                                  |                | decommissioned in 2024 |
| V3         | E5-2660V0 | 2.20GHz | 16 cores | "Sandy Bridge-EP" (32 nm) | cpu[001-005,007-008,045-056,058]                          |                | decommissioned in 2024 |
| V3         | E5-2670V0 | 2.60GHz | 16 cores | "Sandy Bridge-EP" (32 nm) | cpu[059,061-062]                                          |                | decommissioned in 2024 |
| V3         | E5-4640V0 | 2.40GHz | 32 cores | "Sandy Bridge-EP" (32 nm) | cpu[186]                                                  |                | decommissioned in 2024 |
| V4         | E5-2650V2 | 2.60GHz | 16 cores | "Ivy Bridge-EP" (22 nm)   | cpu[063-066,154-172]                                      |                | decommissioned in 2025 |
| V5         | E5-2643V3 | 3.40GHz | 12 cores | "Haswell-EP" (22 nm)      | gpu[002]                                                  |                | on prod                |
| V6         | E5-2630V4 | 2.20GHz | 20 cores | "Broadwell-EP" (14 nm)    | cpu[173-185,187-201,205-213,220-229,237-264],gpu[004-010] |                | on prod                |
| V6         | E5-2637V4 | 3.50GHz | 8 cores  | "Broadwell-EP" (14 nm)    | cpu[218-219]                                              | HIGH_FREQUENCY | on prod                |
| V6         | E5-2643V4 | 3.40GHz | 12 cores | "Broadwell-EP" (14 nm)    | cpu[202,204,216-217]                                      | HIGH_FREQUENCY | on prod                |
| V7         | EPYC-7601 | 2.20GHz | 64 cores  | "Naples" (14 nm)       | gpu[011]                                                  |                | on prod                |
| V8         | EPYC-7742 | 2.25GHz | 128 cores | "Rome" (7 nm)          | cpu[273-277,285-307,312-335],gpu[013-046]                 |                | on prod                |
| V9         | GOLD-6240 | 2.60GHz | 36 cores  | "Cascade Lake" (14 nm) | cpu[084-090,265-272,278-284,308-311]                      |                | on prod                |
| V10        | EPYC-7763 | 2.45GHz | 128 cores | "Milan" (7 nm)         | gpu[047,048]                                              |                | on prod                |
| V11        | EPYC-9554 | 3.10GHz | 128 cores | "Genoa" (5 nm)         | gpu[049]                                                  |                | on prod                |
  
  