{{METATOC 1-5}}

====== How our clusters work ======
  
  
^ cluster name ^ datacentre ^ Interconnect   ^ public CPU ^ public GPU ^ Total CPU size ^ Total GPU size ^
| Baobab       | Dufour     | IB 40Gb/s QDR  | ~900       | 0          | ~9'736         | 271            |
| Yggdrasil    | Astro      | IB 100Gb/s EDR | ~3000      | 44         | ~8'228         | 52             |
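A quick way to see how this hardware is exposed once you are connected to a login node is to query Slurm directly. This is only a generic sketch: the partition and node names reported by the commands depend on the cluster you are logged in to, and ''node001'' below is a placeholder.

<code bash>
# List partitions with their node count, CPUs per node, memory and
# generic resources (GPUs): partition / nodes / CPUs / memory / GRES.
sinfo -o "%P %D %c %m %G"

# Show the hardware details Slurm knows about one specific node
# (replace node001 with a node name taken from the sinfo output).
scontrol show node node001
</code>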
  
  
  
All these servers (login, compute, management and storage nodes):
  * run the GNU/Linux distribution [[https://rockylinux.org/|Rocky Linux]].
  * are interconnected on a high speed InfiniBand network:
    * 40Gbit/s (QDR) for Baobab.
that will use only CPU or GPU nodes.
  
===== Cost model =====

<note important>**Important update, draft preview.**
We are currently changing the investment approach for the HPC service Baobab: research groups will no longer purchase physical nodes as their property. Instead, they will have the option to pay for a share and a duration of usage. This new approach offers several advantages for both the research groups and us as the service provider.

For research groups, the main advantage is increased flexibility. They can tailor their investment to the specific needs of their projects and scale their usage as required. This removes the constraints of owning physical nodes and allows for a more efficient allocation of resources.

As the service provider, we also benefit from this model. We can purchase compute nodes and hardware based on the actual demand from research groups, so that our investments align with their usage patterns. This lets us optimise resource utilisation and make timely acquisitions when needed.

For research groups that have already purchased compute nodes, we offer the opportunity to convert their ownership into credits for shares. We estimate that a compute node typically lasts at least 6 years under normal conditions, so this conversion ensures that the value of their existing investment is not lost.
</note>

==== Price per hour ====
Overview:
{{:hpc:pasted:20240404-092421.png}}

You can find the complete price table, which you can send to the FNS, {{:hpc:hpc:acrobat_2024-04-09_15-58-28.png?linkonly|here}}.
 + 
==== Private nodes ====
  
Research groups can buy "private" nodes to add to our clusters, which means their research group has a
Rules:
  * The compute node remains the research group's property
  * There is a three-year warranty on the compute node. If it fails after the warranty period, 100% of the repair costs are the responsibility of the research group. If the node is out of order, you have the option to have it repaired: to obtain a quote, we have to send the compute node to the vendor, which charges up to 420 CHF for the initial diagnostic and quote, even if the node turns out to be unrepairable (worst case).
  * The research group does not have admin rights on it
  * The compute node is installed and maintained by the HPC team in the same way as the other compute nodes
  * The HPC team can decide to decommission the node when it is too old, but the hardware will stay in production for at least four years
  
See the [[hpc/slurm#partitions|partitions]] section for more details about the integration of your private node in the cluster.
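For illustration, submitting to a private node works like any other Slurm job: only the partition name changes. The sketch below uses a hypothetical partition name ''private-mygroup-cpu''; check the partitions section linked above for the actual name of your group's partition.

<code bash>
#!/bin/bash
#SBATCH --job-name=private-node-test
#SBATCH --partition=private-mygroup-cpu   # hypothetical name: use your group's private partition
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# Print which node the job actually landed on.
srun hostname
</code>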
  
See [[hpc::slurm#gpgpu_jobs|here]] how to request a GPU for your jobs.
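As a rough illustration of what such a request looks like, the sketch below asks Slurm for one GPU. The partition name ''shared-gpu'' and the use of the "Slurm resource" values from the GPU tables on this page (e.g. ''ampere'', ''turing'') as GRES types are assumptions here; the linked page is authoritative.

<code bash>
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=shared-gpu      # assumed GPU partition name, see the linked page
#SBATCH --gres=gpu:ampere:1         # one GPU of the "ampere" Slurm resource type (assumed GRES name)
#SBATCH --time=00:30:00

# Check which GPU was actually allocated to the job.
srun nvidia-smi --query-gpu=name,memory.total --format=csv
</code>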


==== Bamboo (coming soon) ====

^ Generation ^ Model     ^ Freq    ^ Nb cores  ^ Architecture  ^ Nodes         ^ Memory ^ Extra flag ^ Status          ^
| V8         | EPYC-7742 | 2.25GHz | 128 cores | "Rome" (7 nm) | node[001-043] | 512GB  |            | to be installed |
| V8         | EPYC-72F3 | 3.7GHz  | 16 cores  | "Rome" (7 nm) | node[044-045] | 1TB    | BIG_MEM    | to be installed |


^ GPU model ^ Architecture ^ Mem  ^ Compute Capability ^ Slurm resource ^ Nb per node ^ Nodes        ^ Peer access between GPUs ^
| RTX 3090  | Ampere       | 25GB | 8.6                | ampere         | 8           | gpu[001,002] | NO                       |
| A100      | Ampere       | 80GB | 8.0                | ampere         | 4           | gpu[003]     | YES                      |
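The "Peer access between GPUs" column indicates whether the GPUs of a node can exchange data directly (peer-to-peer) rather than through host memory. If you want to verify this yourself, ''nvidia-smi'' can print the interconnect topology from inside a job that was granted several GPUs on the same node; this is a generic check, not specific to Bamboo.

<code bash>
# From inside a multi-GPU allocation, print the GPU interconnect topology.
# NVLink entries (NV#) indicate links that support direct peer-to-peer access,
# whereas paths crossing the host bridge (PHB/SYS) generally do not.
srun nvidia-smi topo -m
</code>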
 +
  
==== Baobab ====
Since our clusters are regularly expanded, the nodes are not all from the same generation. You can see the details in the following table.
  
^ Generation ^ Model     ^ Freq    ^ Nb cores  ^ Architecture              ^ Nodes                                             ^ Extra flag     ^ Status                       ^
| V2         | X5650     | 2.67GHz | 12 cores  | "Westmere-EP" (32 nm)     | node[093-101,103-111,140-153]                     |                | decommissioned               |
| V3         | E5-2660V0 | 2.20GHz | 16 cores  | "Sandy Bridge-EP" (32 nm) | node[009-010,012-018,020-025,029-044]             |                | decommissioned in 2023       |
| V3         | E5-2660V0 | 2.20GHz | 16 cores  | "Sandy Bridge-EP" (32 nm) | node[001-005,007-008,011,019,026-028,045-056,058] |                | to be decommissioned in 2022 |
| V3         | E5-2670V0 | 2.60GHz | 16 cores  | "Sandy Bridge-EP" (32 nm) | node[059,061-062]                                 |                | to be decommissioned in 2022 |
| V3         | E5-4640V0 | 2.40GHz | 32 cores  | "Sandy Bridge-EP" (32 nm) | node[186]                                         |                | to be decommissioned in 2022 |
| V4         | E5-2650V2 | 2.60GHz | 16 cores  | "Ivy Bridge-EP" (22 nm)   | node[063-066,154-172]                             |                | to be decommissioned in 2022 |
| V5         | E5-2643V3 | 3.40GHz | 12 cores  | "Haswell-EP" (22 nm)      | gpu[002,012]                                      |                | on prod                      |
| V6         | E5-2630V4 | 2.20GHz | 20 cores  | "Broadwell-EP" (14 nm)    | node[173-185,187-201,205-213]                     |                | on prod                      |
| :::        | :::       | :::     | :::       | :::                       | gpu[004-010]                                      | :::            | on prod                      |
| V6         | E5-2637V4 | 3.50GHz | 8 cores   | "Broadwell-EP" (14 nm)    | node[218-219]                                     | HIGH_FREQUENCY | on prod                      |
| V6         | E5-2643V4 | 3.40GHz | 12 cores  | "Broadwell-EP" (14 nm)    | node[202,204,216-217]                             | HIGH_FREQUENCY | on prod                      |
| V6         | E5-2680V4 | 2.40GHz | 28 cores  | "Broadwell-EP" (14 nm)    | node[203]                                         |                | on prod                      |
| V7         | EPYC-7601 | 2.20GHz | 64 cores  | "Naples" (14 nm)          | gpu[011]                                          |                | on prod                      |
| V8         | EPYC-7742 | 2.25GHz | 128 cores | "Rome" (7 nm)             | node[273-277,285-288,312-335] gpu[013-031]        |                | on prod                      |
| V9         | GOLD-6240 | 2.60GHz | 36 cores  | "Cascade Lake" (14 nm)    | node[265-272]                                     |                | on prod                      |
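The "Extra flag" column can be used to target a specific kind of node when submitting a job, assuming these flags are exposed as Slurm node features (the usual way such flags are implemented). A minimal sketch:

<code bash>
# Ask for a high-frequency node via a feature constraint (see the Extra flag column).
# "./my_app" is a placeholder for your own executable.
sbatch --constraint=HIGH_FREQUENCY --cpus-per-task=8 --time=01:00:00 --wrap="srun ./my_app"

# List the features Slurm actually advertises for each node.
sinfo -o "%N %f"
</code>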
  
  
| RTX 2080 Ti | Turing | 11GB | 7.5 | turing | 8 | gpu[012-016]                 |
| RTX 2080 Ti | Turing | 11GB | 7.5 | turing | 4 | gpu[018-019]                 |
| RTX 3090    | Ampere | 25GB | 8.6 | ampere | 8 | gpu[017,021,025-026,034-035] |
| RTX A5000   | Ampere | 25GB | 8.6 | ampere | 8 | gpu[044]                     |
| RTX 3080    | Ampere | 10GB | 8.6 | ampere | 8 | gpu[023-024,036-043]         |
| A100        | Ampere | 40GB | 8.0 | ampere | 6 | gpu[022]                     |
| A100        | Ampere | 40GB | 8.0 | ampere | 1 | gpu[028]                     |
| A100        | Ampere | 40GB | 8.0 | ampere | 4 | gpu[020,030-031]             |
| A100        | Ampere | 80GB | 8.0 | ampere | 4 | gpu[029]                     |
| A100        | Ampere | 80GB | 8.0 | ampere | 3 | gpu[032-033]                 |
| A100        | Ampere | 80GB | 8.0 | ampere | 2 | gpu[045]                     |
      
  