All those servers (login, compute, management and storage nodes):
  * run with the GNU/Linux distribution [[https://rockylinux.org/|Rocky Linux]].
  * are inter-connected on a high-speed InfiniBand network
    * 40Gbit/s (QDR) for Baobab.
that will use only CPU or GPU nodes.
  
===== Cost model =====
  
<note important>**Important update, draft preview.**
In cases where research groups have already purchased compute nodes, we offer them the opportunity to convert their ownership into credits for shares. We estimate that a compute node typically lasts for at least 6 years under normal conditions, and this conversion option ensures that the value of their existing investment is not lost.
</note>

==== Price per hour ====
Overview:
{{:hpc:pasted:20240404-092421.png}}

You can find the full price table, which you can send to the FNS, {{:hpc:hpc:acrobat_2024-04-09_15-58-28.png?linkonly|here}}.
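
To estimate what your past usage would cost at these rates, you can read your consumed hours from the Slurm accounting database and multiply them by the per-hour price from the table above. A minimal sketch (the cluster name, user and date range are placeholders):

<code bash>
# Report the CPU hours consumed by your user over a given period.
# Replace the cluster name and dates with your own values.
sreport cluster AccountUtilizationByUser cluster=baobab \
        user=$USER start=2024-01-01 end=2024-03-31 -t Hours

# Estimated cost = reported hours x per-hour rate taken from the price table above.
</code>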

==== Private nodes ====
  
Research groups can buy "private" nodes to add to our clusters, which means their research group has a
Rules:
  * The compute node remains the property of the research group
  * The compute node has a three-year warranty. If it fails after the warranty expires, 100% of the repair costs are the responsibility of the research group. If the node is out of order, you can still have it repaired: to obtain a quote, the node must be sent to the vendor, who charges up to 420 CHF for the diagnostic and the quote, even if the node cannot be repaired (worst case).
  * The research group does not have administrative rights on the node
  * The compute node is installed and maintained by the HPC team in the same way as the other compute nodes
  * The HPC team can decide to decommission the node when it is too old, but the hardware will remain in production for at least four years

See the [[hpc/slurm#partitions|partitions]] section for more details about the integration of your private node in the cluster.
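
As a minimal sketch, a batch job targeting a private node could look like the following (the partition name ''private-mygroup-cpu'' is a placeholder; the actual name of your group's partition is listed in the partitions section):

<code bash>
#!/bin/bash
#SBATCH --job-name=private-node-test
#SBATCH --partition=private-mygroup-cpu   # placeholder: use your group's private partition name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00

# Print the node the job landed on, to verify it ran on your private hardware.
srun hostname
</code>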
  
See [[hpc::slurm#gpgpu_jobs|here]] how to request a GPU for your jobs.
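
As a hedged example, a job requesting a single GPU by its Slurm resource name (the ''shared-gpu'' partition name is an assumption; check the partitions section for the correct one) could look like this:

<code bash>
#!/bin/bash
#SBATCH --partition=shared-gpu        # assumed GPU partition name, adapt to your case
#SBATCH --gpus=ampere:1               # request one GPU of the "ampere" Slurm resource type
#SBATCH --time=00:30:00

# Show the GPU that was allocated to the job.
srun nvidia-smi
</code>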

==== Bamboo (coming soon) ====

^ Generation ^ Model     ^ Freq    ^ Nb cores  ^ Architecture  ^ Nodes         ^ Memory ^ Extra flag ^ Status          ^
| V8         | EPYC-7742 | 2.25GHz | 128 cores | "Rome" (7 nm) | node[001-043] | 512GB  |            | to be installed |
| V8         | EPYC-72F3 | 3.7GHz  | 16 cores  | "Rome" (7 nm) | node[044-045] | 1TB    | BIG_MEM    | to be installed |

^ GPU model ^ Architecture ^ Mem  ^ Compute Capability ^ Slurm resource ^ Nb per node ^ Nodes        ^ Peer access between GPUs ^
| RTX 3090  | Ampere       | 25GB | 8.6                | ampere         | 8           | gpu[001,002] | NO                       |
| A100      | Ampere       | 80GB | 8.0                | ampere         | 4           | gpu[003]     | YES                      |
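
Once Bamboo is in production, you can check which partitions, GPU resources and node features are actually exposed by Slurm directly from its login node. A small sketch:

<code bash>
# List partition, generic resources (GPUs), node list and node features.
sinfo -o "%P %G %N %f"
</code>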
  
==== Baobab ====
Since our clusters are regularly expanded, the nodes are not all from the same generation. You can see the details in the following table.
  
^ Generation ^ Model     ^ Freq    ^ Nb cores  ^ Architecture              ^ Nodes                                              ^ Extra flag     ^ Status                       |
| V2         | X5650     | 2.67GHz | 12 cores  | "Westmere-EP" (32 nm)     | node[093-101,103-111,140-153]                      |                | decommissioned               |
| V3         | E5-2660V0 | 2.20GHz | 16 cores  | "Sandy Bridge-EP" (32 nm) | node[009-010,012-018,020-025,029-044]              |                | decommissioned in 2023       |
| V3         | E5-2660V0 | 2.20GHz | 16 cores  | "Sandy Bridge-EP" (32 nm) | node[001-005,007-008,011,019,026-028,045-056,058]  |                | to be decommissioned in 2022 |
| V3         | E5-2670V0 | 2.60GHz | 16 cores  | "Sandy Bridge-EP" (32 nm) | node[059,061-062]                                  |                | to be decommissioned in 2022 |
| V3         | E5-4640V0 | 2.40GHz | 32 cores  | "Sandy Bridge-EP" (32 nm) | node[186]                                          |                | to be decommissioned in 2022 |
| V4         | E5-2650V2 | 2.60GHz | 16 cores  | "Ivy Bridge-EP" (22 nm)   | node[063-066,154-172]                              |                | to be decommissioned in 2022 |
| V5         | E5-2643V3 | 3.40GHz | 12 cores  | "Haswell-EP" (22 nm)      | gpu[002,012]                                       |                | on prod                      |
| V6         | E5-2630V4 | 2.20GHz | 20 cores  | "Broadwell-EP" (14 nm)    | node[173-185,187-201,205-213]                      |                | on prod                      |
| :::        | :::       | :::     | :::       | :::                       | gpu[004-010]                                       | :::            | on prod                      |
| V6         | E5-2637V4 | 3.50GHz | 8 cores   | "Broadwell-EP" (14 nm)    | node[218-219]                                      | HIGH_FREQUENCY | on prod                      |
| V6         | E5-2643V4 | 3.40GHz | 12 cores  | "Broadwell-EP" (14 nm)    | node[202,204,216-217]                              | HIGH_FREQUENCY | on prod                      |
| V6         | E5-2680V4 | 2.40GHz | 28 cores  | "Broadwell-EP" (14 nm)    | node[203]                                          |                | on prod                      |
| V7         | EPYC-7601 | 2.20GHz | 64 cores  | "Naples" (14 nm)          | gpu[011]                                           |                | on prod                      |
| V8         | EPYC-7742 | 2.25GHz | 128 cores | "Rome" (7 nm)             | node[273-277,285-288,312-335] gpu[013-031]         |                | on prod                      |
| V9         | GOLD-6240 | 2.60GHz | 36 cores  | "Cascade Lake" (14 nm)    | node[265-272]                                      |                | on prod                      |
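
Assuming the "Extra flag" column is exposed as a Slurm node feature (an assumption; see the Slurm documentation page for the authoritative list), the high-frequency nodes could be selected with a constraint, for example:

<code bash>
# Sketch: request a node advertising the HIGH_FREQUENCY feature.
sbatch --constraint=HIGH_FREQUENCY --ntasks=1 --time=00:10:00 --wrap "srun hostname"
</code>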
  
  
| RTX 2080 Ti | Turing | 11GB | 7.5 | turing | 8 | gpu[012-016]                 |
| RTX 2080 Ti | Turing | 11GB | 7.5 | turing | 4 | gpu[018-019]                 |
| RTX 3090    | Ampere | 25GB | 8.6 | ampere | 8 | gpu[017,021,025-026,034-035] |
| RTX A5000   | Ampere | 25GB | 8.6 | ampere | 8 | gpu[044]                     |
| RTX 3080    | Ampere | 10GB | 8.6 | ampere | 8 | gpu[023-024,036-043]         |
| A100 | Ampere | 40GB | 8.0 | ampere | 6 | gpu[022]         |
| A100 | Ampere | 40GB | 8.0 | ampere | 1 | gpu[028]         |
| A100 | Ampere | 40GB | 8.0 | ampere | 4 | gpu[020,030-031] |
| A100 | Ampere | 80GB | 8.0 | ampere | 4 | gpu[029]         |
| A100 | Ampere | 80GB | 8.0 | ampere | 3 | gpu[032-033]     |
| A100 | Ampere | 80GB | 8.0 | ampere | 2 | gpu[045]         |
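
To see which of the models listed above was actually allocated to a job, you can query the driver from inside the allocation. A small sketch (partition name assumed as above):

<code bash>
# Ask for one GPU interactively and print its model and memory size.
srun --partition=shared-gpu --gpus=1 --time=00:05:00 \
     nvidia-smi --query-gpu=name,memory.total --format=csv
</code>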
      
  