{{METATOC 1-5}}

====== How our clusters work ======

We expect HPC cluster users to know what an HPC cluster is and what parallel computing is. As all HPC clusters are different, it is important for users to have a general understanding of the clusters they are working on, what they offer and what their limitations are.

This section gives an overview of the technical HPC infrastructure and how things work at the University of Geneva. More details can be found in the corresponding sections of this documentation. The last part of this page gives more details for advanced users.

====== The clusters: Baobab and Yggdrasil ======

The University of Geneva owns two HPC clusters (supercomputers): **Baobab** and **Yggdrasil**. For now, they are two completely separate entities, each with its own private network, storage and login node. What is shared is the accounting (user accounts and job usage).

Choose the right cluster for your work:

  * You can use both clusters, but try to stick to one of them.
  * Use the cluster where the private partition you have access to is located.
  * If you need to access other servers not located in Astro, use Baobab to save bandwidth.
  * As your data are stored locally on each cluster, avoid using both clusters if this involves moving a lot of data between them.
  * Use Yggdrasil if you need newer CPUs, compute nodes with up to 1.5TB of memory, or Volta GPUs.

You can't submit jobs from one cluster to the other; this may become possible in the future.

Baobab is physically located at Uni Dufour in downtown Geneva, while Yggdrasil is located at the [[https://www.unige.ch/sciences/astro/en/about-us/astronomy-department/|Observatory of Geneva]] in Sauverny.

^ cluster name ^ datacentre ^ interconnect ^ public CPU cores ^ public GPUs ^ total CPU cores ^ total GPUs ^
| Baobab | Dufour | IB QDR 40Gb/s | ~900 | 0 | ~9'736 | 271 |
| Yggdrasil | Astro | IB EDR 100Gb/s | ~3000 | 44 | ~8'228 | 52 |

====== How do our clusters work? ======

===== Overview =====

Each cluster is composed of:

  * a **login node** (aka **headnode**) allowing users to connect and submit //jobs// to the cluster.
  * many **compute nodes** which provide the computing power. The compute nodes are not all identical; they all provide CPU cores (from 8 to 128 cores depending on the model), and some nodes also have GPUs or more RAM (see [[hpc/hpc_clusters#compute_nodes|below]]).
  * **management servers** that you don't need to worry about; that's the HPC engineers' job. The management servers provide the necessary services, such as all the applications (with EasyBuild / module), the Slurm job management and queuing system, ways for the HPC engineers to (re-)deploy compute nodes automatically, etc.
  * **BeeGFS** storage servers which provide a "fast" parallel file system for the data in your ''$HOME'' and for the scratch data (''$HOME/scratch'').

All those servers (login, compute, management and storage nodes):

  * run the GNU/Linux distribution [[https://rockylinux.org/|Rocky Linux]].
  * are interconnected through a high-speed InfiniBand network:
    * 40Gbit/s (QDR) for Baobab.
    * 100Gbit/s (EDR) for Yggdrasil.

{{ :hpc:interdc.png?direct&800 |}}

In order to provide hundreds of applications and versions, we use EasyBuild / module. It allows you to load the exact version of a program or library that is compatible with your code.
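A minimal sketch of a typical ''module'' session (the module name and version below are only examples; run ''module avail'' to see what is actually installed on the cluster):

<code bash>
# List the software available through EasyBuild / module
module avail

# Load a specific version of a compiler (illustrative name/version)
module load GCC/11.3.0

# Show what is currently loaded, and unload everything when done
module list
module purge
</code>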
[[hpc/applications_and_libraries|Learn more about EasyBuild/module]]

When you want to use the cluster's resources, you need to connect to the login node and submit a //job// to Slurm, which is our job management and queuing system. The job is submitted with an //sbatch// script: a Bash/shell script with special instructions for Slurm, such as how many CPUs you need, which //Slurm partition// to use, how long your script will run and how to execute your code. Slurm will place your job in a queue with other users' jobs, and find the fastest way to provide the resources you asked for. When the resources are available, your job will start.\\
[[hpc/slurm|Learn more about Slurm]]

One important note about Slurm is the concept of //partition//. When you submit a job, you have to specify a //partition// that will give you access to some specific resources. For instance, depending on the partition, your job will run on CPU-only nodes or on GPU nodes.
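A minimal sketch of such an sbatch script (the resource values, partition name and module are only examples; adapt them to your needs):

<code bash>
#!/bin/sh
#SBATCH --job-name=my_first_job    # a name to identify the job
#SBATCH --partition=shared-cpu     # the Slurm partition to use
#SBATCH --ntasks=1                 # number of tasks
#SBATCH --cpus-per-task=4          # CPU cores per task
#SBATCH --mem=8G                   # memory for the job (example value)
#SBATCH --time=02:00:00            # maximum runtime (HH:MM:SS)

# Load the software your code needs (illustrative module name)
module load GCC/11.3.0

# Run the application
srun ./my_application
</code>

Submit it from the login node with ''sbatch my_script.sh'' and follow its state in the queue with ''squeue -u $USER''.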
===== Cost model =====

**Important update, draft preview.**

We are currently implementing changes to the investment approach for the HPC service Baobab: research groups will no longer purchase physical nodes as their property. Instead, they will have the option to pay for a share and a duration of usage. This new approach offers several advantages for both the research groups and us as the service provider.

For research groups, the main advantage lies in the increased flexibility it provides. They can tailor their investments to the specific needs of their projects, scaling their usage as required. This eliminates the constraints of owning physical nodes and allows for a more efficient allocation of resources.

As the service provider, we benefit from this new investment model as well. We can strategically purchase compute nodes and hardware based on the actual demand from research groups, ensuring that our investments align with their usage patterns. This allows us to optimize resource utilization and make timely acquisitions when needed.

In cases where research groups have already purchased compute nodes, we offer them the opportunity to convert their ownership into credits for shares. We estimate that a compute node typically lasts at least 6 years under normal conditions, and this conversion option ensures that the value of the existing investment is not lost.

==== Price per hour ====

Overview:

{{:hpc:pasted:20240404-092421.png}}

The full table, which you can send to the FNS, is available {{:hpc:hpc:acrobat_2024-04-09_15-58-28.png?linkonly|here}}.

==== Private nodes ====

Research groups can buy "private" nodes to add to our clusters, which means the research group gets a //private partition// with a higher priority on those specific nodes (less waiting time) and can run jobs for a longer time (7 days instead of 4 on public compute nodes).

Rules:

  * The compute node remains the research group's property.
  * There is a three-year warranty on the compute node. If there is a failure after the warranty period, 100% of the repair costs are the responsibility of the research group. If the node is out of order, you have the option to have it repaired. To get a quote, we need to send the compute node to the vendor; the initial cost they charge for a quick diagnostic and a quote is at most 420 CHF, even if the node can't be repaired (worst case).
  * The research group does not get administrative rights on the node.
  * The compute node is installed and maintained by the HPC team in the same way as the other compute nodes.
  * The HPC team can decide to decommission the node when it is too old, but the hardware will be in production for at least four years.
  * See the [[hpc/slurm#partitions|partitions]] section for more details about the integration of your private node in the cluster.

Current prices of compute nodes:

**AMD**
  * 2 x 64-core AMD EPYC 7742 2.25GHz processor
  * 512GB DDR4 3200MHz memory (16x 32GB)
  * 100G IB EDR card
  * 960GB SSD
  * ~ 14'442.55 CHF incl. VAT

**GPU A100 with AMD**
  * 1 x 64-core AMD EPYC 7742 2.25GHz processor
  * 256GB DDR4 3200MHz ECC server memory (8x 32GB / 0 free slots)
  * 1 x 1.92TB SATA III Intel 24x7 datacenter SSD (6.5PB written until warranty end)
  * 1 x NVIDIA Tesla A100 80GB PCIe GPU, passively cooled (max. 4 GPUs possible)
  * ~ 24'300 CHF incl. VAT
  * ~ 11'270 CHF incl. VAT per extra GPU

**GPU RTX3090 with AMD**
  * 2 x 64-core AMD EPYC 7742 2.25GHz processor
  * 512GB DDR4 3200MHz ECC server memory
  * 8 x NVIDIA RTX 3090 24GB graphics controller
  * ~ 42'479 CHF incl. VAT

We usually order and install the nodes twice per year.

If you want to request a financial contribution from UNIGE, you must submit a COINF application: https://www.unige.ch/rectorat/commissions/coinf/appel-a-projets

====== How do I use the clusters? ======

Everyone has different needs for their computation. A typical usage of the cluster would consist of these steps (see the example session after this list):

  * connect to the login node
    * this gives you access to the data in your ''$HOME'' directory
  * execute an sbatch script which requests resources from Slurm for the estimated runtime (e.g. 16 CPU cores and 8 GB of RAM for up to 7h on partition "shared-cpu"). The sbatch script contains instructions/commands:
    * for the Slurm scheduler, to allocate compute resources for a certain time
    * to load the right applications and libraries with ''module'' for your code to work
    * on how to execute your application.
  * the job is placed in the Slurm queue
  * once the requested resources are available, your job starts and is executed on one or multiple compute nodes (which can all access the BeeGFS ''$HOME'' and ''$HOME/scratch'' directories).
  * all communication and data transfer between the nodes, the storage and the login node go through the InfiniBand network.

If you want to know what type of CPU and architecture is supported, check the section [[hpc/hpc_clusters#for_advanced_users|For advanced users]].
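A minimal example session covering these steps (the login hostname is illustrative; see the access documentation for the actual one):

<code bash>
# 1. Connect to the login node from your machine (hostname is an example)
ssh your_username@login2.baobab.hpc.unige.ch

# 2. Submit your sbatch script to Slurm
sbatch my_job.sh

# 3. Check the state of your job in the queue
squeue -u $USER

# 4. Once the job has finished, inspect its output file
#    (by default Slurm writes it as slurm-<jobid>.out)
less slurm-12345.out   # replace 12345 with your job ID
</code>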
====== For advanced users ======

===== Infrastructure schema =====

FIXME

===== Compute nodes =====

Both clusters contain a mix of "public" nodes, funded by the University of Geneva, and "private" nodes, typically funded 50% by the University and 50% by a research group.

Any user of the clusters can request compute resources on any node (public or private), but a research group that owns "private" nodes has a higher priority on its "private" nodes and can request a longer execution time.

==== GPU models on the clusters ====

We have several GPU models on the clusters. The following tables show what is available.

On Baobab:

^ Model ^ Memory ^ GRES ^ Constraint gpu arch ^ Compute capability ^ Minimum CUDA version ^ Precision ^ Weight ^
| Titan X | 12GB | titan | COMPUTE_TYPE_TITAN | COMPUTE_CAPABILITY_6_1 | 8.0 | SIMPLE_PRECISION_GPU | 10 |
| P100 | 12GB | pascal | COMPUTE_TYPE_PASCAL | COMPUTE_CAPABILITY_6_0 | 8.0 | DOUBLE_PRECISION_GPU | 20 |
| RTX 2080 Ti | 11GB | turing | COMPUTE_TYPE_TURING | COMPUTE_CAPABILITY_7_5 | 10.0 | SIMPLE_PRECISION_GPU | 30 |
| RTX 3080 | 10GB | ampere | COMPUTE_TYPE_AMPERE | COMPUTE_CAPABILITY_8_6 | 11.0 | SIMPLE_PRECISION_GPU | 40 |
| RTX 3090 | 24GB | ampere | COMPUTE_TYPE_AMPERE | COMPUTE_CAPABILITY_8_6 | 11.0 | SIMPLE_PRECISION_GPU | 50 |
| A5000 | 24GB | ampere | COMPUTE_TYPE_AMPERE | COMPUTE_CAPABILITY_8_6 | 11.0 | SIMPLE_PRECISION_GPU | 50 |
| A100 | 40GB | ampere | COMPUTE_TYPE_AMPERE | COMPUTE_CAPABILITY_8_0 | 11.0 | DOUBLE_PRECISION_GPU | 60 |
| A100 | 80GB | ampere | COMPUTE_TYPE_AMPERE | COMPUTE_CAPABILITY_8_0 | 11.0 | DOUBLE_PRECISION_GPU | 70 |

If you don't specify a constraint and more than one GPU model can satisfy your request, the GPUs are allocated in the order they are listed in the table: low-end GPUs first (GPUs with a lower weight are selected first).

On Yggdrasil:

^ Model ^ Memory ^ GRES ^ Constraint gpu arch ^ Compute capability ^ Precision ^ Weight ^
| Titan RTX | 24GB | turing | COMPUTE_TYPE_TURING | COMPUTE_CAPABILITY_7_5 | SIMPLE_PRECISION_GPU | 10 |
| V100 | 32GB | volta | COMPUTE_TYPE_VOLTA | COMPUTE_CAPABILITY_7_0 | DOUBLE_PRECISION_GPU | 20 |

When you request a GPU, you can either specify no model at all, or give specific constraints such as double precision. If you are doing machine learning, for example, you DON'T need double precision. Double precision is useful for software doing, for example, physical numerical simulations.

We don't mix GPU models on the same node: every GPU node has only one GPU model.

See [[hpc::slurm#gpgpu_jobs|here]] how to request a GPU for your jobs.
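As an illustration, a sketch of a job requesting one double-precision GPU through a constraint from the tables above (the partition name and module are illustrative):

<code bash>
#!/bin/sh
#SBATCH --partition=shared-gpu              # illustrative GPU partition name
#SBATCH --gres=gpu:1                        # request one GPU of any model
#SBATCH --constraint=DOUBLE_PRECISION_GPU   # P100/A100 on Baobab, V100 on Yggdrasil
#SBATCH --time=01:00:00

# Load a CUDA toolchain (illustrative; pick a version matching the table above)
module load CUDA

srun ./my_gpu_application
</code>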
==== Bamboo (coming soon) ====

^ Generation ^ Model ^ Freq ^ Nb cores ^ Architecture ^ Nodes ^ Memory ^ Extra flag ^ Status ^
| V8 | EPYC-7742 | 2.25GHz | 128 cores | "Rome" (7 nm) | node[001-043] | 512GB | | to be installed |
| V8 | EPYC-72F3 | 3.7GHz | 16 cores | "Rome" (7 nm) | node[044-045] | 1TB | BIG_MEM | to be installed |

^ GPU model ^ Architecture ^ Mem ^ Compute capability ^ Slurm resource ^ Nb per node ^ Nodes ^ Peer access between GPUs ^
| RTX 3090 | Ampere | 24GB | 8.6 | ampere | 8 | gpu[001,002] | NO |
| A100 | Ampere | 80GB | 8.0 | ampere | 4 | gpu[003] | YES |

==== Baobab ====

=== CPUs on Baobab ===

Since our clusters are regularly expanded, the nodes are not all from the same generation. You can see the details in the following table.

^ Generation ^ Model ^ Freq ^ Nb cores ^ Architecture ^ Nodes ^ Extra flag ^ Status ^
| V2 | X5650 | 2.67GHz | 12 cores | "Westmere-EP" (32 nm) | node[093-101,103-111,140-153] | | decommissioned |
| V3 | E5-2660V0 | 2.20GHz | 16 cores | "Sandy Bridge-EP" (32 nm) | node[009-010,012-018,020-025,029-044] | | decommissioned in 2023 |
| V3 | E5-2660V0 | 2.20GHz | 16 cores | "Sandy Bridge-EP" (32 nm) | node[001-005,007-008,011,019,026-028,045-056,058] | | to be decommissioned in 2022 |
| V3 | E5-2670V0 | 2.60GHz | 16 cores | "Sandy Bridge-EP" (32 nm) | node[059,061-062] | | to be decommissioned in 2022 |
| V3 | E5-4640V0 | 2.40GHz | 32 cores | "Sandy Bridge-EP" (32 nm) | node[186] | | to be decommissioned in 2022 |
| V4 | E5-2650V2 | 2.60GHz | 16 cores | "Ivy Bridge-EP" (22 nm) | node[063-066,154-172] | | to be decommissioned in 2022 |
| V5 | E5-2643V3 | 3.40GHz | 12 cores | "Haswell-EP" (22 nm) | gpu[002,012] | | on prod |
| V6 | E5-2630V4 | 2.20GHz | 20 cores | "Broadwell-EP" (14 nm) | node[173-185,187-201,205-213] | | on prod |
| ::: | ::: | ::: | ::: | ::: | gpu[004-010] | ::: | on prod |
| V6 | E5-2637V4 | 3.50GHz | 8 cores | "Broadwell-EP" (14 nm) | node[218-219] | HIGH_FREQUENCY | on prod |
| V6 | E5-2643V4 | 3.40GHz | 12 cores | "Broadwell-EP" (14 nm) | node[202,204,216-217] | HIGH_FREQUENCY | on prod |
| V6 | E5-2680V4 | 2.40GHz | 28 cores | "Broadwell-EP" (14 nm) | node[203] | | on prod |
| V7 | EPYC-7601 | 2.20GHz | 64 cores | "Naples" (14 nm) | gpu[011] | | on prod |
| V8 | EPYC-7742 | 2.25GHz | 128 cores | "Rome" (7 nm) | node[273-277,285-288,312-335] gpu[013-031] | | on prod |
| V9 | GOLD-6240 | 2.60GHz | 36 cores | "Cascade Lake" (14 nm) | node[265-272] | | on prod |

The "generation" column is just a way to classify the nodes on our clusters.

In the following table you can see the features of each architecture.

^ ^ MMX ^ SSE ^ SSE2 ^ SSE3 ^ SSSE3 ^ SSE4.1 ^ SSE4.2 ^ AVX ^ F16C ^ AVX2 ^ FMA3 ^ Nb AVX-512 FMA ^
| Westmere-EP | YES | YES | YES | YES | YES | YES | YES | NO | NO | NO | NO | |
| Sandy Bridge-EP | YES | YES | YES | YES | YES | YES | YES | YES | NO | NO | NO | |
| Ivy Bridge-EP | YES | YES | YES | YES | YES | YES | YES | YES | YES | NO | NO | |
| Haswell-EP | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | |
| Broadwell-EP | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | |
| Naples | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | |
| Rome | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | |
| Cascade Lake | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | YES | 2 |
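A minimal sketch of how to target specific hardware, assuming the flags in the "Extra flag" column are exposed as Slurm features (in the same way as the GPU constraints above):

<code bash>
# Request a node flagged HIGH_FREQUENCY (see the "Extra flag" column above),
# either on the command line...
sbatch --constraint=HIGH_FREQUENCY my_job.sh

# ...or as a directive inside the sbatch script itself:
#   #SBATCH --constraint=HIGH_FREQUENCY
</code>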
=== GPUs on Baobab ===

In the following table you can see which types of GPU are available on Baobab.

^ GPU model ^ Architecture ^ Mem ^ Compute capability ^ Slurm resource ^ Nb per node ^ Nodes ^
| Titan X | Pascal | 12GB | 6.1 | titan | 6 | gpu[002] |
| P100 | Pascal | 12GB | 6.0 | pascal | 6 | gpu[004] |
| P100 | Pascal | 12GB | 6.0 | pascal | 5 | gpu[005] |
| P100 | Pascal | 12GB | 6.0 | pascal | 8 | gpu[006] |
| P100 | Pascal | 12GB | 6.0 | pascal | 4 | gpu[007] |
| Titan X | Pascal | 12GB | 6.1 | titan | 8 | gpu[008-010] |
| RTX 2080 Ti | Turing | 11GB | 7.5 | turing | 2 | gpu[011] |
| RTX 2080 Ti | Turing | 11GB | 7.5 | turing | 8 | gpu[012-016] |
| RTX 2080 Ti | Turing | 11GB | 7.5 | turing | 4 | gpu[018-019] |
| RTX 3090 | Ampere | 24GB | 8.6 | ampere | 8 | gpu[017,021,025-026,034-035] |
| RTX A5000 | Ampere | 24GB | 8.6 | ampere | 8 | gpu[044] |
| RTX 3080 | Ampere | 10GB | 8.6 | ampere | 8 | gpu[023-024,036-043] |
| A100 | Ampere | 40GB | 8.0 | ampere | 2 | gpu[027] |
| A100 | Ampere | 40GB | 8.0 | ampere | 6 | gpu[022] |
| A100 | Ampere | 40GB | 8.0 | ampere | 1 | gpu[028] |
| A100 | Ampere | 40GB | 8.0 | ampere | 4 | gpu[020,030-031] |
| A100 | Ampere | 80GB | 8.0 | ampere | 4 | gpu[029] |
| A100 | Ampere | 80GB | 8.0 | ampere | 3 | gpu[032-033] |
| A100 | Ampere | 80GB | 8.0 | ampere | 2 | gpu[045] |

See the GPU details at https://developer.nvidia.com/cuda-gpus#compute

==== Yggdrasil ====

=== CPUs on Yggdrasil ===

Since our clusters are regularly expanded, the nodes are not all from the same generation. You can see the details in the following table.

^ Generation ^ Model ^ Freq ^ Nb cores ^ Architecture ^ Nodes ^ Extra flag ^
| V9 | [[https://ark.intel.com/content/www/fr/fr/ark/products/192443/intel-xeon-gold-6240-processor-24-75m-cache-2-60-ghz.html|GOLD-6240]] | 2.60GHz | 36 cores | "Intel Xeon Gold 6240 CPU @ 2.60GHz" | cpu[001-023,024-111,120-122] | |
| V9 | [[https://ark.intel.com/content/www/us/en/ark/products/192442/intel-xeon-gold-6244-processor-24-75m-cache-3-60-ghz.html|GOLD-6244]] | 3.60GHz | 16 cores | "Intel Xeon Gold 6244 CPU @ 3.60GHz" | cpu[112-115] | |
| V8 | EPYC-7742 | 2.25GHz | 128 cores | "AMD EPYC 7742 64-Core Processor" | cpu[116-119,123-150] | |
| V9 | [[https://ark.intel.com/content/www/fr/fr/ark/products/193390/intel-xeon-silver-4208-processor-11m-cache-2-10-ghz.html|SILVER-4208]] | 2.10GHz | 16 cores | "Intel Xeon Silver 4208 CPU @ 2.10GHz" | gpu[001-006,008] | |
| V9 | [[https://ark.intel.com/content/www/us/en/ark/products/193954/intel-xeon-gold-6234-processor-24-75m-cache-3-30-ghz.html|GOLD-6234]] | 3.30GHz | 16 cores | "Intel Xeon Gold 6234 CPU @ 3.30GHz" | gpu[007] | |

The "generation" column is just a way to classify the nodes on our clusters.

In the following table you can see the features of each architecture.

^ ^ SSE4.2 ^ AVX ^ AVX2 ^ Nb AVX-512 FMA ^
| Intel Xeon Gold 6244 | YES | YES | YES | 2 |
| Intel Xeon Gold 6240 | YES | YES | YES | 2 |
| Intel Xeon Gold 6234 | YES | YES | YES | 2 |
| Intel Xeon Silver 4208 | YES | YES | YES | 1 |

Click here to [[https://ark.intel.com/content/www/us/en/ark/compare.html?productIds=193390,193954,192443,192442|compare Intel CPUs]].
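If you want to verify which instruction sets the node your job runs on actually supports, you can inspect the CPU from within a job (standard Linux commands; the partition name is illustrative):

<code bash>
# Show the CPU model of a compute node
srun --partition=shared-cpu lscpu | grep 'Model name'

# List the vector instruction sets exposed by the CPU (e.g. avx2, avx512f)
srun --partition=shared-cpu grep -o -E 'avx512f|avx2' /proc/cpuinfo | sort -u
</code>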
=== GPUs on Yggdrasil ===

In the following table you can see which types of GPU are available on Yggdrasil.

^ GPU model ^ Architecture ^ Mem ^ Compute capability ^ Slurm resource ^ Nb per node ^ Nodes ^ Peer access between GPUs ^
| Titan RTX | Turing | 24GB | 7.5 | turing | 8 | gpu[001,002,004] | NO |
| Titan RTX | Turing | 24GB | 7.5 | turing | 6 | gpu[003,005] | NO |
| Titan RTX | Turing | 24GB | 7.5 | turing | 4 | gpu[006,007] | NO |
| V100 | Volta | 32GB | 7.0 | volta | 1 | gpu[008] | YES |

See the GPU details at https://developer.nvidia.com/cuda-gpus#compute

====== Monitoring performance ======

To monitor system resources, you can go to https://monitor.hpc.unige.ch/dashboards

You can see node metrics for the last 30 days and BeeGFS metrics for the last 6 months.

To check resources on a specific node, go to "Baobab - General" or "Yggdrasil - General" and click on "Host Overview - Single". You will be able to choose the node you want to check in the form at the top:

{{ :hpc:c3.png?400 |}}

To go back to the dashboard list, click on the four squares in the left panel:

{{ :hpc:c1.png?400 |}}

The "General" dashboard in the "Yggdrasil - General" and "Baobab - General" folders gives an overview of the cluster: total load and memory used, and how many nodes are up/down.

You can see GPU metrics too, under the "Cuda - GPU" dashboards.

====== Job accounting ======

If you are interested in your HPC usage, group usage, wait time, etc., we have the right tool for you: Open XDMoD. We track job usage of our clusters here: https://openxdmod.hpc.unige.ch/

If you want more details, you need to log in. This instance isn't integrated with the University's information system (SI) yet; to get an account, click on "Sign in" at the top left of the page and then on "Don't have an account?".
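Open XDMoD is the main tool, but for a quick look at your own recent jobs from the command line, Slurm's standard accounting commands also work (the format fields below are just an example):

<code bash>
# Your jobs from the last 7 days, with basic accounting information
sacct --starttime=now-7days \
      --format=JobID,JobName,Partition,AllocCPUS,Elapsed,State

# Cluster-wide usage report for your user since the start of the month
sreport cluster AccountUtilizationByUser Start=$(date +%Y-%m-01) Users=$USER
</code>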