Differences

This shows you the differences between two versions of the page.

--- hpc:getting_started [2022/08/04 13:26] – [An example] Pierre Kuenzli
+++ hpc:getting_started [2023/06/09 09:07] (current) – Adrien Albert
@@ Line 1: / Line 1: @@
-<title> HPC for dummies </title>
+====== HPC for dummies ======
 This document presents in general terms what computing clusters are and what High Performance Computing (HPC) is. It can be read out of curiosity, if you want to know what those infrastructures are for, or if you are a potential future user with limited technical background. For more practical information on using the unige HPC infrastructure, refer to the rest of the documentation.
@@ Line 9: / Line 9: @@
 When running heavy computations, you can face multiple challenges. If one of your tasks takes several days to complete, you can let it run on your lab's computer, taking care not to turn it off during that time. And what if you have several of those tasks to run at the same time ? You can do it if the number of tasks is small, since modern computers come with multiple computing cores (typically 4 to 8), allowing several tasks to run at the same time.
-But what if now you have to run hundreds or thousands of those tasks ? Then the idea is simple : use a lot of computers at the same time. But having lots of computers available is not enough, you need a way to manage them in a centralized way. Otherwise, you would have to connect individually to each computer, run some tasks, wait for completion, and manually gather the results. And still, what happens if you want to share the resources with other people ? You would have to establish some kind of usage schedule. If you want to use hundreds of computers or more, manual management of tasks is simply not an option.
+But what if you have to run hundreds or thousands of those tasks ? Then the idea is simple : use a lot of computers at the same time. But having lots of computers available is not enough, you need a way to manage them in a centralized way. Otherwise, you would have to connect individually to each computer, run some tasks, wait for completion, and manually gather the results. And still, what happens if you want to share the resources with other people ? You would have to establish some kind of usage schedule. If you want to use hundreds of computers or more, manual management of tasks is simply not an option.
 That's where computing clusters come into play. They are quite literally clusters of computers, interconnected by a network, with centralized storage and central task (or job) management software.
@@ Line 15: / Line 15: @@
 <note>Other technologies exist to run tasks on demand on multiple computers, such as cloud computing or grid/distributed computing, but each comes with its own advantages and limitations. Limitations include the lack of the very high performance network needed for large scale parallel computation (we'll come to that later) and, for distributed computing, the lack of high availability and security concerns.</note>
-First clusters where made with commodity hardware and named "Beowulf" clusters and compound of standard computers on which computations runs (the computing nodes), a head node (or login node or front end) from which users interact with the system and a file server accessible by all nodes (computing nodes and login node). Modern High Performance Computing (HPC) clusters relies essentially on the same architecture and software stack with more high end and specialized hardware. Nodes runs under some Linux distribution and a task scheduling tool (slurm in unige case) is available.
+The first clusters were made with commodity hardware and named "Beowulf" clusters. They were composed of standard computers on which computations run (the computing nodes), a head node (or login node or front end) from which users interact with the system, and a file server accessible by all nodes (computing nodes and login node). Modern High Performance Computing (HPC) clusters rely essentially on the same architecture and software stack, with more high end and specialized hardware. Nodes run a Linux distribution and a task scheduling tool (slurm in unige's case) is available.
 **Documentation :** see [[slurm|documentation on how to use slurm]] on unige's clusters for more information on the queuing system.
@@ Line 27: / Line 27: @@
 **Documentation :** the [[hpc_glossary|glossary]] gives the meaning of some terms.
-Each node in a cluster is a computer, embedding one or more multicore CPUs, a certain amount of RAM, one or more network interfaces and possibly one or more coprocessors (more often GPUs). Thus, a cluster can be characterized by its number of nodes, quantity of RAM, type and number or CPU cores and GPUs and network bandwith. Some cluster are homogeneous (every node has the same configuration) while others are heterogeneous (there are nodes with different configurations).
+Each node in a cluster is a computer, embedding one or more multicore CPUs, a certain amount of RAM, one or more network interfaces and possibly one or more coprocessors (usually GPUs). Thus, a cluster can be characterized by its number of nodes, quantity of RAM, type and number of CPU cores and GPUs, and network bandwidth. Some clusters are homogeneous (every node has the same configuration) while others are heterogeneous (there are nodes with different configurations).
 Another very important part of a cluster is its storage. Indeed, software running on a cluster needs to access data to process. In the case of clusters, data is stored as close as possible to the compute nodes, in storage servers or local storage, rather than on some distant server or service such as cloud storage.
@@ Line 89: / Line 89: @@
   * Clusters are running Linux, which heavily relies on a command line interface. You will be able to perform some tasks with a graphical interface, but at some point you will have to use a command line interface.
   * There are many users using the cluster at the same time. So you have no guarantee your computations will start immediately.
-  * You will not directly run your program on the cluster as on you do on your computer. You will ask the queueing system to run a program, and once resources are available, the queuing system will start the program.
+  * You will not directly run your program on the cluster as you do on your personal computer. You will ask the queuing system to run a program, and once resources are available, the queuing system will start the program.
   * HPC clusters were not designed for interactive tasks. While interactive use is possible, clusters are much better suited to asynchronous computing (without user interaction).
-As a user, you will interact directly with the login node only. From this computer, you will manage your file, set up your execution configuration and ask the queuing system for computation on the compute nodes. But you will never run your programs directly on the login node.
+As a user, you will interact directly with the login node only. From this computer, you will manage your files, set up your execution configuration and ask the queuing system for computation on the compute nodes. But you will never run your programs directly on the login node.
 {{ :hpc:cluster.png?800 |}}
@@ Line 130: / Line 130: @@
 </code>
-With this file, we are telling the queuing system "I want you to run my multiply program on one processor of one of the computing node, for a maximum of one minute". Of course, when working with more complex program, you will increase those values (number of computing resources and computing time) according to your needs. You could replace the input data by a large data file or a set of files and the program with any complex simulation tool, the principle will be the same.
+With this file, we are telling the queuing system "I want you to run my multiply program on one processor of one of the computing nodes, for a maximum time of one minute". Of course, when working with more complex programs, you will increase those values (number of computing resources and computing time) according to your needs. You could replace the input data by a large data file or a set of files and the program with any complex simulation tool, the principle will be the same.
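As a rough illustration of increasing those values, the sketch below writes a job file that asks for more resources than the one-processor, one-minute example. The file name ''simulation.sh'', the program ''./my_simulation'', its input file and all the chosen values are hypothetical. The ''#SBATCH'' options used (''--job-name'', ''--ntasks'', ''--cpus-per-task'', ''--mem'', ''--time'') are standard slurm options, but the limits that actually apply on unige's clusters are described in the slurm documentation.

```shell
# Write a hypothetical job file; nothing is submitted at this point.
cat > simulation.sh <<'EOF'
#!/bin/sh
# Name shown in the queue (hypothetical).
#SBATCH --job-name=simulation
# One task, using 8 cores of a single node.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
# Memory requested on the node.
#SBATCH --mem=16G
# Maximum run time: two hours.
#SBATCH --time=02:00:00

# Hypothetical program and input file, replace with your own.
srun ./my_simulation input_data.dat
EOF
```

Asking for more than you need usually means waiting longer in the queue, so it is worth starting from modest values and increasing them only when a job runs out of time or memory.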
 So let's do the job. Send those files to the cluster, submit the job to the queuing system and gather results once done.
-{{ :hpc:run_a_job.png?400 |}}
+{{ :hpc:run_a_job.png?500 |}}
 In this image, you see exactly the steps described above, performed from a Linux terminal.
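The same steps can be sketched as the commands below. This is only a minimal sketch under assumptions: the login node address ''login.example.org'', the user name ''username'' and the file names ''multiply'', ''input.txt'', ''job.sh'' and ''output.txt'' are placeholders, not the real unige values. The small ''run'' helper is also an addition for illustration: with ''DRY_RUN=1'' (the default here) it only prints each command so that no server is contacted, and with ''DRY_RUN=0'' it executes them.

```shell
#!/bin/sh
# Placeholders (assumptions): replace with your own account and cluster address.
CLUSTER="${CLUSTER:-username@login.example.org}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# 1. Send the program, its input and the job file to the login node.
run scp multiply input.txt job.sh "$CLUSTER:"
# 2. Submit the job to the queuing system from the login node.
run ssh "$CLUSTER" sbatch job.sh
# 3. Check the state of your jobs in the queue.
run ssh "$CLUSTER" squeue -u username
# 4. Once the job is done, copy the result back to your computer.
run scp "$CLUSTER:output.txt" .
```

Note that only file transfers and requests to the queuing system go through the login node; the program itself runs on a compute node chosen by the queuing system.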
@@ Line 146: / Line 146: @@
 ==== Going further ====
-Now that you have an idea of what HPC clusters are, if you are willing to actually use them, your next steps are :
+Now you have an idea of what HPC clusters are. If you are willing to actually use them, your next steps are :
   * Get familiar with Linux and the command line environment.
   * Go through the rest of the HPC unige documentation to get used to the local infrastructure.
-**Documentation :** some important parts of the unige HPC documentation :
+In particular, please read the best practices guide, which helps avoid the most common mistakes.
+**Documentation :** [[best_practices|best practices guide]].
+You can also find help and advice through our forum, the FAQ and direct contact with the HPC admin and support team.
+**Documentation :** [[start#support_-_get_help|support and advice using the cluster]].
+**Documentation :** some other important parts of the unige HPC documentation :
   * [[start|The main page of unige HPC documentation]]