{{METATOC 1-5}}
  
  
  
====== Introduction ======
This page gives best practices and tips on how to use the clusters **Baobab** and **Yggdrasil**.
  
An HPC cluster is an advanced, complex and always-evolving piece of technology. It's easy to forget details and make mistakes when using one, so don't hesitate to check this section every now and then, yes, even if you are the local HPC guru in your team! There's always something new to learn!
For your first steps we recommend the following:
  * Check the [[hpc:best_practices#rules_and_etiquette|Rules and etiquette]].
  * Connect to the login node of the cluster you are planning to use: [[hpc:access_the_hpc_clusters|Access : SSH, X2GO]] (a minimal connection example follows this list).
  * Check the rest of this page for best practices and smart use of the HPC resources.
    * [[hpc:best_practices#stop_wasting_resources|This page contains important information]]! You can hog resources and/or wait much longer than necessary if you don't request the right amount of time for your job, if you request too many (or too few) resources, if you are not using the right partition, etc.
  * Understand how to load your libraries/applications with ''module'': [[applications_and_libraries|Applications and libraries]]
  * Learn how to write a Slurm ''sbatch'' script: [[slurm|Slurm and job management]]
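As a quick illustration of the connection step above, here is what an SSH login could look like from your workstation. The hostnames are only an assumption based on the login node names used elsewhere on this wiki ("login2" on Baobab, "login1" on Yggdrasil); check the [[hpc:access_the_hpc_clusters|access page]] for the authoritative ones and replace ''toto'' with your own username.

<code>
# connect to the Baobab login node (hostname is an assumption, verify on the access page)
ssh toto@login2.baobab.hpc.unige.ch

# connect to the Yggdrasil login node (hostname is an assumption as well)
ssh toto@login1.yggdrasil.hpc.unige.ch
</code>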
  
====== Rules and etiquette ======
  * Loading applications and libraries should always be done using ''module''
    * Pro tip: you can even pin the module version, for example to get consistent results or to survive an OS migration (see the sketch after this list).
    * When an application is not available through ''module'', you can compile binaries in your ''$HOME'' directory and use them on any node of the cluster (since your ''$HOME'' is accessible from any node). Make sure you load the [[hpc:applications_and_libraries#choosing_the_compiler_toolchain|compiler]] with ''module''.
    * You can request the HPC team to install new software or a new version of a library in order to load it through ''module''. Check [[hpc/applications_and_libraries#what_do_i_do_when_an_application_is_not_available_via_module|this page]].
  * You **cannot** install new software with ''yum'' (and don't bother with ''apt-get'', we are not running Ubuntu).
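A minimal sketch of the ''module'' pro tip above. The module name and version are only examples; list what is actually installed with ''module avail'':

<code>
# see which versions of an application are installed (GCC is just an example name)
module avail GCC

# load a specific, pinned version so results stay reproducible across OS migrations
module load GCC/10.2.0

# check what is currently loaded in your environment
module list
</code>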
The same goes for storage. As you [[hpc/storage_on_hpc|already know]], you shouldn't store any personal files on the clusters.
  
But even with scientific data: if you are storing thousands of files and using hundreds of GB that you don't really need, at some point we will have to buy more storage. The storage servers are no different from the compute nodes: they also need electricity to run and AC to cool them down. So deleting useless files from time to time is a good habit.
  
Besides the quantity of data, remember that it also matters //where// you store your data. For instance, we back up the content of your ''$HOME'' every day. Once again, backing up large quantities of data relies on a high-speed network, backup tapes, a robot to read/load the tapes, etc.
  
  * CPUs, which are grouped in [[hpc/slurm#partitions|partitions]]
  * [[hpc/hpc_clusters#compute_nodes|GPGPUs]], which are accelerators for software that supports them
  * memory (RAM) per core or per node, 3GB by default
  * disk space
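To give an idea of how these resources translate into a job request, here is an illustrative fragment of ''sbatch'' directives. The partition name and all values are examples to adapt, not recommendations:

<code>
#SBATCH --partition=shared-cpu    # example partition, see the partitions page
#SBATCH --cpus-per-task=4         # number of CPUs
#SBATCH --mem-per-cpu=3000        # memory per CPU in MB (3GB is the default)
#SBATCH --time=01:00:00           # maximum run time
##SBATCH --gres=gpu:1             # uncomment on a GPU partition to request a GPGPU
</code>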
===== Single thread vs multi thread vs distributed jobs =====
  
See [[hpc:slurm#single_thread_vs_multi_thread_vs_distributed_jobs|this section]] to make sure you specify the correct configuration for your job type.
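As a rough sketch of what that configuration typically boils down to (a simplification; the linked page is authoritative and the numbers are examples):

<code>
# multi-threaded job: one task using several CPUs on a single node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# distributed job (e.g. MPI): several tasks of one CPU each, possibly across nodes
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
</code>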
  
  
===== Bad CPU usage =====
  
Let's take the example of a **single threaded job**. You should clearly use a partition which allows you to request a single CPU, such as ''public-cpu'' or ''shared-cpu'', and ask for one CPU only. If you request too many CPUs, the resources will be reserved for your job but only one CPU will be used. See the screenshot below of such a bad case, where 90% of the compute node is idle.
  
{{ :hpc:wasted_resources_mono_threaded_job_1.png?nolink&450 |}}
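To avoid this situation, a single-threaded job should request exactly one task with one CPU. A minimal sketch (partition, time and program name are examples to adapt):

<code>
#!/bin/bash
#SBATCH --partition=shared-cpu   # a partition that allows single-CPU jobs
#SBATCH --ntasks=1               # one task...
#SBATCH --cpus-per-task=1        # ...running on a single CPU
#SBATCH --time=00:30:00          # adapt to the real duration of your job

srun ./my_single_threaded_program   # hypothetical program name
</code>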
  
  
    * This will help you choose the parameters ''<nowiki>--ntasks</nowiki>'' and ''<nowiki>--cpus-per-task</nowiki>''
  * [[hpc/slurm#which_partition_for_my_job|What partition should I run my job on]]?
    * This will help you choose the parameter ''<nowiki>--partition</nowiki>''
  * How much memory does my job need?
    * This will help you choose the parameters ''<nowiki>--mem</nowiki>'' or ''<nowiki>--mem-per-cpu</nowiki>''
  * Do I want to receive email notifications?
    * This is optional, but you can specify the level of detail you want with the ''<nowiki>--mail-type</nowiki>'' parameter (a sketch putting these parameters together follows this list)
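Putting the answers to this checklist together, a job script could look like the sketch below. Everything in it (job name, partition, resources, run time, module, program) is an example to adapt to your own job, not a recommendation:

<code>
#!/bin/bash
#SBATCH --job-name=my_job         # hypothetical job name
#SBATCH --partition=public-cpu    # which partition? (example)
#SBATCH --ntasks=1                # how many tasks?
#SBATCH --cpus-per-task=4         # how many CPUs per task?
#SBATCH --mem-per-cpu=3000        # how much memory? (MB per CPU)
#SBATCH --time=02:00:00           # how long will it run?
#SBATCH --mail-type=END,FAIL      # optional email notifications

module load GCC/10.2.0            # example module, adapt to your software

srun ./my_program                 # hypothetical program name
</code>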

====== Transfer data from one cluster to another ======
===== Rsync =====
This guide assumes you want to transfer the directory ''<nowiki>$HOME/my_projects/the_best_project_ever</nowiki>'' from Baobab to Yggdrasil, at the same path. Adapt it to your case by changing the variables.

__**Rsync options:**__
  * ''<nowiki>-a, --archive</nowiki>'': equivalent to ''<nowiki>-rlptgoD</nowiki>''. It is a quick way of saying you want recursion and want to preserve almost everything (with ''<nowiki>-H</nowiki>'' being a notable omission). The only exception to this equivalence is when ''<nowiki>--files-from</nowiki>'' is specified, in which case ''<nowiki>-r</nowiki>'' is not implied.
  * ''<nowiki>-i</nowiki>'': turns on the itemized format, which shows more information than the default format
  * ''<nowiki>-b</nowiki>'': makes rsync back up files that exist in both folders, appending ''~'' to the old file. You can control this suffix with ''<nowiki>--suffix .suf</nowiki>''
  * ''<nowiki>-u</nowiki>'': makes rsync skip files which are newer in the destination than in the source
  * ''<nowiki>-z</nowiki>'': turns on compression, which is useful when transferring easily-compressible files over slow links
  * ''<nowiki>-P</nowiki>'': turns on ''<nowiki>--partial</nowiki>'' and ''<nowiki>--progress</nowiki>''
  * ''<nowiki>--partial</nowiki>'': makes rsync keep partially transferred files if the transfer is interrupted
  * ''<nowiki>--progress</nowiki>'': shows a progress bar for each transfer, useful if you transfer big files
  * ''<nowiki>-n, --dry-run</nowiki>'': performs a trial run with no changes made

1) Go to the directory containing ''<nowiki>the_best_project_ever</nowiki>'':
<code>
(baobab)-[toto@login2 ~]$ cd $HOME/my_projects/
</code>

2) Set the variables (adapt them to your case):
<code>
(baobab)-[toto@login2 my_projects]$ DST=$HOME/my_projects/
(baobab)-[toto@login2 my_projects]$ DIR=the_best_project_ever
(baobab)-[toto@login2 my_projects]$ YGGDRASIL=login1.yggdrasil
</code>
3) Run the rsync:
<code>
(baobab)-[toto@login2 my_projects]$ rsync -aviuzPrg ${DIR} ${YGGDRASIL}:${DST}
</code>
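If you are unsure about the result, you can first add the ''<nowiki>-n</nowiki>'' (dry run) option described above to preview what would be transferred without changing anything, for example:

<code>
(baobab)-[toto@login2 my_projects]$ rsync -aviuzPn ${DIR} ${YGGDRASIL}:${DST}
</code>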