hpc:faq [2020/11/26 12:31] Yann Sagon
hpc:faq [2023/12/13 08:37] (current) Yann Sagon
===== General =====

==== Which cluster should I use ====
You can use both clusters, but see [[hpc:

==== I'm lost, where can I find support ? ====
==== Citation, publication and acknowledgments ====

Please see https://

==== The cluster is slow ====
This may happen, but the problem is to determine what exactly is slow:
  * the login node: you shouldn'
  * the compute node (the CPU, the storage)
  * the storage (home, scratch, other): in this case the whole cluster is impacted and your jobs can run slowly.

What to do: be sure you aren't the cause. Check with ''
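As a starting point for that self-check, the common Linux tools below show your own resource usage — a sketch only, since the exact commands the FAQ recommends are truncated above:

```shell
#!/bin/sh
# List your own processes with their CPU usage: a quick way to see
# whether something you run on the login node is eating resources.
ps -u "$(id -un)" -o pid,pcpu,comm | head -n 10

# Summarized, human-readable disk usage of your home directory
# (heavy I/O on shared storage slows everyone down).
du -sh "$HOME" 2>/dev/null || true
```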
===== Account =====

==== When does my account expire ====
  * If you have a non-student account (PhD, postdoc, researcher), 
  * If you have an outsider account, you need to check the expiration date you received when you filled in the invitation.
  * If you have a UNIGE student account, you can check the expiration date with the ''chage'' command:
<code>
(baobab)-[yourusername@login2 ~]$ chage -l yourusername
Last password change
Password expires
Password inactive
Account expires
Minimum number of days between password change
Maximum number of days between password change
Number of days of warning before password expires
</code>

==== I'm leaving UNIGE, can I continue to use Baobab HPC service? ====
Yes, it is possible as long as you collaborate tightly with your former research group. Your PI must invite you as [[hpc:
We'll then reactivate your account. You'll keep your data.
===== Storage =====
In that case you can use the **job arrays** feature of SLURM. Please have a look at the documentation [[hpc:
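The job-array idea can be sketched as follows — the file name ''params.txt'', the program name and the ''#SBATCH'' values are placeholders for this illustration, not the cluster's recommended settings:

```shell
#!/bin/sh
# Hypothetical job-array script: the same program runs once per array
# index, each run picking a different parameter line. Submit with:
#   sbatch this_script.sh
#SBATCH --job-name=array-demo
#SBATCH --array=1-3
#SBATCH --time=00:10:00

# SLURM exports SLURM_ARRAY_TASK_ID for each array task; default to 1
# so the script can also be exercised outside the scheduler.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"

# One parameter per line (params.txt is a placeholder name; created
# here only so the sketch is self-contained).
[ -f params.txt ] || printf '0.1\n0.5\n0.9\n' > params.txt

# Take the TASK_ID-th line of the file as this task's parameter.
PARAM=$(sed -n "${TASK_ID}p" params.txt)
echo "task ${TASK_ID} uses parameter ${PARAM}"
# ./my_program "$PARAM"    # my_program is a placeholder
```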
==== What partition should I choose ? ====
See [[hpc/
==== Can I launch a job longer than 4 days ? ====
No, unfortunately you can't. If we raised this limit, you would have to wait longer before your pending jobs started. We think that the 4 days limit is a good trade-off.\\ \\ However, there are two work-arounds:
  - Some software 
  - You could add private nodes to Baobab. In that case the limit will be raised to 7 days or even higher. If you are interested, contact us.\\ \\
==== How are the priorities computed ? ====
See [[hpc:slurm#how_is_the_priority_of_a_job_determined|here]].

To get the priority calculation details of the jobs in the pending queue, you can use the command: ''
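The multifactor scheme boils down to a weighted sum of normalized factors (age, fair-share, job size, partition). A toy calculation of that idea — the weights and factor values below are invented for illustration, not the cluster's configuration:

```shell
#!/bin/sh
# Toy multifactor priority: each factor is normalized to [0,1] and
# multiplied by a site-configured weight; the sum is the job priority.
# All numbers below are made up.
AGE=0.5        # how long the job has been pending, normalized
FAIRSHARE=0.2  # lower after heavy past usage
JOBSIZE=0.1    # relative size of the job

W_AGE=1000; W_FAIRSHARE=10000; W_JOBSIZE=100

PRIORITY=$(awk -v a="$AGE" -v f="$FAIRSHARE" -v s="$JOBSIZE" \
               -v wa="$W_AGE" -v wf="$W_FAIRSHARE" -v ws="$W_JOBSIZE" \
               'BEGIN { printf "%d", wa*a + wf*f + ws*s }')
echo "priority: $PRIORITY"   # 1000*0.5 + 10000*0.2 + 100*0.1 = 2510
```

With a fair-share weight an order of magnitude above the others, past usage dominates the ranking, which matches how such schemes are usually tuned.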
==== My jobs stay a long time in the pending queue... ====
See:
  * [[hpc/
  * [[hpc/
  * [[best_practices#

==== Can I run interactive tasks ? ====
Yes, you can. But it is really awkward because you cannot be sure when your job will start.

See [[hpc/
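As a sketch, an interactive allocation is typically requested with ''salloc'' followed by ''srun --pty'' for a shell on the allocated node; the partition name and time limit below are assumptions, check the linked documentation for the cluster's actual values:

```shell
#!/bin/sh
# Assemble an interactive request; on the login node you would then
# actually execute it. The flags are standard Slurm options; the
# partition name "debug-cpu" is an assumption for this sketch.
SALLOC_OPTS="--ntasks=1 --time=00:30:00 --partition=debug-cpu"
echo "salloc $SALLOC_OPTS"
# Once the allocation is granted, get a shell on the compute node:
# srun --pty "$SHELL"
```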
==== I'm not able to use all the cores of a compute node ====
Indeed, we reserve two cores per node for system tasks such as data transfers and other OS housekeeping.
<code>
(yggdrasil)-[root@admin1 ~]$ scontrol show node cpu001
NodeName=cpu001 Arch=x86_64 CoresPerSocket=18
</code>
If you really need to use all the cores of a compute node, you can override this parameter:

ref: https://slurm.schedmd.com/
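The Slurm core-specialization documentation referenced above describes the ''-S''/''--core-spec'' option; a hedged sketch of using it to claim the reserved cores — whether this actually works depends on the cluster's configuration (''AllowSpecResourcesUsage''):

```shell
#!/bin/sh
# Sketch: ask that zero cores be set aside for system use, so the job
# sees all cores of the node. Only honored if the cluster allows it.
SBATCH_OPTS="--core-spec=0"
echo "sbatch $SBATCH_OPTS job.sh"   # job.sh is a placeholder script
```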
===== Troubleshooting =====
==== Check ssh key ====
If you connect to the cluster with an ssh key as authentication mechanism and you have trouble:

  * check that you have the correct private key: ''
==== Illegal instruction ====
If you run a program and it crashes with an error ''"Illegal instruction"'', this usually means you compiled your program on the Baobab login node and it is running on an older compute node whose CPU lacks some specialized instructions that were used during compilation.

You have two possibilities:
  - Recompile your program with less optimization, or compile it on an older node. See [[hpc:
  - Only run your program on newer servers. See [[hpc:
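One way to see such a CPU mismatch — a sketch, where ''avx2'' is just an example instruction-set flag and the build advice assumes GCC:

```shell
#!/bin/sh
# /proc/cpuinfo lists the instruction-set flags of the machine you are
# on. A binary built with e.g. -march=native on a newer CPU may use
# extensions an older node lacks, producing "Illegal instruction".
if grep -q avx2 /proc/cpuinfo; then
    echo "this CPU supports avx2"
else
    echo "this CPU does not support avx2"
fi
# A more portable build (GCC example): target a generic baseline
# instead of the build host:
#   gcc -O2 -march=x86-64 -o my_prog my_prog.c
```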