This shows you the differences between two versions of the page.
hpc:faq [2020/11/26 12:41] Yann Sagon [My jobs stay a long time in the pending queue...] |
hpc:faq [2023/12/13 08:37] (current) Yann Sagon [Can I run interactive tasks ?] |
||
---|---|---|---|
Line 2: | Line 2: | ||
===== General ===== | ===== General ===== | ||
+ | |||
+ | ==== Which cluster should I use ==== | ||
+ | You can use both clusters, but see [[hpc: | ||
+ | |||
==== I'm lost, where can I find support ? ==== | ==== I'm lost, where can I find support ? ==== | ||
Line 7: | Line 11: | ||
+ | ==== Citation, publication and acknowledgments ==== | ||
+ | |||
+ | Please see https:// | ||
+ | |||
+ | ==== The cluster is slow ==== | ||
+ | This can happen; the first step is to determine what is slow: | ||
+ | * the login node: You shouldn' | ||
+ | * the compute node (the CPU, the storage) | ||
+ | * the storage (home, scratch, other): in this case, the whole cluster is affected and your job may run slowly. | ||
+ | |||
+ | What to do: be sure you aren't the cause. Check with '' | ||
+ | |||
+ | ===== Account ===== | ||
+ | |||
+ | ==== When does my account expire ==== | ||
+ | * If you have a non-student account (PhD, postdoc, researcher), | ||
+ | * If you have an outsider account, check the expiration date you received when you filled in the invitation. | ||
+ | * If you have a UNIGE student account, you can check the expiration date with the '' | ||
+ | < | ||
+ | (baobab)-[yourusername@login2 ~]$ chage -l yourusername | ||
+ | Last password change | ||
+ | Password expires | ||
+ | Password inactive | ||
+ | Account expires | ||
+ | Minimum number of days between password change | ||
+ | Maximum number of days between password change | ||
+ | Number of days of warning before password expires | ||
+ | </ | ||
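The expiration date is the value on the ''Account expires'' line of the output above. As a minimal sketch, assuming the usual ''label : value'' layout printed by ''chage -l'', the helper below keeps only that value. The function name ''account_expiry'' is hypothetical and not part of the cluster tooling.

```shell
# Print only the value of the "Account expires" line.
# Reads the output of `chage -l <username>` on standard input.
# Assumption: each line has the form "<label> : <value>".
account_expiry() {
    awk -F': ' '/^Account expires/ { print $2 }'
}

# Usage on a login node:
#   chage -l "$USER" | account_expiry
```

If the function prints ''never'', the account has no expiration date set.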
+ | |||
+ | ==== I'm leaving UNIGE, can I continue to use Baobab HPC service? ==== | ||
+ | Yes, this is possible as long as you keep collaborating closely with your former research group. Your PI must invite you as [[hpc: | ||
+ | We'll then reactivate your account. You'll keep your data. | ||
===== Storage ===== | ===== Storage ===== | ||
Line 56: | Line 92: | ||
See [[hpc/ | See [[hpc/ | ||
==== Can I launch a job longer than 4 days ? ==== | ==== Can I launch a job longer than 4 days ? ==== | ||
- | No. Unfortunately you can't. If we raised this limit, you would have to wait longer before your pending jobs start. We think that the 4-day limit is a good trade-off.\\ \\ However there could be two work-arounds | + | No. Unfortunately you can't. If we raised this limit, you would have to wait longer before your pending jobs start. We think that the 4-day limit is a good trade-off.\\ \\ However there could be two work-around | ||
- | - Some softwares | + | - Some software |
- You could add private nodes to Baobab. In that case the limit will be raised to 7 days or even higher. If you are interested, contact us.\\ \\ | - You could add private nodes to Baobab. In that case the limit will be raised to 7 days or even higher. If you are interested, contact us.\\ \\ | ||
Line 73: | Line 109: | ||
==== Can I run interactive tasks ? ==== | ==== Can I run interactive tasks ? ==== | ||
- | FIXME ==> Yes => link to Slurm doc + access ?? | ||
- | Yes, you can. But it is really awkward because you cannot be sure when your job will start:\\ \\ To request | + | Yes, you can. But it is really awkward because you cannot be sure when your job will start. |
+ | |||
+ | See [[hpc/ | ||
+ | |||
+ | ==== I'm not able to use all the cores of a compute node ==== | ||
+ | Indeed, we reserve two cores per node for system tasks such as data transfer and operating system processes. | ||
+ | |||
+ | < | ||
+ | (yggdrasil)-[root@admin1 ~]$ scontrol show node cpu001 | ||
+ | NodeName=cpu001 Arch=x86_64 CoresPerSocket=18 | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | |||
+ | </ | ||
+ | |||
+ | If you really need to use all the cores of a compute node, you can override this parameter: | ||
+ | |||
+ | ref: https:// | ||
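To see how many cores remain available to jobs on a node, subtract the two reserved cores from the node total. The sketch below is only an illustration: it assumes the ''scontrol show node'' output contains the ''Sockets='' and ''CoresPerSocket='' fields, and the function name ''usable_cores'' is hypothetical.

```shell
# Print the number of cores left for jobs on one node.
# Reads the output of `scontrol show node <nodename>` on standard input.
# Assumption: two cores per node are reserved, as stated above.
usable_cores() {
    tr ' ' '\n' | awk -F= -v reserved=2 '
        $1 == "Sockets"        { sockets = $2 }
        $1 == "CoresPerSocket" { cps = $2 }
        END {
            # Fail instead of printing a misleading number.
            if (sockets == "" || cps == "") exit 1
            print sockets * cps - reserved
        }'
}

# Usage:
#   scontrol show node cpu001 | usable_cores
```

For a node with two sockets of 18 cores each, this prints 34.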
===== Troubleshooting ===== | ===== Troubleshooting ===== | ||
+ | ==== Check ssh key ==== | ||
+ | If you authenticate to the cluster with an SSH key and you have trouble: | ||
+ | |||
+ | * check you have the correct private key: '' | ||
+ | | ||
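One common way to confirm that a private key belongs to a given public key is to derive the public key from the private one with ''ssh-keygen -y'' and compare it with the public key you registered. This is a generic sketch and not necessarily the command this page recommends. The function name ''same_public_key'' and the file paths are hypothetical.

```shell
# Succeed when two public-key lines carry the same key type and key data.
# The trailing comment field (for example user@host) is ignored.
same_public_key() {
    [ "$(printf '%s\n' "$1" | awk '{ print $1, $2 }')" = \
      "$(printf '%s\n' "$2" | awk '{ print $1, $2 }')" ]
}

# Usage (hypothetical paths):
#   same_public_key "$(ssh-keygen -y -f ~/.ssh/id_ed25519)" \
#                   "$(cat ~/.ssh/id_ed25519.pub)" && echo "key pair matches"
```

If the check fails, the private key on your machine is not the one whose public key you registered.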
==== Illegal instruction ==== | ==== Illegal instruction ==== | ||
- | If you run a program and it crashes with an error ''" | + | If you run a program and it crashes with an error ''" |
- | N.B. : login1 was running CentOS6, as of August 2019 all compute nodes run CentOS7 | + | you have compiled your program on the Baobab login node and it is running on an older compute node | ||
+ | whose CPU lacks some specialized instructions that were used during compilation. | ||
You have two possibilities: | You have two possibilities: | ||
- | - Recompile your program with less optimization | + | - Recompile your program with less optimization, or compile on an older node. See [[hpc: |
- Only run your program on newer servers. See [[hpc: | - Only run your program on newer servers. See [[hpc: | ||
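To check whether a node supports a given instruction set, you can look at the ''flags'' line of ''/proc/cpuinfo'' on that node. The helper below is a minimal sketch. The function name ''has_cpu_flag'' is hypothetical, and ''avx2'' is only an example flag, not necessarily the one your program needs.

```shell
# Succeed when the CPU description read on standard input lists the given flag.
# $1: flag name as it appears in /proc/cpuinfo, for example avx2
has_cpu_flag() {
    awk -v want="$1" -F': ' '
        /^flags/ {
            n = split($2, f, " ")
            for (i = 1; i <= n; i++) if (f[i] == want) found = 1
        }
        END { exit !found }'
}

# Usage, on the node where the program crashes:
#   has_cpu_flag avx2 < /proc/cpuinfo && echo "avx2 supported" || echo "avx2 missing"
```

With GCC, the option ''-march=native'' targets the CPU of the machine doing the compilation, so a binary built that way on a recent login node may use instructions that an older compute node does not have.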