Differences

This shows you the differences between two versions of the page.

--- hpc:faq [2020/11/26 12:35]
Yann Sagon [How are the priorities computed ?]
+++ hpc:faq [2023/12/13 08:37] (current)
Yann Sagon [Can I run interactive tasks ?]
@@ Line 2: / Line 2: @@
 ===== General =====
+==== Which cluster should I use ====
+You can use both clusters, but see [[hpc:hpc_clusters#the_clustersbaobab_and_yggdrasil|this link]] to help you choose the right cluster.
 ==== I'm lost, where can I find support ? ====
@@ Line 7: / Line 11: @@
+==== Citation, publication and acknowledgments ====
+Please see https://www.unige.ch/eresearch/en/services/hpc/terms-use/.
+==== The cluster is slow ====
+This may happen, but the problem is to determine what is slow:
+  * the login node: You shouldn't be running any job on the login node. Maybe another user is doing that and is using all the CPUs. In this case, only the login node is slow, not the jobs running on the compute nodes.
+  * the compute node (the cpu, the storage)
+  * the storage (home, scratch, other): in this case, the whole cluster is impacted and your job can run slowly.
+What to do: be sure you aren't the cause. Check with ''htop'' on the login node. If you see that all the cpus are in use, please take a screenshot and send it to us at hpc@unige.ch.
+===== Account =====
+==== When does my account expire ====
+  * If you have a non-student account (PhD, postdoc, researcher), your account will expire when your contract at UNIGE expires. Right now, there is a grace period of around 6 months after the end of your contract.
+  * If you have an outsider account, you need to check the expiration date you received when you filled in the invitation.
+  * If you have a UNIGE student account, you can check the expiration date with the ''chage'' command:
+<code>
+(baobab)-[yourusername@login2 ~]$ chage -l yourusername
+Last password change                                    : Apr 01, 2022
+Password expires                                        : never
+Password inactive                                       : never
+Account expires                                         : never
+Minimum number of days between password change          : 0
+Maximum number of days between password change          : 99999
+Number of days of warning before password expires       : 7
+</code>
+==== I'm leaving UNIGE, can I continue to use Baobab HPC service? ====
+Yes, it is possible as long as you collaborate closely with your former research group. Your PI must invite you as an [[hpc:access_the_hpc_clusters#outsider_account|outsider]]. For technical reasons, your account must have expired before the invitation is requested.
+We'll then reactivate your account. You'll keep your data.
 ===== Storage =====
@@ Line 56: / Line 92: @@
 See [[hpc/slurm#which_partition_for_my_job|here]]
 ==== Can I launch a job longer than 4 days ? ====
-No. Unfortunately you can't. If we raised this limit, you will have to wait longer before having your pending jobs started. We think that the 4 days limit is a good trade-off.\\ \\ However there could be two work-arounds if you experience an issue with this limit:
+No. Unfortunately you can't. If we raised this limit, you would have to wait longer before your pending jobs start. We think that the 4-day limit is a good trade-off.\\ \\ However, there are two workarounds if this limit is a problem for you:
-  - Some softwares feature *checkpointing*. During runtime, the program will periodically save its current state on the disks. In that case, this snapshot may be used to resume the computation by another job.  Check if your program allows checkpointing. If you cannot find the information, try contacting the develloper or ask us at [[hpc@unige.ch]].
+  - Some software features *checkpointing*: during runtime, the program periodically saves its current state to disk. Another job can then use this snapshot to resume the computation. Check whether your program supports checkpointing. If you cannot find the information, try contacting the developer or ask us at [[hpc@unige.ch]].
   - You could add private nodes to Baobab. In that case the limit will be raised to 7 days or even higher. If you are interested, contact us.\\ \\
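The checkpointing workaround can be combined with Slurm job dependencies so that each job picks up where the previous one stopped. The following is only a minimal sketch, not an official recipe: the helper name ''submit_chain'' and the script ''my_job.sh'' are hypothetical, and it assumes your job script resumes from its latest checkpoint by itself. It relies on the ''--parsable'' and ''--dependency=afterany:<jobid>'' options of ''sbatch''; ''afterany'' is used rather than ''afterok'' so that the next job still starts when the previous one ends at the walltime limit.

```shell
# submit_chain N SCRIPT   (hypothetical helper, not provided on the cluster)
# Submits SCRIPT N times. Each job becomes eligible only after the previous
# one has ended in any state, including a timeout at the walltime limit.
# The body runs in a subshell, so its variables do not leak into your shell.
submit_chain() (
    n="$1"
    script="$2"
    prev=""
    i=1
    while [ "$i" -le "$n" ]; do
        if [ -z "$prev" ]; then
            jobid=$(sbatch --parsable "$script") || exit 1
        else
            jobid=$(sbatch --parsable --dependency=afterany:"$prev" "$script") || exit 1
        fi
        # --parsable prints "jobid" or "jobid;cluster": keep only the job id
        jobid="${jobid%%;*}"
        echo "submitted job $jobid"
        prev="$jobid"
        i=$((i + 1))
    done
)
```

For example, ''submit_chain 3 my_job.sh'' would queue three consecutive runs of the same script, each limited to the normal walltime.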
@@ Line 63: / Line 99: @@
 See [[hpc:slurm#how_is_the_priority_of_a_job_determined|here]]
-To get the priority calculation details of the jobs in the pending queue, you can use the command: ''sprio -l''. You can also have a look at the weights, by typing =sprio -w\".\\ \\
+To get the priority calculation details of the jobs in the pending queue, you can use the command: ''<nowiki>sprio -l</nowiki>''. You can also have a look at the weights by typing ''<nowiki>sprio -w</nowiki>''.
 ==== My jobs stay a long time in the pending queue... ====
-FIXME + add link to Best practice
+See
+  * [[hpc/slurm#which_partition_for_my_job|Which partition for my job]]
+  * [[hpc/slurm#job_priorities|Job priorities]]
+  * [[best_practices#stop_wasting_resources|Stop wasting resources]]
+==== Can I run interactive tasks ? ====
-FIXME copy post best use of resource from Yann in Community + merge with this
-If you find that the wait time of your jobs is too large, you can\\ try several strategies to see your code executed before:
+Yes, you can. But it is really awkward because you cannot be sure when your job will start.
-  - Be sure that your job will consume all 16 cores of each node.\\ If it's not the case, you can run simulteanously several jobs on the same node. For instance, if your computation uses only 2 threads, you can run 8 of these on a node. [http://baobabmaster.unige.ch/enduser/src/enduser/enduser.html#monothread-jobs]
-  - Shorter jobs are often scheduled earlier. Try to set the walltime limit as close as possible to the real execution time of your computation.
-  - If the walltime of your jobs is below 12h, you can use the ''shared'' partition which is larger than parallel.
-  - You can buy private nodes. They will be dedicated to your computations with high priority.
-==== Can I run interactive tasks ? ====
+See [[hpc/slurm#interactive_jobs|Interactive jobs]]
-FIXME ==> Yes => link to Slurm doc + access ??
+==== I'm not able to use all the cores of a compute node ====
+Indeed, we reserve two cores per node for system tasks such as data transfer and operating system processes.
-Yes, you can. But it is really awkward because you cannot be sure when your job will start:\\ \\ To request a full single node for interactive usage, use the command\\ ''salloc -N1''. This command will block your shell until a node can be allocated. It will then output:\\ \\ ''salloc: Granted job allocation 114544''\\ \\ And a sub-shell will be spawned. You can then run shell command in the\\ allocated node by prefixing them with ''srun''. **Attention:** If you\\ forget the ''srun'' your job will run on the login node instead, which\\ terribly impairs the cluster experience.\\
+<code>
+(yggdrasil)-[root@admin1 ~]$ scontrol show node cpu001
+NodeName=cpu001 Arch=x86_64 CoresPerSocket=18
+   CPUAlloc=0 CPUEfctv=34 CPUTot=36 CPULoad=0.01
+   AvailableFeatures=GOLD-6240,XEON_GOLD_6240,V9
+   ActiveFeatures=GOLD-6240,XEON_GOLD_6240,V9
+   Gres=(null)
+   NodeAddr=cpu001 NodeHostName=cpu001 Version=23.02.1
+   OS=Linux 4.18.0-477.10.1.el8_8.x86_64 #1 SMP Tue May 16 11:38:37 UTC 2023
+   RealMemory=187000 AllocMem=0 FreeMem=185338 Sockets=2 Boards=1
+   CoreSpecCount=2 CPUSpecList=17,35 <==================== this means we have two specialization cores <<<<
+   State=IDLE ThreadsPerCore=1 TmpDisk=150000 Weight=10 Owner=N/A MCS_label=N/A
+   Partitions=debug-cpu
+   BootTime=2023-08-10T12:08:11 SlurmdStartTime=2023-08-10T12:09:00
+   LastBusyTime=2023-08-11T10:06:42 ResumeAfterTime=None
+   CfgTRES=cpu=34,mem=187000M,billing=34
+   AllocTRES=
+   CapWatts=n/a
+   CurrentWatts=0 AveWatts=0
+   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
+</code>
+If you really need to use all the cores of a compute node, you can override this setting with the option ''--core-spec=0''. This implicitly leads to an exclusive allocation of the node.
+ref: https://slurm.schedmd.com/core_spec.html
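To see at a glance how many cores of a node are usable by jobs, the relevant fields can be pulled out of the ''scontrol show node'' output shown above. This is only a sketch: the helper name ''usable_cores'' is hypothetical, and it assumes the output contains the ''CPUTot'', ''CPUEfctv'' and ''CoreSpecCount'' fields as in the example.

```shell
# usable_cores   (hypothetical helper)
# Reads the output of "scontrol show node <name>" on stdin and prints the
# total, usable and reserved (specialization) core counts on one line.
usable_cores() {
    tr ' ' '\n' | awk -F= '
        $1 == "CPUTot"        { tot = $2 }
        $1 == "CPUEfctv"      { eff = $2 }
        $1 == "CoreSpecCount" { spec = $2 }
        END { printf "total=%s usable=%s reserved=%s\n", tot, eff, spec }'
}
```

On the cluster it would be used as ''scontrol show node cpu001 | usable_cores''.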
 ===== Troubleshooting =====
+==== Check ssh key ====
+If you connect to the cluster with an ssh key as authentication mechanism and you have trouble:
+  * check that you have the correct private key: the output of ''ssh-keygen -y -f id_rsa | cut -d' ' -f 2'' should match the output of ''cut -d' ' -f 2 id_rsa.pub''
 ==== Illegal instruction ====
-If you run a program and it crashes with an error ''"Illegal instruction"'' the reason is probably because you have compiled your program on login1 and your program is running on an older server on which the cpu lacks some specialized functionality that were used during the compilation on login1. \\
+If you run a program and it crashes with an error ''"Illegal instruction"'' the reason is probably because
-N.B. : login1 was running CentOS6, as of August 2019 all compute nodes run CentOS7
+you have compiled your program on the Baobab login node and your program is running on an older compute node
+on which the CPU lacks some specialized instructions that were used during the compilation.
 You have two possibilities:
-  - Recompile your program with less optimization
+  - Recompile your program with less optimization, or compile on an older node. See [[hpc:hpc_clusters#for_advanced_users|Advanced users]]
   - Only run your program on newer servers. See [[hpc:slurm#specify_the_cpu_type_you_want|Specify the CPU type you want]] and [[hpc:hpc_clusters#compute_nodes|Compute nodes]].
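To find out which instruction sets differ between two nodes, you can compare their CPU flags. The sketch below is an illustration only: the helper name ''cpu_simd_flags'' is hypothetical, the list of flags it looks for is an arbitrary selection, and it assumes a Linux x86 node where ''/proc/cpuinfo'' has a ''flags'' line.

```shell
# cpu_simd_flags [FILE]   (hypothetical helper)
# Prints a selection of SIMD-related CPU flags found in FILE
# (default: /proc/cpuinfo), one per line, sorted.
cpu_simd_flags() (
    cpuinfo="${1:-/proc/cpuinfo}"
    grep -m1 '^flags' "$cpuinfo" \
        | tr ' ' '\n' \
        | grep -E '^(sse4_1|sse4_2|avx|avx2|avx512f|fma)$' \
        | LC_ALL=C sort
)
```

Run it once on the login node and once inside a job on the compute node where the crash occurs: a flag that appears only on the login node is a likely culprit. If you compile with GCC (an assumption, your compiler may differ), building for a generic target such as ''-march=x86-64'' instead of ''-march=native'' avoids the newer instructions.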
