FAQ - Frequently asked questions

General

Which cluster should I use ?

You can use both clusters, but see this link to help you choose the right cluster.

I'm lost, where can I find support ?

Since you are reading this FAQ, we assume you already know how to find the documentation. If you need further explanation, advice or tips, contact us at hpc@unige.ch. We will try to answer your request as soon as possible.

Citation, publication and acknowledgments

The cluster is slow

This can happen. The first step is to determine what is slow:

  • the login node: you shouldn't run any job on the login node. Another user may be doing so and using all the CPUs. In this case, only the login node is slow, not the jobs running on the compute nodes.
  • the compute node (the cpu, the storage)
  • the storage (home, scratch, other): in this case, the whole cluster is impacted and your job can run slowly.

What to do: make sure you aren't the cause. Check with htop on the login node. If you see that all the CPUs are in use, please take a screenshot and send it to us at hpc@unige.ch.

Account

When does my account expire ?

  • If you have a non-student account (PhD, postdoc, researcher), your account expires when your contract at UNIGE ends. Right now, there is a grace period of around 6 months after the end of your contract.
  • If you have an outsider account, check the expiration date you received when you filled in the invitation.
  • If you have a UNIGE student account, you can check the expiration date with the chage command:
(baobab)-[yourusername@login2 ~]$ chage -l yourusername
Last password change                                    : Apr 01, 2022
Password expires                                        : never
Password inactive                                       : never
Account expires                                         : never
Minimum number of days between password change          : 0
Maximum number of days between password change          : 99999
Number of days of warning before password expires       : 7
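
If you only need the expiry date, the output above can be filtered. The helper below is a minimal sketch: the function name account_expiry is hypothetical and not part of any cluster tooling, and it assumes the "Account expires" line is formatted as in the listing above.

```shell
# account_expiry: read `chage -l` output on stdin and print only the
# value of the "Account expires" line (a date, or "never").
account_expiry() {
  awk -F': ' '/^Account expires/ { print $2 }'
}

# Usage on a login node (replace yourusername with your own login):
#   chage -l yourusername | account_expiry
```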

I'm leaving UNIGE, can I continue to use Baobab HPC service?

Yes, this is possible as long as you collaborate closely with your former research group. Your PI must invite you as an outsider. For technical reasons, your account must have expired before the invitation request is made. We will then reactivate your account, and you will keep your data.

Storage

Where should I store my files ? What should I do if I deleted something by mistake ? Is there a backup ? How can I restore a deleted file ? What amount of storage space is available ? My job creates lots of temporary small files and everything is slow…

Please check the Storage page for details.

Alternatively, if you need to store a large quantity of data, you could use another service such as the “Academic NAS” : https://catalogue-si.unige.ch/en/stockage-recherche

Applications

What applications are installed on Baobab ?

You can find information about available applications here

Can you install the software XYZ on Baobab ?

Please check this documentation.

Can I use any Microsoft Windows software ?

Baobab is a GNU/Linux-only machine, like the majority of academic clusters. If you have Windows software that could run on a Windows cluster, contact us at hpc@unige.ch; we may be able to find a solution.

Can I use a proprietary licensed software ?

Yes, we can install it, but you must pay for the required license. Send us a request at hpc@unige.ch.

I need a different Linux distribution/version, am I stuck ?

No, please check the Singularity documentation.

Running jobs (SLURM)

I am already familiar with torque/pbs/sge/lsf/..., what are the equivalent concepts in slurm ?

Have a look at this scheduler “rosetta stone”, available here:
http://slurm.schedmd.com/rosetta.pdf

Can I run some small test runs in the login node ?

No, never. You must use SLURM to run any test. The debug partition is dedicated to small tests.

I want to run the same job several times with different parameters...

In that case you can use the job array feature of SLURM. Please have a look at the Job array documentation.
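
As an illustration, a job array can read one parameter per task from a text file. This is a minimal sketch under several assumptions: the file params.txt (one parameter per line), the helper pick_param, the program name in the comment and the 5 minute limit are all hypothetical, and debug-cpu is used only because it is a partition name that appears elsewhere in this FAQ. The --array option and the SLURM_ARRAY_TASK_ID variable are standard SLURM features.

```shell
#!/bin/sh
#SBATCH --job-name=param-sweep
#SBATCH --partition=debug-cpu
#SBATCH --time=00:05:00
#SBATCH --array=1-3

# pick_param FILE INDEX: print line number INDEX of FILE
# (hypothetical helper; one parameter set per line).
pick_param() {
  sed -n "${2}p" "$1"
}

# SLURM sets SLURM_ARRAY_TASK_ID for each array task. Outside a job it is
# unset, so nothing below runs if this file is merely sourced.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
  param="$(pick_param params.txt "$SLURM_ARRAY_TASK_ID")"
  echo "array task ${SLURM_ARRAY_TASK_ID}: parameter '${param}'"
  # Replace the echo above with your real command, for example:
  #   srun ./my_program "$param"
fi
```

Submitted once with sbatch, this script would start three tasks, each one seeing a different value of SLURM_ARRAY_TASK_ID (1, 2 and 3).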

What partition should I choose ?

See here

Can I launch a job longer than 4 days ?

No, unfortunately you can't. If we raised this limit, you would have to wait longer before your pending jobs start. We think that the 4 days limit is a good trade-off.

However, there are two possible workarounds if this limit is a problem for you:

  1. Some software features *checkpointing*: during runtime, the program periodically saves its current state to disk. Another job can then use this snapshot to resume the computation. Check if your program allows checkpointing. If you cannot find the information, try contacting the developer or ask us at hpc@unige.ch.
  2. You could add private nodes to Baobab. In that case the limit will be raised to 7 days or even higher. If you are interested, contact us.
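
With a program that supports checkpointing, a long computation can be split into a chain of jobs in which each job starts only after the previous one has finished successfully. The sketch below is one possible way to do this, not an official recipe: the function name submit_chain and the SBATCH variable are hypothetical, and it assumes your job script resumes from its latest checkpoint by itself. It relies on two standard sbatch options, --parsable (print only the job ID) and --dependency=afterok:<jobid>.

```shell
# Command used to submit jobs; overridable so the function can be dry-run.
SBATCH="${SBATCH:-sbatch}"

# submit_chain SCRIPT N: submit SCRIPT N times; every job after the first
# waits until the previous one has completed successfully.
submit_chain() {
  script="$1"
  count="$2"
  prev=""
  i=1
  while [ "$i" -le "$count" ]; do
    if [ -z "$prev" ]; then
      out="$("$SBATCH" --parsable "$script")" || return 1
    else
      out="$("$SBATCH" --parsable --dependency=afterok:"$prev" "$script")" || return 1
    fi
    prev="${out%%;*}"   # --parsable prints "jobid" or "jobid;cluster"
    echo "segment $i submitted as job $prev"
    i=$((i + 1))
  done
}

# Usage (my_job.sh is a placeholder for your own batch script):
#   submit_chain my_job.sh 3
```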

How are the priorities computed ?

See here

To get the priority calculation details of the jobs in the pending queue, you can use the command: sprio -l. You can also have a look at the configured weights by typing sprio -w.

My jobs stay a long time in the pending queue...

Can I run interactive tasks ?

Yes, you can, but it can be awkward because you cannot know in advance when your job will start.

See Interactive jobs

I'm not able to use all the cores of a compute node

Indeed, we reserve two cores per node for system tasks such as data transfer and operating system services.

(yggdrasil)-[root@admin1 ~]$ scontrol show node cpu001
NodeName=cpu001 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=0 CPUEfctv=34 CPUTot=36 CPULoad=0.01
   AvailableFeatures=GOLD-6240,XEON_GOLD_6240,V9
   ActiveFeatures=GOLD-6240,XEON_GOLD_6240,V9
   Gres=(null)
   NodeAddr=cpu001 NodeHostName=cpu001 Version=23.02.1
   OS=Linux 4.18.0-477.10.1.el8_8.x86_64 #1 SMP Tue May 16 11:38:37 UTC 2023
   RealMemory=187000 AllocMem=0 FreeMem=185338 Sockets=2 Boards=1
   CoreSpecCount=2 CPUSpecList=17,35 <==================== this means we have two specialization cores <<<<
   State=IDLE ThreadsPerCore=1 TmpDisk=150000 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=debug-cpu
   BootTime=2023-08-10T12:08:11 SlurmdStartTime=2023-08-10T12:09:00
   LastBusyTime=2023-08-11T10:06:42 ResumeAfterTime=None
   CfgTRES=cpu=34,mem=187000M,billing=34
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

If you really need to use all the cores of a compute node, you can override this parameter with --core-spec=0. This will implicitly lead to an exclusive allocation of the node.

ref: https://slurm.schedmd.com/core_spec.html
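
To see at a glance how many cores are usable on a node, the relevant fields can be extracted from the scontrol output. This is a minimal sketch: the function name node_cores is hypothetical, and it assumes the field names CPUEfctv, CPUTot and CoreSpecCount appear as in the listing above.

```shell
# node_cores: read `scontrol show node <name>` output on stdin and print
# the usable, total and reserved (specialization) core counts.
node_cores() {
  tr ' ' '\n' | awk -F= '
    $1 == "CPUEfctv"      { usable = $2 }
    $1 == "CPUTot"        { total = $2 }
    $1 == "CoreSpecCount" { reserved = $2 }
    END { printf "usable=%s total=%s reserved=%s\n", usable, total, reserved }'
}

# Usage:
#   scontrol show node cpu001 | node_cores
# A job that needs every core of the node would add this line to its
# batch script:
#   #SBATCH --core-spec=0
```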

Troubleshooting

Check ssh key

If you connect to the cluster with an ssh key as authentication mechanism and you have trouble:

  • check you have the correct private key: the output of ssh-keygen -y -f id_rsa | cut -d' ' -f 2 should match the output of cut -d' ' -f 2 id_rsa.pub
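
The comparison above can be wrapped in a small helper. This is only a sketch: the function name same_key is hypothetical, and the paths in the usage comment assume the key pair is stored in ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub.

```shell
# same_key PUBLIC_LINE_A PUBLIC_LINE_B: succeed when both lines carry the
# same key material (second space-separated field), ignoring the trailing
# comment that a .pub file usually contains.
same_key() {
  a="$(printf '%s\n' "$1" | cut -d' ' -f2)"
  b="$(printf '%s\n' "$2" | cut -d' ' -f2)"
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage:
#   if same_key "$(ssh-keygen -y -f ~/.ssh/id_rsa)" "$(cat ~/.ssh/id_rsa.pub)"; then
#     echo "private and public key match"
#   else
#     echo "mismatch: this private key does not belong to this public key"
#   fi
```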

Illegal instruction

If you run a program and it crashes with an “Illegal instruction” error, the most likely reason is that you compiled your program on the Baobab login node and it is running on an older compute node whose CPU lacks some of the specialized instructions that were used during compilation.

You have two possibilities:

  1. Recompile your program with less optimization, or compile on an older node. See Advanced users
  2. Only run your program on newer servers. See Specify the CPU type you want and Compute nodes.
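
To find out whether a given node supports a particular instruction set extension, you can inspect the CPU flags reported by the Linux kernel. This is a minimal sketch: the function name cpu_has_flag is hypothetical, and the flag names in the usage comment (avx2, avx512f) are only common examples, not a statement about which instructions your program uses.

```shell
# cpu_has_flag FLAG [CPUINFO_FILE]: succeed if FLAG is listed on the first
# "flags" line of CPUINFO_FILE (default: /proc/cpuinfo on the current node).
cpu_has_flag() {
  flags="$(grep -m1 '^flags' "${2:-/proc/cpuinfo}")" || return 1
  case " $(printf '%s' "$flags" | tr '\t' ' ') " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: run this on the login node and inside a job on a compute node,
# then compare the results.
#   for f in avx2 avx512f; do
#     if cpu_has_flag "$f"; then echo "$f: yes"; else echo "$f: no"; fi
#   done
```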
hpc/faq.txt · Last modified: 2023/12/13 08:37 by Yann Sagon