FAQ - Frequently asked questions

General

I'm lost, where can I find support ?

Since you are reading this FAQ, we suppose you already know how to find the documentation. If you need further explanation, advice or tips, contact us at hpc@unige.ch. We will try to answer your request as soon as possible.

Storage

Where should I store my files ? What should I do if I deleted something by mistake ? Is there a backup ? How can I restore a deleted file ? How much storage space is available ? My job creates lots of small temporary files and everything is slow…

Please check the Storage page for details.

Alternatively, if you need to store a large quantity of data, you could use another service such as the “Academic NAS”: https://catalogue-si.unige.ch/en/stockage-recherche

Applications

What applications are installed on Baobab ?

You can find information about available applications here

Can you install the software XYZ on Baobab ?

Please check this documentation.

Can I use any Microsoft Windows software ?

Baobab is a GNU/Linux only machine, like the majority of academic clusters. If you have Windows software that could run on a Windows cluster, contact us at hpc@unige.ch; we may be able to find a solution.

Can I use proprietary licensed software ?

Yes, we can install it, but you must pay for the required license. Send us a request at hpc@unige.ch.

I need a different Linux distribution/version, am I stuck ?

No, please check the Singularity documentation.

Running jobs (SLURM)

I am already familiar with torque/pbs/sge/lsf/..., what are the equivalent concepts in SLURM ?

Have a look at this scheduler “rosetta stone”, available here:
http://slurm.schedmd.com/rosetta.pdf

Can I run some small test runs in the login node ?

No, never. You must use SLURM to run any test, however small. The debug partition is dedicated to small tests.

I want to run the same job several times with different parameters...

In that case you can use the job array feature of SLURM. Please have a look at the Job array documentation.
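
As a minimal sketch, a job array script could look like the following. The array range, the parameter list, the walltime and the function name pick_param are illustrative assumptions and do not come from the Baobab documentation; --array and SLURM_ARRAY_TASK_ID are standard SLURM features. The final echo stands in for your real program.

```shell
#!/bin/sh
#SBATCH --job-name=param-sweep   # hypothetical job name
#SBATCH --array=1-4              # run 4 array tasks, with indices 1 to 4
#SBATCH --time=00:10:00          # illustrative walltime

# Hypothetical parameter list: one value per array index.
PARAMS="0.1 0.2 0.5 1.0"

# Print the N-th (1-based) entry of $PARAMS.
pick_param() {
    echo "$PARAMS" | cut -d ' ' -f "$1"
}

# SLURM sets SLURM_ARRAY_TASK_ID for each array task; default to 1
# so the script can also be tried outside of SLURM.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"
PARAM="$(pick_param "$TASK_ID")"

# Replace this echo with your real program, e.g. srun ./my_program "$PARAM"
echo "task ${TASK_ID} uses parameter ${PARAM}"
```

Submitted once with sbatch, such a script is started once per array index, and each task picks its own parameter.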

What partition should I choose ?

See here

Can I launch a job longer than 4 days ?

No, unfortunately you can't. If we raised this limit, you would have to wait longer for your pending jobs to start. We think that the 4-day limit is a good trade-off.

However, there are two work-arounds if this limit is a problem for you:

  1. Some software features checkpointing: during runtime, the program periodically saves its current state to disk. Another job can then use this snapshot to resume the computation. Check whether your program supports checkpointing. If you cannot find the information, try contacting the developer or ask us at hpc@unige.ch.
  2. You could add private nodes to Baobab. In that case the limit will be raised to 7 days or even higher. If you are interested, contact us.
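
To illustrate the checkpointing idea from the first work-around, here is a minimal sketch of a loop that saves its progress to a file and resumes from it when restarted. The file name checkpoint.txt, the step count and the function names are hypothetical; real applications usually provide their own checkpoint mechanism, which you should prefer.

```shell
#!/bin/sh
# Hypothetical checkpoint file and total amount of work.
CHECKPOINT="${CHECKPOINT:-checkpoint.txt}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"

# Print the last completed step, or 0 if there is no checkpoint yet.
load_checkpoint() {
    if [ -f "$CHECKPOINT" ]; then
        cat "$CHECKPOINT"
    else
        echo 0
    fi
}

# Run the steps after the last checkpoint, up to step $1,
# saving the state after every completed step.
run_until() {
    step="$(load_checkpoint)"
    while [ "$step" -lt "$1" ]; do
        step=$((step + 1))
        # ... one unit of real work would go here ...
        echo "$step" > "$CHECKPOINT"
    done
    echo "$step"
}

run_until "$TOTAL_STEPS"
```

If a job running this script is stopped at the walltime limit, a second job running the same script continues from the last saved step instead of starting from zero.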

How are the priorities computed ?

See here

To get the priority calculation details of the jobs in the pending queue, you can use the command sprio -l. To see the configured weight of each priority factor, type sprio -w.

My jobs stay a long time in the pending queue...

If your jobs wait too long in the pending queue, you can try several strategies to get them started sooner:

  1. Be sure that your job uses all 16 cores of each node. If that is not the case, you can run several jobs simultaneously on the same node. For instance, if your computation uses only 2 threads, you can run 8 of them on a node. See http://baobabmaster.unige.ch/enduser/src/enduser/enduser.html#monothread-jobs
  2. Shorter jobs are often scheduled earlier. Set the walltime limit as close as possible to the real execution time of your computation.
  3. If the walltime of your jobs is below 12h, you can use the shared partition, which is larger than the parallel partition.
  4. You can buy private nodes. They will be dedicated to your computations with high priority.
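
The first three strategies can be combined in the header of the job script. The following is only a sketch: the job name, the 2-thread count and the 2-hour walltime are illustrative values, setting OMP_NUM_THREADS assumes an OpenMP program, and the echo stands in for your real program. The shared partition is the one mentioned in the list.

```shell
#!/bin/sh
#SBATCH --job-name=small-job     # hypothetical job name
#SBATCH --partition=shared       # walltime is below 12h, so shared can be used
#SBATCH --ntasks=1               # a single task ...
#SBATCH --cpus-per-task=2        # ... that requests only the 2 cores it needs
#SBATCH --time=02:00:00          # illustrative walltime, close to the real run time

# SLURM exports SLURM_CPUS_PER_TASK when --cpus-per-task is set; default to 2
# so the script can also be tried outside of SLURM.
THREADS="${SLURM_CPUS_PER_TASK:-2}"

# Assumption: the program is multi-threaded with OpenMP.
export OMP_NUM_THREADS="$THREADS"

# Replace this echo with your real program, e.g. srun ./my_program
echo "running with ${OMP_NUM_THREADS} threads"
```

Requesting only the cores and the time the computation really needs gives the scheduler more freedom to place the job early.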

Can I run interactive tasks ?

Yes, you can, but it is awkward because you cannot be sure when your job will start.

To request a full single node for interactive usage, use the command salloc -N1. This command blocks your shell until a node can be allocated. It then outputs:

salloc: Granted job allocation 114544

A sub-shell is then spawned. You can run shell commands on the allocated node by prefixing them with srun. Attention: if you forget the srun prefix, your command will run on the login node instead, which degrades the cluster for all other users.

Troubleshooting

Illegal instruction

If you run a program and it crashes with the error “Illegal instruction”, the most likely cause is that you compiled your program on login1 and it is running on an older server whose CPU lacks some specialized instructions that were used during compilation on login1.
N.B.: login1 was running CentOS6, as of August 2019 all compute nodes run CentOS7.

You have two possibilities:

  1. Recompile your program with less optimization
  2. Only run your program on newer servers. See Specify the CPU type you want and Compute nodes.
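
One way to see whether a node supports an instruction set your binary may depend on is to look at the CPU flags that Linux reports in /proc/cpuinfo. The sketch below assumes an x86 Linux system; the function name has_cpu_flag and the choice of avx2 as the flag to check are illustrative.

```shell
#!/bin/sh
# Return success if the CPU flag $1 is listed on the "flags" line of the
# cpuinfo file $2 (defaults to /proc/cpuinfo, which exists on Linux only).
has_cpu_flag() {
    grep -m 1 '^flags' "${2:-/proc/cpuinfo}" 2>/dev/null | grep -qw -- "$1"
}

# Example: check for AVX2, an instruction set that older CPUs lack.
if has_cpu_flag avx2; then
    echo "avx2 supported"
else
    echo "avx2 not supported"
fi
```

Run on a compute node (through SLURM, not on the login node), this shows whether the node can execute a binary built with that instruction set. If your build uses a compiler option such as GCC's -march=native, the binary is tuned for the CPU of the machine where it was compiled, which is a common source of this error.
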
hpc/faq.1606390581.txt.gz · Last modified: 2020/11/26 12:36 by Yann Sagon