FAQ - Frequently asked questions

General

I'm lost, where can I find support?

Since you are reading this FAQ, we suppose you already know how to find the documentation. If you need any further explanation, advice, tips, etc., contact us at hpc@unige.ch. We will try to answer your request as soon as possible.

Storage

Where should I store my files? What should I do if I deleted something by mistake? Is there a backup? How can I restore a deleted file? How much storage space is available? My job creates lots of temporary small files and everything is slow…

Please check the Storage page for details.

Alternatively, if you need to store a large quantity of data, you could use another service such as the “Academic NAS”: https://catalogue-si.unige.ch/en/stockage-recherche

Applications

What applications are installed on Baobab?

You can find information about available applications here.
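As a hedged illustration (assuming the applications are provided as environment modules, which the linked page should confirm; the module name below is hypothetical), you can usually list and load software like this:

<code bash>
module avail                 # list all modules available on the cluster
module spider foss           # search for a package by name (Lmod clusters only)
module load foss/2019b       # load a specific, hypothetical module version
</code>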

Can you install the software XYZ on Baobab?

Please check this documentation.

Can I use any Microsoft Windows software?

Baobab is a GNU/Linux-only machine, like the majority of academic clusters. If you have Windows software that would require a Windows cluster, contact us at hpc@unige.ch; perhaps we can find a solution.

Can I use proprietary licensed software?

Yes, we can install it, but you have to pay for the required license. Send us a request at hpc@unige.ch.

I need a different Linux distribution/version, am I stuck?

No, please check the Singularity documentation.
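For example, a minimal sketch (the image name is just an illustration) of running a command inside a container with a different distribution, submitted through SLURM:

<code bash>
# Pull a container image from Docker Hub (creates ubuntu_20.04.sif)
singularity pull docker://ubuntu:20.04
# Run a command inside the container as a SLURM job step
srun singularity exec ubuntu_20.04.sif cat /etc/os-release
</code>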

Running jobs (SLURM)

I am already familiar with ''torque/pbs/sge/lsf/...'', what are the equivalent concepts in SLURM?

Have a look at this scheduler “rosetta stone”, available here:
http://slurm.schedmd.com/rosetta.pdf
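As a quick orientation, a few common command equivalences from that mapping:

<code>
qsub script.sh   ->  sbatch script.sh    # submit a batch job
qstat            ->  squeue              # list queued and running jobs
qdel <jobid>     ->  scancel <jobid>     # cancel a job
qstat -Q         ->  sinfo               # queue/partition information
</code>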

Can I run some small tests on the login node?

No, never. You must use SLURM to run any test; the debug partition is dedicated to small tests.
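For instance, a quick sanity check can be run as a proper SLURM job on the debug partition (the program name below is a placeholder):

<code bash>
# A short test submitted to the debug partition instead of the login node
srun --partition=debug --time=00:05:00 --ntasks=1 ./myprog --version
</code>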

I want to run the same job several times with different parameters...

In that case you can use the job array feature of SLURM. Please have a look at the Job array documentation.
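As a minimal sketch (the program and parameter file names are placeholders), a job array submission script could look like this:

<code bash>
#!/bin/sh
#SBATCH --job-name=param-sweep
#SBATCH --partition=shared
#SBATCH --time=01:00:00
#SBATCH --array=1-10                       # 10 tasks, indices 1 to 10
# Each array task receives its own index in SLURM_ARRAY_TASK_ID
srun ./myprog params_${SLURM_ARRAY_TASK_ID}.txt
</code>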

What partition should I choose?

Currently we have four partitions: parallel, shared, bigmem, and debug. The criteria to choose the right partition are:

  1. If it's just a test, send it to the debug partition. It will be limited to 15 minutes, but that will be enough to check if everything starts fine. On the plus side, two nodes are reserved for the debug partition during the day, so you won't need to wait much.
  2. If your job needs more than 64GB of RAM (i.e. more than 4GB per core), you should use the bigmem partition. Currently this partition has only a single node, so first check that your job cannot run on any other node.
  3. If you have several jobs with less than 12 hours of runtime, use the shared partition. This partition is larger than the parallel partition, so your jobs may start earlier. It is especially recommended for jobs using a single node.
  4. If your job did not fall into any of the three categories above, use the parallel partition.
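As an illustration of point 2, a hedged sketch of a submission script targeting the bigmem partition (the memory figure and program name are placeholders):

<code bash>
#!/bin/sh
#SBATCH --partition=bigmem     # only if the job needs more than 64GB of RAM
#SBATCH --mem=128G             # hypothetical memory requirement
#SBATCH --time=1-00:00:00
srun ./myprog                  # "myprog" is a placeholder for your program
</code>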

Can I launch a job longer than 4 days?

No, unfortunately you can't. If we raised this limit, you would have to wait longer before your pending jobs start. We think that the 4-day limit is a good trade-off.

However, there are two possible workarounds if this limit is an issue for you:

  1. Some software features *checkpointing*: during runtime, the program periodically saves its current state to disk, and this snapshot can be used by another job to resume the computation (see the sketch after this list). Check whether your program supports checkpointing. If you cannot find the information, try contacting the developer or ask us at hpc@unige.ch.
  2. You could add private nodes to Baobab. In that case the limit will be raised to 7 days or even higher. If you are interested, contact us.
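A minimal sketch of the checkpointing approach, assuming a program (the hypothetical "myprog" below) that writes its state to a file and can resume from it; each job resubmits a follow-up job that starts once the current one ends:

<code bash>
#!/bin/sh
#SBATCH --partition=parallel
#SBATCH --time=4-00:00:00
# Submit the follow-up job first; it stays pending until this job ends.
# Add your own stopping condition, otherwise the chain never terminates.
sbatch --dependency=afterany:$SLURM_JOB_ID "$0"
# "myprog" periodically writes state.chk and resumes from it on restart
srun ./myprog --checkpoint state.chk
</code>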

How are the priorities computed?

A job's priority depends on several factors. The main ones are:

  1. The *fair-share* value of the user submitting the job. This value depends on your past relative usage: the more you've used the cluster, the lower the *fair-share* value. Past usage is forgotten with time, using an exponential decay formula. The actual fair-share computation is a bit hairy and is described in this document: http://slurm.schedmd.com/priority_multifactor.html#fairshare
  2. The *age* of the job. That value is proportional to the time a job spends in the pending queue.
  3. The *size* of the job, proportional to the number of nodes requested. Scheduling large jobs first allows a better usage of the whole cluster.
  4. The *partition* used. Users owning private nodes and using the corresponding private partition always have a higher priority.

Those factors are weighted so as to give the fair-share value the highest impact on the priorities.

To get the priority calculation details of the jobs in the pending queue, you can use the command sprio -l. You can also have a look at the weights by typing sprio -w.
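Roughly speaking (the exact weights are site-specific), SLURM's multifactor plugin combines the factors as a weighted sum, and sprio shows the per-job breakdown:

<code bash>
# priority ~ w_fairshare * fairshare + w_age * age
#          + w_jobsize  * jobsize   + w_partition * partition + ...
sprio -l     # per-factor priority details for every pending job
sprio -w     # the weight configured for each factor on this cluster
</code>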

My jobs stay in the pending queue for a long time...

If you find that the wait time of your jobs is too long, you can
try several strategies to get your jobs executed sooner:

  1. Be sure that your job will consume all 16 cores of each node.
    If that is not the case, you can run several jobs simultaneously on the same node. For instance, if your computation uses only 2 threads, you can run 8 of them on a node (see the sketch after this list). [http://baobabmaster.unige.ch/enduser/src/enduser/enduser.html#monothread-jobs]
  2. Shorter jobs are often scheduled earlier. Try to set the walltime limit as close as possible to the real execution time of your computation.
  3. If the walltime of your jobs is below 12h, you can use the shared partition which is larger than parallel.
  4. You can buy private nodes. They will be dedicated to your computations with high priority.
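As an illustration of point 1, a hedged sketch (the program and input names are placeholders) that packs eight 2-thread runs onto a single 16-core node within one job:

<code bash>
#!/bin/sh
#SBATCH --partition=shared
#SBATCH --time=06:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8             # 8 independent runs...
#SBATCH --cpus-per-task=2      # ...of 2 threads each = 16 cores
for i in $(seq 1 8); do
    # each job step gets its own 2 cores
    srun --ntasks=1 --cpus-per-task=2 --exclusive ./myprog input_$i &
done
wait                           # keep the job alive until all steps finish
</code>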

Can I run interactive tasks?

Yes, you can, but it is somewhat awkward because you cannot be sure when your job will start:

To request a full single node for interactive usage, use the command
salloc -N1. This command will block your shell until a node can be allocated. It will then output:

salloc: Granted job allocation 114544

A sub-shell will then be spawned. You can run shell commands on the
allocated node by prefixing them with srun. Attention: if you
forget the srun, your command will run on the login node instead, which
severely degrades the cluster experience for everyone.
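A hedged example of such a session (the node name in the output is hypothetical):

<code bash>
$ salloc -N1                  # ask for one full node; blocks until granted
salloc: Granted job allocation 114544
$ srun hostname               # runs on the allocated compute node
node042                       # hypothetical node name
$ hostname                    # without srun: runs on the login node!
login1
$ exit                        # release the allocation when you are done
</code>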

Troubleshooting

Illegal instruction

If you run a program and it crashes with an “Illegal instruction” error, the reason is probably that you compiled your program on login1 and it is running on an older server whose CPU lacks some specialized instructions that were used during compilation on login1.
N.B.: login1 was running CentOS6; as of August 2019 all compute nodes run CentOS7.

You have two possibilities:

  1. Recompile your program with less aggressive optimization (see the sketch after this list).
  2. Only run your program on newer servers. See Specify the CPU type you want and Compute nodes.
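As a hedged illustration of option 1 (the compiler invocation and file names are placeholders), target a generic x86-64 CPU instead of the login node's native one:

<code bash>
# Portable build: runs on every node, possibly a bit slower
gcc -O2 -march=x86-64 -o myprog myprog.c
# Native build: fastest on login1, but may use instructions
# that older compute nodes cannot execute ("Illegal instruction")
# gcc -O3 -march=native -o myprog myprog.c
</code>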