
How to use Slurm with Saliksik HPC


COARE Operations

Topics to Discuss

  • Slurm Commands to Remember
  • Jobs and Job Steps
  • Serial and Parallel Jobs
  • Sbatch vs Srun
  • Slurm Resource Management
  • Scaling
  • Working with Anaconda
  • Working with Singularity

Slurm Commands to Remember

  • sbatch
  • squeue
  • srun
  • scancel
  • sacct
  • sinfo
  • salloc
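
A quick sketch of typical usage; the job script name and job ID below are placeholders:

    sbatch job.slurm                  # submit a job script for later execution
    squeue -u $USER                   # list your pending and running jobs
    srun --ntasks=4 hostname          # launch a command as a (parallel) job step
    scancel 123456                    # cancel job 123456
    sacct -j 123456                   # show accounting info for a past job
    sinfo                             # show partitions and node states
    salloc --ntasks=1 --time=00:30:00 # request an interactive allocation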

Serial and Parallel Jobs

Serial Jobs
These jobs run tasks one after another on a single processor.
  • They are suitable for simpler computations that don’t require much processing power or speed.
  • Think of it like doing one task at a time in sequence.
Parallel Jobs
These jobs split a task into smaller sub-tasks that run simultaneously on multiple processors.
  • This approach is ideal for more complex computations that need faster processing and more power.
  • It’s like having several people working on different parts of a project at the same time to get it done faster.
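
As a rough sketch (my_program and its input files are placeholders), the same work could be run either way:

    # Serial: one task on one CPU, steps run one after another
    srun --ntasks=1 ./my_program input1.dat
    srun --ntasks=1 ./my_program input2.dat

    # Parallel: four copies of the task run at the same time on the allocated CPUs
    srun --ntasks=4 ./my_program input.dat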

Sbatch vs Srun

sbatch
is for jobs that will be run at a later time
  • This command is used to submit a batch script to Slurm
  • Non-blocking
  • Schedules the job for later execution
srun
is used to run parallel jobs on a cluster managed by Slurm
  • Interactive
  • Blocking
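
A minimal comparison; the script and program names are placeholders:

    # sbatch: non-blocking; the script is queued and runs later
    sbatch analysis.slurm             # returns immediately with a job ID

    # srun: blocking and interactive; output streams to your terminal
    srun --ntasks=2 --time=00:10:00 python analyze.py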

Jobs, Job Steps, and Job Scripts

Think of using Slurm as if you are preparing a meal.

The meal that you are going to cook is the job.

The individual steps to accomplish it are your job steps.

These include unpacking the ingredients, slicing the vegetables, boiling the water, and so on.

Continuation… Part 1/3

But I’m a busy person…got no time for this :(

Can I outsource this and let the Mecha Chefs (compute nodes) do the cooking for me? :)

Continuation… Part 2/3

Well Yes! That’s the purpose of SLURM.

All we need is a text file that contains the necessary ingredients and procedures; we call this the job script (a minimal sketch follows below).

Let the SLURMagic do it for you!
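
A minimal sketch of such a job script; the partition name and the step scripts are placeholders, and each srun line is one job step:

    #!/bin/bash
    #SBATCH --job-name=cook-meal      # the job (the meal)
    #SBATCH --partition=batch         # placeholder partition name
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=2
    #SBATCH --mem=4G
    #SBATCH --time=01:00:00
    #SBATCH --output=meal_%j.log

    # the job steps (the individual cooking steps)
    srun ./prepare_ingredients.sh
    srun ./cook.sh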

Continuation… Part 3/3

But I have specific requirements for which Mecha Chef (compute node) should do the job for me…

We need to talk about that!

Slurm Resource Management

  • How does Slurm manage resources?
    • Selection of nodes
    • Allocation of CPUs from the selected nodes
    • Distribution of tasks to the selected nodes
    • Optional distribution and binding of tasks to CPUs within a node

What affects the selection?

Option Description
-c, --cpus-per-task <n> Controls the number of CPUs allocated per task
--exclusive Prevents sharing of allocated nodes with other jobs; suballocates CPUs to job steps
-N, --nodes <min[-max]> Controls the minimum/maximum number of nodes allocated to the job
-F, --nodefile <filename> File containing a list of specific nodes to be selected for the job (salloc and sbatch only)
--mincpus <n> Controls the minimum number of CPUs allocated per node

What affects the selection? (continued)

Option Description
--ntasks-per-core Controls the maximum number of tasks per allocated core
--ntasks-per-node <number> Controls the maximum number of tasks per allocated node
-O, --overcommit Allows fewer CPUs to be allocated than the number of tasks
-p, --partition <partition_names> Controls which partition is used for the job
-w, --nodelist <host1,...,hostN or filename> List of specific nodes to be allocated to the job
-x, --exclude <host1,...,hostN or filename> List of specific nodes to be excluded from allocation to the job
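
A sketch of how several of these options combine in a job script; the partition and node names are hypothetical:

    #!/bin/bash
    #SBATCH --nodes=2                 # -N: number of nodes
    #SBATCH --ntasks-per-node=4       # at most 4 tasks on each allocated node
    #SBATCH --cpus-per-task=2         # -c: CPUs allocated per task
    #SBATCH --partition=batch         # -p: placeholder partition name
    #SBATCH --exclude=node01,node02   # -x: hypothetical nodes to skip

    srun ./my_parallel_program        # 2 nodes x 4 tasks = 8 tasks, 2 CPUs each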

Scaling

Resource Availability
Smaller resource requests are more likely to be fulfilled quickly since they fit into available slots on the cluster. Larger requests can take longer to schedule.
Scalability Testing
By starting small, you can test if your job runs successfully. If it does, you can then request more resources incrementally, ensuring you don’t over-allocate.
Resource Optimization
It helps in identifying the minimal resource requirement for your job, preventing waste and ensuring other jobs can also access resources.

Testing Phase

First submission
You request 1 CPU and 2GB of memory to ensure the script runs without errors.
Once successful
You might increase it to 2 CPUs and 4GB of memory to improve performance.
Final request
Once you’re confident, you could scale up to 4 CPUs and 8GB of memory for even faster processing.
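
In a job script these would be successive versions of the same directives (one pair per submission, not all in one script):

    # First submission
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=2G

    # Once successful
    #SBATCH --cpus-per-task=2
    #SBATCH --mem=4G

    # Final request
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=8G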

Resource Optimization

Start small
Request 2 CPUs and 4GB of memory.
Monitor performance
Check if the job completes successfully and analyze its resource usage.
Adjust accordingly
If the job requires more power, scale up to 8 CPUs and 16GB of memory. Conversely, if it used less than expected, you can scale down for the next run.
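
One way to check actual usage is Slurm accounting, for example (123456 is a placeholder job ID; MaxRSS is reported per job step):

    sacct -j 123456 --format=JobID,Elapsed,AllocCPUS,ReqMem,MaxRSS,State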

Anaconda in HPC 🐍

  • Anaconda works the same way as on your personal or work computer
  • It needs a bit of configuration to select where environments and packages are stored in a managed-disk environment like the Saliksik HPC

Overview

Anaconda, a.k.a. Conda, is a Python package manager that simplifies the installation of scientific libraries and environment management.

Configuration

  • By default, Conda stores environment and package files in these paths:
    • ~/.conda/envs
    • ~/.conda/pkgs
  • Changing the envs and pkgs locations

    • Command line
    conda config --add envs_dirs ~/scratch3/conda/envs
    conda config --add pkgs_dirs ~/scratch3/conda/pkgs

    • Manually modify the ~/.condarc

      envs_dirs:
          - ~/scratch3/conda/envs
      pkgs_dirs:
          - ~/scratch3/conda/pkgs
      
  • You need to manually create the envs and pkgs directories

    mkdir -p ~/scratch3/conda/envs ~/scratch3/conda/pkgs
    

Why do we want to configure where Conda puts these files?

  • The Saliksik HPC uses multiple volumes to handle its massive storage requirements
  • The home directory has a relatively small capacity; scratch1, scratch2, and especially the scratch3 volume can hold much more data
  • These volumes are linked in your home directory, e.g. /home/your-username/scratch1 -> /scratch1/your-username. The same goes for scratch2 and scratch3.
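
You can see the links from your home directory, for example:

    ls -ld ~/scratch1 ~/scratch2 ~/scratch3
    # e.g. ~/scratch3 -> /scratch3/your-username (and similarly for the others)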

Conda Channels

Add the 'conda-forge' and 'bioconda' channels as a backup to 'defaults':

  • conda config --append channels conda-forge
  • conda config --append channels bioconda
  • It's important to keep the defaults channel since some packages depend on it.
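
You can confirm the resulting channel list with:

    conda config --show channels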

Creating an environment

  • To create an environment named myenv
    • conda create -n myenv
    • conda create -n myenv python=3.8 # use Python 3.8
  • If envs_dirs is not set, the path can be given during creation

    conda create --prefix ~/scratch3/conda/envs/myenv   
    

Search in the default channel

    conda search package_name
    conda search numpy
  • Search in a specific channel

    conda search -c conda-forge package_name
    conda search -c bioconda biopython
    

Activate and Install packages

  • If you have an environment named myenv, activate it with
conda activate myenv
  • Once activated, you can install packages with
conda install <package_name>

Install from a specific channel:

conda install -c <channel_name> <package_name>

Singularity

What is singularity?

  • Singularity is a container platform; a container holds all the software and tools your program needs to run.
  • It ensures that your program runs the same way, no matter where you run it.
  • The file system inside a singularity container is isolated from the Host OS.

Use case

  • You want to use some libraries that are not available on the HPC
  • You want to use a different version of a library
  • You want to use some tools that are not available on the HPC
  • You want to use a different version of a tool
  • You want a reproducible setup
  • Or maybe there is an existing image that already has what you need (see the sketch below)
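
For example, an existing image can be used directly without installing anything on the HPC (the image below is just an illustration):

    singularity exec docker://python:3.11 python --version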

Why use singularity

  • Verifiable reproducibility and security, using cryptographic signatures, an immutable container image format, and in-memory decryption.
  • Integration over isolation by default. Easily make use of GPUs, high speed networks, parallel filesystems on a cluster or server by default.
  • Mobility of compute. The single file SIF container format is easy to transport and share.
  • A simple, effective security model. You are the same user inside a container as outside, and cannot gain additional privilege on the host system by default.
  • Read more about Security in SingularityCE.

Terminologies

Image
The blueprint/recipe/template needed to create a container environment. The image itself does not do anything.
Container
The running instance of an image, capable of executing commands, running processes, and providing the other services an OS can provide.
Base Image
A minimal operating system image that serves as a starting point for creating new containers.
Build
The process of creating a container image, usually from a set of instructions specified in a file (e.g., a Dockerfile, Singularity Definition file). This includes adding applications, libraries, and necessary files.
Run
The command to start a container from an image, for example docker run in Docker or singularity run in Singularity. It initializes the container's processes.
Execute
A command to run an additional process inside an already running container. For instance, using docker exec to open a shell in a running container.
SIF (Singularity Image Format)
A format used by Singularity containers. It’s a single file that contains the entire container image, making it easier to distribute and deploy. The .sif extension is used on this file type.
Shell
The command that opens an interactive shell session within the container. For example, docker run -it for Docker or singularity shell for Singularity. This allows you to interact with the container’s environment.
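
A short sketch of how run, exec, and shell differ, assuming a lolcow.sif image like the one built later in these slides:

    singularity run lolcow.sif                  # runs the image's %runscript
    singularity exec lolcow.sif cowsay "hello"  # runs one extra command inside the container
    singularity shell lolcow.sif                # opens an interactive shell inside the container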

Build Image

  • A container image is built using the command

    singularity build <options> <file.sif> <target>
    
  • target is any of the following:
    • URI beginning with library:// to build from the Container Library
    • URI beginning with docker:// to build from Docker Hub
    • URI beginning with shub:// to build from Singularity Hub
    • path to an existing container on your local machine
    • path to a directory to build from a sandbox
    • path to a SingularityCE definition file
  • build can produce containers in two different formats, which can be specified as follows:
    • a compressed read-only Singularity Image File (SIF) format, suitable for production (default)
    • a writable (ch)root directory called a sandbox, for interactive development (--sandbox option)
  • Because build accepts an existing container as the target and can output either supported format, you can use it to convert existing containers from one format to another.

Downloading existing container

sudo singularity build lolcow.sif library://lolcow
lolcow.sif
specifies the path and name for your container.
library://lolcow
gives the Container Library URI from which to download.
  • The resulting lolcow.sif is a compressed and read-only SIF.
  • If you want your container in a writable format, use the --sandbox option.
singularity build --sandbox my_sandbox docker://ubuntu:slim
  • Here we also pulled the image from Docker Hub instead of the Container Library.

Accessing the Shell

sudo singularity shell --writable lolcow/

Converting container format

We can convert containers from one format to another

sudo singularity build production.sif development/
production.sif
the output container in sif format
development/
the sandbox container

Examples

  • singularity --debug run library://lolcow
  • singularity run --containall library://lolcow
  • singularity search tensorflow
  • singularity pull library://lolcow
  • singularity pull docker://sylabsio/lolcow
  • singularity build ubuntu.sif library://ubuntu
  • singularity build lolcow.sif docker://sylabsio/lolcow

Definition file

Example 1: Using all the sections

Bootstrap: library
From: ubuntu:22.04
Stage: build

%setup
    touch /file1
    touch ${SINGULARITY_ROOTFS}/file2

%files
    /file1
    /file1 /opt

%environment
    export LISTEN_PORT=54321
    export LC_ALL=C

%post
    apt-get update && apt-get install -y netcat
    NOW=`date`
    echo "export NOW=\"${NOW}\"" >> $SINGULARITY_ENVIRONMENT
%runscript
    echo "Container was created $NOW"
    echo "Arguments received: $*"
    exec echo "$@"

%startscript
    nc -lp $LISTEN_PORT

%test
    grep -q NAME=\"Ubuntu\" /etc/os-release
    if [ $? -eq 0 ]; then
        echo "Container base is Ubuntu as expected."
    else
        echo "Container base is not Ubuntu."
        exit 1
    fi

%labels
    Author myuser@example.com
    Version v0.0.1

%help
    This is a demo container used to illustrate a def file that uses all
    supported sections.

Example 2: A Minimal Example

Bootstrap: docker
From: ubuntu:22.04

%post
    apt-get -y update
    apt-get -y install cowsay lolcat

%environment
    export LC_ALL=C
    export PATH=/usr/games:$PATH

%runscript
    date | cowsay | lolcat
  • To build a SIF container from the definition file
sudo singularity build lolcow.sif lolcow.def
  • Singularity maps the host user into the container, so a normal (non-root) user on the host is also a non-root user inside the container.
  • sudo is required here because elevated privileges are needed to install packages during the build.
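
Once built, the resulting SIF can be run without sudo, for example:

    singularity run lolcow.sif        # pipes the current date through cowsay and lolcat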

Gaining root privilege without sudo

  • Singularity supports root-like privileges without sudo via --remote and --fakeroot; however, --remote requires an enterprise version of Singularity
  • Currently it is not possible for Saliksik HPC users to use Singularity as a superuser
  • Users may file a ticket to request the installation instead

Example 3: Installing PCOMCOT

Bootstrap: library
From: ubuntu:22.04

%files
        ./PCOMCOT/* /opt/pcomcot/


%environment
        export PATH="$PATH:/opt/pcomcot/src_PCOMCOT2.0"

%post
        apt-get update && apt-get upgrade -y
        apt-get install -y build-essential gfortran openmpi-bin openmpi-doc \
        libopenmpi-dev libnetcdf-dev libnetcdff-dev
        make -C /opt/pcomcot/src_PCOMCOT2.0


%labels
        Version: pcomcot-2.0
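
A sketch of how this image might be built and used; the definition file name, image name, and the pcomcot binary name are assumptions, and the exact MPI launch pattern depends on how the OpenMPI inside the container is matched with the host installation:

    # Build on a machine where you have sudo, then copy the SIF to the HPC
    sudo singularity build pcomcot.sif pcomcot.def

    # Inside a Slurm job script (hypothetical binary name and launch pattern)
    srun singularity exec pcomcot.sif pcomcot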