
How to use Slurm with Saliksik HPC


COARE Operations

Topics to Discuss

  • Slurm Commands to Remember
  • Jobs and Job Steps
  • Serial and Parallel Jobs
  • Sbatch vs Srun
  • Slurm Resource Management
  • Scaling
  • Working with Anaconda
  • Working with Singularity

Slurm Commands to Remember

  • sbatch
  • squeue
  • srun
  • scancel
  • sacct
  • sinfo
  • salloc
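
A quick sketch of typical usage; the job script name and job ID below are placeholders:

    sbatch job.slurm                  # submit a job script for later execution
    squeue -u $USER                   # list your pending and running jobs
    srun --ntasks=4 hostname          # launch a command as a (parallel) job step
    scancel 123456                    # cancel job 123456
    sacct -j 123456                   # show accounting info for a past job
    sinfo                             # show partitions and node states
    salloc --ntasks=1 --time=00:30:00 # request an interactive allocation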

Serial and Parallel Jobs

Serial Jobs
These jobs run tasks one after another on a single processor.
  • They are suitable for simpler computations that don’t require much processing power or speed.
  • Think of it like doing one task at a time in sequence.
Parallel Jobs
These jobs split a task into smaller sub-tasks that run simultaneously on multiple processors.
  • This approach is ideal for more complex computations that need faster processing and more power.
  • It’s like having several people working on different parts of a project at the same time to get it done faster.
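
As a rough sketch (my_program and its input files are placeholders), the same work could be run either way:

    # Serial: one task on one CPU, steps run one after another
    srun --ntasks=1 ./my_program input1.dat
    srun --ntasks=1 ./my_program input2.dat

    # Parallel: four copies of the task run at the same time on the allocated CPUs
    srun --ntasks=4 ./my_program input.dat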

Sbatch vs Srun

sbatch
is for jobs that will be run at a later time
  • This command is used to submit a batch script to Slurm
  • Non-blocking
  • Schedules the job for later execution
srun
is used to run parallel jobs on a cluster managed by Slurm
  • Interactive
  • Blocking
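
A minimal comparison; the script and program names are placeholders:

    # sbatch: non-blocking; the script is queued and runs later
    sbatch analysis.slurm             # returns immediately with a job ID

    # srun: blocking and interactive; output streams to your terminal
    srun --ntasks=2 --time=00:10:00 python analyze.py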

Jobs, Job Steps, and Job Scripts

Think of using Slurm as if you are preparing a meal.

The meal that you are going to cook is the job.

The individual steps to accomplish it are your job steps.

These include unpacking the ingredients, slicing the vegetables, boiling the water, and so on.

Continuation… Part 1/3

But I’m a busy person…got no time for this :(

Can I outsource this and let the Mecha Chefs (compute nodes) do the cooking for me? :)

Continuation… Part 2/3

Well Yes! That’s the purpose of SLURM.

All we need is a text file that contains the necessary ingredients and procedures; we call this the job script (a minimal sketch follows below).

Let the SLURMagic do it for you!
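
A minimal sketch of such a job script; the partition name and the step scripts are placeholders, and each srun line is one job step:

    #!/bin/bash
    #SBATCH --job-name=cook-meal      # the job (the meal)
    #SBATCH --partition=batch         # placeholder partition name
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=2
    #SBATCH --mem=4G
    #SBATCH --time=01:00:00
    #SBATCH --output=meal_%j.log

    # the job steps (the individual cooking steps)
    srun ./prepare_ingredients.sh
    srun ./cook.sh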

Continuation… Part 3/3

But I have specific requirements for which Mecha Chef (compute node) should do the job for me…

We need to talk about that!

Slurm Resource Management

  • How does Slurm manage resources?
    • Selection of nodes
    • Allocation of CPUs from the selected nodes
    • Distribution of tasks to the selected nodes
    • Optional distribution and binding of tasks to CPUs within a node

What affects the selection?

Option Description
-c, --cpus-per-task <n> Controls the number of CPUs allocated per task
--exclusive Prevents sharing of allocated nodes with other jobs; suballocates CPUs to job steps
-N, --nodes <min[-max]> Controls the minimum/maximum number of nodes allocated to the job
-F, --nodefile <filename> File containing a list of specific nodes to be selected for the job (salloc and sbatch only)
--mincpus <n> Controls the minimum number of CPUs allocated per node

What affects the selection? (continued)

Option Description
--ntasks-per-core Controls the maximum number of tasks per allocated core
--ntasks-per-node <number> Controls the maximum number of tasks per allocated node
-O, --overcommit Allows fewer CPUs to be allocated than the number of tasks
-p, --partition <partition_names> Controls which partition is used for the job
-w, --nodelist <host1,...,hostN or filename> List of specific nodes to be allocated to the job
-x, --exclude <host1,...,hostN or filename> List of specific nodes to be excluded from allocation to the job
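
A sketch of how several of these options combine in a job script; the partition and node names are hypothetical:

    #!/bin/bash
    #SBATCH --nodes=2                 # -N: number of nodes
    #SBATCH --ntasks-per-node=4       # at most 4 tasks on each allocated node
    #SBATCH --cpus-per-task=2         # -c: CPUs allocated per task
    #SBATCH --partition=batch         # -p: placeholder partition name
    #SBATCH --exclude=node01,node02   # -x: hypothetical nodes to skip

    srun ./my_parallel_program        # 2 nodes x 4 tasks = 8 tasks, 2 CPUs each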

Scaling

Resource Availability
Smaller resource requests are more likely to be fulfilled quickly since they fit into available slots on the cluster. Larger requests can take longer to schedule.
Scalability Testing
By starting small, you can test if your job runs successfully. If it does, you can then request more resources incrementally, ensuring you don’t over-allocate.
Resource Optimization
It helps in identifying the minimal resource requirement for your job, preventing waste and ensuring other jobs can also access resources.

Testing Phase

First submission
You request 1 CPU and 2GB of memory to ensure the script runs without errors.
Once successful
You might increase it to 2 CPUs and 4GB of memory to improve performance.
Final request
Once you’re confident, you could scale up to 4 CPUs and 8GB of memory for even faster processing.
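
In a job script these would be successive versions of the same directives (one pair per submission, not all in one script):

    # First submission
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=2G

    # Once successful
    #SBATCH --cpus-per-task=2
    #SBATCH --mem=4G

    # Final request
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=8G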

Resource Optimization

Start small
Request 2 CPUs and 4GB of memory.
Monitor performance
Check if the job completes successfully and analyze its resource usage.
Adjust accordingly
If the job requires more power, scale up to 8 CPUs and 16GB of memory. Conversely, if it used less than expected, you can scale down for the next run.
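
One way to check actual usage is Slurm accounting, for example (123456 is a placeholder job ID; MaxRSS is reported per job step):

    sacct -j 123456 --format=JobID,Elapsed,AllocCPUS,ReqMem,MaxRSS,State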

Anaconda in HPC 🐍

  • Anaconda works the same way as on your personal or work computer
  • It needs a bit of configuration to select where environments and packages are stored in a managed-disk environment like the Saliksik HPC

Overview

Anaconda, a.k.a. Conda, is a Python package manager that simplifies the installation of scientific libraries and environment management.

Configuration

  • By default, Conda stores environment and package files in these paths:
    • ~/.conda/envs
    • ~/.conda/pkgs
  • Changing the envs and pkgs locations

    • Command line
    conda config --add envs_dirs ~/scratch3/conda/envs
    conda config --add pkgs_dirs ~/scratch3/conda/pkgs

    • Manually modify the ~/.condarc

      envs_dirs:
          - ~/scratch3/conda/envs
      pkgs_dirs:
          - ~/scratch3/conda/pkgs
      
  • You need to manually create the envs and pkgs directories

    mkdir -p ~/scratch3/conda/envs ~/scratch3/conda/pkgs
    

Why do we want to configure where Conda puts these files?

  • The Saliksik HPC uses multiple volumes to handle its massive storage requirements
  • The home directory has a relatively small capacity; scratch1, scratch2, and especially the scratch3 volume can hold much more data
  • These volumes are linked in your home directory, e.g. /home/your-username/scratch1 -> /scratch1/your-username. The same goes for scratch2 and scratch3.
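
You can see the links from your home directory, for example:

    ls -ld ~/scratch1 ~/scratch2 ~/scratch3
    # e.g. ~/scratch3 -> /scratch3/your-username (and similarly for the others)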

Conda Channels

Add the 'conda-forge' and 'bioconda' channels as a backup to 'defaults':

  • conda config --append channels conda-forge
  • conda config --append channels bioconda
  • It's important to keep the defaults channel since some packages depend on it.
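
You can confirm the resulting channel list with:

    conda config --show channels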

Creating an environment

  • To create an environment named myenv
    • conda create -n myenv
    • conda create -n myenv python=3.8 # use Python 3.8
  • If envs_dirs is not set, the path can be given during creation

    conda create --prefix ~/scratch3/conda/envs/myenv   
    

Search in the default channel

    conda search package_name
    conda search numpy
  • Search in a specific channel

    conda search -c conda-forge package_name
    conda search -c bioconda biopython
    

Activate and Install packages

  • If you have an environment named myenv, activate it with
conda activate myenv
  • Once activated, you can install packages with
conda install <package_name>

Install from a specific channel:

conda install -c <channel_name> <package_name>

Singularity

What is singularity?

  • Singularity is a container platform; a container holds all the software and tools your program needs to run.
  • It ensures that your program runs the same way, no matter where you run it.
  • The file system inside a singularity container is isolated from the Host OS.

Use case

  • You want to use some libraries that are not available on the HPC
  • You want to use a different version of a library
  • You want to use some tools that are not available on the HPC
  • You want to use a different version of a tool
  • You want a reproducible setup
  • Or maybe there is an existing image that already has what you need (see the sketch below)
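
For example, an existing image can be used directly without installing anything on the HPC (the image below is just an illustration):

    singularity exec docker://python:3.11 python --version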

Why use singularity

  • Verifiable reproducibility and security, using cryptographic signatures, an immutable container image format, and in-memory decryption.
  • Integration over isolation by default. Easily make use of GPUs, high speed networks, parallel filesystems on a cluster or server by default.
  • Mobility of compute. The single file SIF container format is easy to transport and share.
  • A simple, effective security model. You are the same user inside a container as outside, and cannot gain additional privilege on the host system by default.
  • Read more about Security in SingularityCE.

Terminologies

Image
The blueprint/recipe/template needed to create a container environment. The image itself does not do anything.
Container
The running instance of an image, capable of executing commands, running processes, and providing the other services an OS can provide.
Base Image
A minimal operating system image that serves as a starting point for creating new containers.
Build
The process of creating a container image, usually from a set of instructions specified in a file (e.g., a Dockerfile, Singularity Definition file). This includes adding applications, libraries, and necessary files.
Run
The command to start a container from an image, for example docker run in Docker or singularity run in Singularity. It initializes the container's processes.
Execute
A command to run an additional process inside an already running container. For instance, using docker exec to open a shell in a running container.
SIF (Singularity Image Format)
A format used by Singularity containers. It’s a single file that contains the entire container image, making it easier to distribute and deploy. The .sif extension is used on this file type.
Shell
The command that opens an interactive shell session within the container. For example, docker run -it for Docker or singularity shell for Singularity. This allows you to interact with the container’s environment.
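
A short sketch of how run, exec, and shell differ, assuming a lolcow.sif image like the one built later in these slides:

    singularity run lolcow.sif                  # runs the image's %runscript
    singularity exec lolcow.sif cowsay "hello"  # runs one extra command inside the container
    singularity shell lolcow.sif                # opens an interactive shell inside the container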

Build Image

  • A container image is built using the command

    singularity build <options> <file.sif> <target>
    
  • target is any of the following:
    • URI beginning with library:// to build from the Container Library
    • URI beginning with docker:// to build from Docker Hub
    • URI beginning with shub:// to build from Singularity Hub
    • path to an existing container on your local machine
    • path to a directory to build from a sandbox
    • path to a SingularityCE definition file
  • build can produce containers in two different formats, which can be specified as follows:
    • a compressed read-only Singularity Image File (SIF) format, suitable for production (default)
    • a writable (ch)root directory called a sandbox, for interactive development (--sandbox option)
  • Because build accepts an existing container as the target and can output either supported format, you can use it to convert existing containers from one format to another.

Downloading existing container

sudo singularity build lolcow.sif library://lolcow
lolcow.sif
specifies the path and name for your container.
library://lolcow
gives the Container Library URI from which to download.
  • The resulting lolcow.sif is a compressed and read-only SIF.
  • If you want your container in a writable format, use the --sandbox option.
singularity build --sandbox my_sandbox docker://ubuntu:slim
  • Here we also pulled the image from Docker Hub instead of the Container Library.

Accessing the Shell

sudo singularity shell --writable lolcow/

Converting container format

We can convert containers from one format to another

sudo singularity build production.sif development/
production.sif
the output container in sif format
development/
the sandbox container

Examples

  • singularity --debug run library://lolcow
  • singularity run --containall library://lolcow
  • singularity search tensorflow
  • singularity pull library://lolcow
  • singularity pull docker://sylabsio/lolcow
  • singularity build ubuntu.sif library://ubuntu
  • singularity build lolcow.sif docker://sylabsio/lolcow

Definition file

Example 1: Using all the sections

Bootstrap: library
From: ubuntu:22.04
Stage: build

%setup
    touch /file1
    touch ${SINGULARITY_ROOTFS}/file2

%files
    /file1
    /file1 /opt

%environment
    export LISTEN_PORT=54321
    export LC_ALL=C

%post
    apt-get update && apt-get install -y netcat
    NOW=`date`
    echo "export NOW=\"${NOW}\"" >> $SINGULARITY_ENVIRONMENT
%runscript
    echo "Container was created $NOW"
    echo "Arguments received: $*"
    exec echo "$@"

%startscript
    nc -lp $LISTEN_PORT

%test
    grep -q NAME=\"Ubuntu\" /etc/os-release
    if [ $? -eq 0 ]; then
        echo "Container base is Ubuntu as expected."
    else
        echo "Container base is not Ubuntu."
        exit 1
    fi

%labels
    Author myuser@example.com
    Version v0.0.1

%help
    This is a demo container used to illustrate a def file that uses all
    supported sections.

Example 2: A Minimal Example

Bootstrap: docker
From: ubuntu:22.04

%post
    apt-get -y update
    apt-get -y install cowsay lolcat

%environment
    export LC_ALL=C
    export PATH=/usr/games:$PATH

%runscript
    date | cowsay | lolcat
  • To build a SIF container from the definition file
sudo singularity build lolcow.sif lolcow.def
  • Singularity maps the host user into the container, so a normal (non-root) user on the host is also a non-root user inside the container.
  • sudo is required here because elevated privileges are needed to install packages during the build.
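
Once built, the resulting SIF can be run without sudo, for example:

    singularity run lolcow.sif        # pipes the current date through cowsay and lolcat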

Gaining root privilege without sudo

  • Singularity supports root-like privileges without sudo via --remote and --fakeroot; however, --remote requires an enterprise version of Singularity
  • Currently it is not possible for Saliksik HPC users to use Singularity as a superuser
  • Users may file a ticket to request the installation instead

Example 3: Installing PCOMCOT

Bootstrap: library
From: ubuntu:22.04

%files
        ./PCOMCOT/* /opt/pcomcot/


%environment
        export PATH="$PATH:/opt/pcomcot/src_PCOMCOT2.0"

%post
        apt-get update && apt-get upgrade -y
        apt-get install -y build-essential gfortran openmpi-bin openmpi-doc \
        libopenmpi-dev libnetcdf-dev libnetcdff-dev
        make -C /opt/pcomcot/src_PCOMCOT2.0


%labels
        Version: pcomcot-2.0
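
A sketch of how this image might be built and used; the definition file name, image name, and the pcomcot binary name are assumptions, and the exact MPI launch pattern depends on how the OpenMPI inside the container is matched with the host installation:

    # Build on a machine where you have sudo, then copy the SIF to the HPC
    sudo singularity build pcomcot.sif pcomcot.def

    # Inside a Slurm job script (hypothetical binary name and launch pattern)
    srun singularity exec pcomcot.sif pcomcot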