Features

Singularity

Singularity is a container platform designed specifically for scientific computing and high-performance computing (HPC) environments. Unlike Docker, it focuses on integration with existing HPC workflows. Key features include unprivileged container execution, which allows users to build and run containers without root access. Container definitions are written in a simple syntax using definition files, making the platform accessible to scientists without extensive containerization expertise. Containers are built from definition (.def) files into image (.sif) files. A .def file can reference a Docker container repository, so Docker containers can be converted directly into Singularity containers during the build process.

Singularity automatically mounts user directories (volume binding) and other system resources, which eases integration with existing HPC environments. The platform is compatible with traditional HPC schedulers such as Slurm and enables MPI applications to run across multiple nodes. Singularity containers use the host system's kernel and drivers, reducing overhead and security risks. The runtime supports GPU computing and accelerators through native integration with NVIDIA CUDA and other frameworks. The platform also supports reproducible science by allowing researchers to package entire software stacks and their dependencies. Singularity is widely adopted by major research institutions and supercomputing centers for containerized scientific workloads.

Note: The main reason we introduced Singularity is that users can define their own dependencies and access GPU resources without root privileges.

# Check the installed Singularity CE version
singularity --version

Singularity pull process: Docker images can be pulled directly into a .sif file with singularity pull, or referenced in a .def file that singularity build turns into a .sif file. The resulting .sif file can be run as a container or called inside a Slurm job for code execution:

singularity pull tensorflow.sif docker://tensorflow/tensorflow:2.14.0-gpu

cat pytorch.def
Bootstrap: docker
From: pytorch/pytorch:2.1.2-cuda11.8-cudnn8-runtime

export SINGULARITY_CACHEDIR=~/
export SINGULARITY_TMPDIR=~/
singularity build --fakeroot pytorch.sif pytorch.def

Use the --fakeroot option to build the image without sudo privileges.

Note: The current NVIDIA GRID solution of the Slurm PaaS only supports libraries compatible with CUDA version 11.8.
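The definition file above only references a base image. As a minimal sketch of how users might add their own dependencies, the following snippet writes a slightly larger definition file and builds it when Singularity is available. The file name pytorch-custom.def and the pandas package are illustrative assumptions, not part of this platform's setup; %post, %environment and %runscript are standard definition-file sections.

```shell
#!/bin/bash
# Write a definition file that extends the PyTorch base image.
# The file name and the installed package are hypothetical examples.
cat > pytorch-custom.def <<'EOF'
Bootstrap: docker
From: pytorch/pytorch:2.1.2-cuda11.8-cudnn8-runtime

%post
    # Commands here run once at build time inside the container.
    pip install --no-cache-dir pandas

%environment
    # Variables exported here are set every time the container starts.
    export LC_ALL=C

%runscript
    # Executed by "singularity run"; forwards its arguments to Python.
    exec python3 "$@"
EOF

# Build only when the singularity command is present on this machine.
if command -v singularity >/dev/null 2>&1; then
    singularity build --fakeroot pytorch-custom.sif pytorch-custom.def
fi
```

Keeping the dependency installation in %post means the resulting .sif file already contains the packages, so jobs do not need to install anything at run time.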

Note: In some cases the build process might fail if the .sif file name contains a _ character, so please avoid using it.
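One way to follow this advice consistently is to derive the image name from the definition file name and replace underscores on the way. The helper below is a hypothetical sketch (sif_name_for is not a Singularity command), assuming the .sif file should share the base name of its .def file.

```shell
#!/bin/bash
# Hypothetical helper: derive an underscore-free .sif name from a .def file name.
sif_name_for() {
    local def_file="$1"
    local base
    # Strip the directory part and the .def suffix.
    base="$(basename "$def_file" .def)"
    # Replace every underscore with a hyphen.
    printf '%s.sif\n' "${base//_/-}"
}

sif_name_for my_pytorch_env.def   # prints my-pytorch-env.sif
```

The result can then be passed to the build command, for example singularity build --fakeroot "$(sif_name_for my_pytorch_env.def)" my_pytorch_env.def.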

Singularity Job examples:

To run code inside a containerized environment, first create a .batch file and set the #SBATCH parameters according to your needs, then save the file and submit it to the scheduler.

nano test_job.batch
sbatch test_job.batch

Note: In order to use GPUs, the #SBATCH --gres=gpu:nvidia:1 parameter must be used inside the .batch file.

Singularity environment, without allocated GPUs:

#!/bin/bash
#SBATCH --job-name=tensorflow_job      # Job name
#SBATCH --output=tensorflow_job.out    # Standard output log file
#SBATCH --error=tensorflow_job.err     # Standard error log file
#SBATCH --ntasks=1                     # Number of tasks (typically 1)
#SBATCH --nodes=1
#SBATCH --time=01:00:00                 # Maximum runtime of the job (hh:mm:ss)
#SBATCH --partition=batch_cpu_m2.large  # Specify the partition/queue name


# Run a script inside the Singularity container:
srun singularity exec tensorflow.sif python3 tensorflow_code.py

Singularity environment, with allocated GPUs:

#!/bin/bash
#SBATCH --job-name=tensorflow_job      # Job name
#SBATCH --output=tensorflow_job.out    # Standard output log file
#SBATCH --error=tensorflow_job.err     # Standard error log file
#SBATCH --ntasks=1                     # Number of tasks (typically 1)
#SBATCH --nodes=1
#SBATCH --time=01:00:00                 # Maximum runtime of the job (hh:mm:ss)
#SBATCH --partition=batch_gpu_g2.large  # Specify the partition/queue name
#SBATCH --gres=gpu:nvidia:1             # Use GPU for execution


# Run a script inside the Singularity container:
srun singularity exec tensorflow.sif python3 tensorflow_code.py
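Before starting a long GPU run, it can help to confirm that the job actually sees a GPU. The sketch below assumes that the cluster's Slurm GPU plugin exports CUDA_VISIBLE_DEVICES for jobs that request --gres, which is common but not confirmed for this deployment; allocated_gpu_count is a hypothetical helper name.

```shell
#!/bin/bash
# Count the entries in CUDA_VISIBLE_DEVICES (comma-separated indices or UUIDs).
# Prints 0 when the variable is unset or empty.
allocated_gpu_count() {
    local devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    local -a ids
    # Split the list on commas into an array.
    IFS=',' read -r -a ids <<< "$devices"
    echo "${#ids[@]}"
}

if [ "$(allocated_gpu_count)" -eq 0 ]; then
    echo "No GPU visible to this job; check the --gres parameter." >&2
fi
```

Placed in a .batch file before the srun line, this prints a warning to the error log when the job was submitted without a GPU request.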

Singularity - Shell mode

The following commands show how to access the container shell directly from the slurm-master node while emulating a session on a worker node:

srun -p batch_gpu_g2.large_8 --gres=gpu:1 --pty singularity shell pytorch.sif

For volume binding, use the --bind option of Singularity:

srun -p batch_gpu_g2.xlarge_16 --gres=gpu:1 --pty singularity shell --bind $(pwd):/workspace pytorch.sif

INFO:    Setting 'NVIDIA_VISIBLE_DEVICES=all' to emulate legacy GPU binding.
INFO:    Setting --writable-tmpfs (required by nvidia-container-cli)
Singularity>

srun -p batch_gpu_g2.xlarge_16 --gres=gpu:1 --pty singularity shell tensorflow.sif

INFO:    Setting 'NVIDIA_VISIBLE_DEVICES=all' to emulate legacy GPU binding.
INFO:    Setting --writable-tmpfs (required by nvidia-container-cli)
Singularity>

Note: By default, Singularity is deployed with a configuration in which the container uses the allocated GPU. The --nv and --nvccli flags are allowed but not necessary. With the #SBATCH --gres=gpu:nvidia:1 parameter, the configuration above allocates one GPU on each node.

For more details, please visit the List of useful Slurm commands section.

For additional details regarding OCI usage in Singularity Community Edition, please refer to the official documentation.

Programs	Supported versions
CUDA	11.0 - 12.3
PyTorch	1.12.0 - 2.2.x
JupyterLab	3.0 - 4.x
TensorFlow	2.10.0 - 2.15.x
Singularity	4.2.2

Note: The currently used NVIDIA GRID version is 550.144.03. Please use compatible programming libraries and tools.

OMP

OpenMP (Open Multi-Processing) is a specification for parallel programming that provides a set of compiler directives, library routines, and environment variables to support shared-memory multiprocessing. It enables programmers to incrementally parallelize existing sequential code by adding pragmas that instruct the compiler how to distribute work across multiple threads. OpenMP follows a fork-join execution model where a master thread creates a team of parallel threads at designated parallel regions in the code. It is particularly well-suited for multicore and multiprocessor systems where all threads have direct access to the same memory space. OpenMP's popularity stems from its relatively simple implementation that requires minimal code changes while providing significant performance improvements for many applications.

#!/bin/bash
#SBATCH --job-name=omp_job       # Job name
#SBATCH --nodes=1                # Run on a single node
#SBATCH --ntasks=1               # Run a single task
#SBATCH --cpus-per-task=4        # Use 4 CPU cores
#SBATCH --mem=4G                 # Request 4 GB of memory
#SBATCH --time=00:30:00          # Time limit: 30 minutes
#SBATCH --output=%j_output.log   # Standard output log
#SBATCH --error=%j_error.log     # Standard error log

# Set the number of OpenMP threads to match requested CPUs
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Compile the OpenMP program (if needed)
gcc -fopenmp mp.c -o mp

# Run the program
srun ./mp
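Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is requested, so the export above yields an empty value if that parameter is omitted. A minimal sketch of a safer default, assuming a single thread is an acceptable fallback (omp_thread_count is a hypothetical helper name):

```shell
#!/bin/bash
# Print the thread count to use: SLURM_CPUS_PER_TASK when Slurm sets it, otherwise 1.
omp_thread_count() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS="$(omp_thread_count)"
echo "Using ${OMP_NUM_THREADS} OpenMP thread(s)"
```

The same lines also work when the script is run outside Slurm for a quick test, where the program then runs with one thread.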

MPI (Message Passing Interface)

MPI is a standardized communication protocol for parallel computing. It defines a set of routines enabling data exchange and synchronization between processes in distributed memory systems. MPI supports point-to-point communication through send/receive operations and collective communications like broadcast and gather. The standard defines process groups and communicators, allowing flexible process organization. MPI implementations handle routing, buffering, and data conversion automatically. Common implementations include OpenMPI and MPICH. Notable features include:

Blocking and non-blocking communication modes
Derived data types for complex data structures
Process topologies for optimization
One-sided communication operations
Thread safety levels for hybrid parallelism

MPI remains essential in HPC applications, especially in scientific computing and numerical simulations. It enables efficient scaling across thousands of processors while maintaining portability across different architectures and vendors. The standard continues to evolve, with MPI-4.0 adding features for modern hardware and programming models while maintaining backward compatibility. In the current project, OpenMPI and PMIx (an extension library) are deployed alongside Slurm (Simple Linux Utility for Resource Management).

MPI
OpenMPI is an open-source implementation of MPI. To compile an MPI program (written in C in this case), use the mpicc -o test_code test_code.c command. After compilation, you can schedule the program on the Slurm cluster with the srun -n 4 -c 1 ./test_code command, where the -n flag sets the number of processes and the -c flag sets the number of cores for each process.

For additional details regarding the OpenMPI utilization, please refer to the official documentation.

#!/bin/bash
#SBATCH --job-name=mpi_2n_job
#SBATCH --output=mpi_2n.out
#SBATCH --error=mpi_2n.err
#SBATCH --ntasks-per-node=2      # 2 tasks per node
#SBATCH --nodes=2                # Using 2 nodes
#SBATCH --gres=gpu:nvidia:1      # 1 GPU per node
#SBATCH --partition=batch_gpu_g2.large_8  # Specify GPU partition

# Compile the OpenMPI program
mpicc -o mpi_code mpi_code.c

# Run MPI script with 4 total processes (2 per node)
srun -n 4 ./mpi_code
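The process count of 4 in the script above is the product of --nodes and --ntasks-per-node. As a sketch of how to avoid hard-coding it, the count can be derived from the SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE environment variables that Slurm exports inside a batch job when those parameters are given. The helper name total_mpi_tasks is hypothetical, and the fallback of 1 for missing variables is an assumption.

```shell
#!/bin/bash
# Compute nodes x tasks-per-node from Slurm's job environment.
# Each value falls back to 1 when the variable is not set.
total_mpi_tasks() {
    local nodes="${SLURM_JOB_NUM_NODES:-1}"
    local per_node="${SLURM_NTASKS_PER_NODE:-1}"
    echo $(( nodes * per_node ))
}

echo "Launching $(total_mpi_tasks) MPI process(es)"
# Inside the .batch file the value would replace the literal 4:
#   srun -n "$(total_mpi_tasks)" ./mpi_code
```

With this approach, changing --nodes or --ntasks-per-node in the #SBATCH header does not require editing the srun line as well.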

Programs	Supported versions
GCC	13.3.0
OpenMP	4.5
Open MPI	5.0.5

MPI4PY: To run Python code through OpenMPI, no compilation is needed beforehand. The following lines can be inserted into a .batch file for job scheduling:

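#SBATCH --job-name=mpi4py_job      # Job name
#SBATCH --output=mpi4py_job.out    # Standard output log file
#SBATCH --error=mpi4py_job.err     # Standard error log file
#SBATCH --nodes=1                  # MPI4PY jobs are currently limited to one node

# Launch the Python script as MPI tasks
srun -n <number of tasks> -c <CPU cores per task> python3 <python_code.py>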

Note: MPI4PY jobs can currently be scheduled on one node only. If multi-node parameters are used, the job remains stuck in the PENDING state.