Here, we will explore more advanced features of Slurm workloads, including job arrays, multithreading with OpenMP, and multiprocessing with MPI.
Compute clusters such as Andromeda consist of many compute nodes, each of which is essentially a single computer. A node has access to a collection of CPU cores (typically 48 or 64), a pool of RAM (typically 185 GB or 250 GB), and sometimes secondary resources (such as GPUs).
When a job is started, Slurm reserves the requested cores and memory and then executes the user's code. For example, the system will reserve a single core and some memory for a simple batch script like the following.
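This is a minimal sketch of such a script; the job name, memory request, time limit, and echo payload are illustrative, not Andromeda-specific defaults:

```bash
#!/bin/bash
#SBATCH --job-name=serial_test   # name shown in the queue
#SBATCH --ntasks=1               # a single task (process)
#SBATCH --cpus-per-task=1        # one core for that task
#SBATCH --mem=1G                 # a small memory reservation
#SBATCH --time=00:05:00          # five-minute wall-clock limit

# A trivial serial payload; replace with your own program.
echo "Running on $(hostname)"
```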
By default, “plain” code is executed serially: the processor executes individual instructions one at a time, in sequence. In this scenario, the execution time of the program can be estimated as the number of instructions divided by the CPU clock speed; for example, a loop executing on the order of 10^10 simple instructions on a ~3 GHz core takes a few seconds. The clock speed of a compute cluster's processors is generally not much higher than that of a laptop processor, so to see a significant speedup you will need to distribute your computational tasks across multiple threads (or even nodes).
Most programming languages have built-in options for parallelization. For instance, for-loops in MATLAB can be replaced with parfor-loops, which execute loop iterations in parallel. Keep in mind that parallel execution can be implemented in several ways:
- By running multiple instances of the same single-threaded program, perhaps with different initial conditions. This is the so-called “embarrassingly parallel” paradigm. It typically requires minimal adaptation of single-threaded code and can be executed effectively with Slurm using job arrays, as detailed below.
- By running a multi-threaded program, in which multiple threads share access to the same memory. Typically, code must be written with multi-threading in mind using, for example, OpenMP.
- By running a multi-process program, e.g., with MPI or Python's multiprocessing package.
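As a concrete illustration of the embarrassingly parallel paradigm, here is a sketch of a job array script. The array range and the process_file.py payload are assumptions for illustration; each array element is an independent single-core job:

```bash
#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --ntasks=1               # each array element is one task...
#SBATCH --cpus-per-task=1        # ...on a single core
#SBATCH --mem=1G
#SBATCH --time=00:10:00
#SBATCH --array=1-10             # launch 10 independent copies

# Each element runs the same program on a different input,
# selected via the SLURM_ARRAY_TASK_ID environment variable.
python process_file.py input_${SLURM_ARRAY_TASK_ID}.dat
```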
In OpenMP, threads share the same resources and access a common memory space. With MPI, by contrast, each process has its own memory space and executes independently of the other processes.
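For a multi-threaded (OpenMP) program, the Slurm request asks for one task with several cores. The following is a sketch, assuming a compiled OpenMP binary named ./omp_program; the core count and other resource values are illustrative:

```bash
#!/bin/bash
#SBATCH --job-name=openmp_demo
#SBATCH --ntasks=1               # one task (process)...
#SBATCH --cpus-per-task=8        # ...with eight cores for its threads
#SBATCH --mem=4G
#SBATCH --time=00:30:00

# Tell the OpenMP runtime to use exactly the cores Slurm allocated.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./omp_program
```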
Parallel programs may utilize multiple processes or multiple threads (or both). A process, also known as a task, is simply an instance of a program being executed. A process has access to a number of threads, which share the resources allocated by the parent process. A multi-threaded program is a single task that uses several threads; by contrast, an embarrassingly parallel job has many tasks, each of which uses only a single thread. A single task cannot be split across multiple compute nodes, but a multi-process job can be spread across nodes. Tasks running on different nodes can communicate with each other using OpenMPI, which is provided as a module on Andromeda (‘module load openmpi’).
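A sketch of a multi-node MPI job follows. The ‘module load openmpi’ line comes from the text above; the binary name ./mpi_program, the task count, and the node count are assumptions for illustration:

```bash
#!/bin/bash
#SBATCH --job-name=mpi_demo
#SBATCH --ntasks=16              # sixteen MPI processes...
#SBATCH --nodes=2                # ...spread across two nodes
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=01:00:00

module load openmpi              # provided on Andromeda

# srun launches one process per task and handles placement across
# nodes; depending on how the site's MPI is configured, you may use
# mpirun -np $SLURM_NTASKS ./mpi_program instead.
srun ./mpi_program
```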