The Slurm job scheduler uses a multi-factor system to calculate each job’s priority in the queue, ensuring equitable and fair access to the cluster’s shared resources. It is not a FIFO (first in, first out) system. Instead, the job owner’s recent activity on the cluster (fair-share value), the resources the job requests, and how long the job needs to run are all weighed proportionately to produce an overall priority value that determines the job’s position in the queue. This prioritized queue is then augmented by the backfill plugin, which looks for gaps in the resource allocations where smaller, shorter jobs can squeeze in without delaying the expected start time of the other jobs in the queue.
To give your job the best chance of being scheduled as soon as possible, request only slightly more time and resources than it actually needs. A tightly sized request makes it far more likely that the backfill plugin can find a gap where your job fits and run it early.
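As a sketch, a submission script for a roughly one-hour, four-core task might request resources like this (the job name, partition, and numbers are illustrative, not Andromeda defaults):

#!/bin/bash
#SBATCH --job-name=postproc        # illustrative name
#SBATCH --partition=shared         # illustrative partition
#SBATCH --nodes=1
#SBATCH --ntasks=4                 # request only the cores you need
#SBATCH --mem=8G                   # illustrative memory request
#SBATCH --time=01:10:00            # ~1 hour of work plus a small buffer

srun ./my_analysis                 # hypothetical executable

The tight --time value is what lets backfill slot this job into a short gap that a 24-hour request could never fit.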
The more jobs you submit within a one-to-two-week window, the lower your fair-share value drops, in proportion to the “billing” weight of the submitted jobs. If you submit few or no jobs, your fair-share value recovers over time. Each pending job’s priority is recalculated a few times per hour to reflect changes in the owner’s fair-share value, and any job waiting in the queue accrues an age factor that raises its priority the longer it remains pending.
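You can check your current fair-share factor with the sshare utility; the FairShare column holds the 0.0–1.0 value that feeds into the priority formula below (values near 1.0 indicate little recent usage relative to your share, values near 0.0 heavy usage):

[johnchris@l001 ~]$ sshare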
Job Priority and Fairshare Breakdown
1. Priority Key Factors
The priority of a job is determined by several sub-factors, including:
- Fair-Share Factor: This reflects the user’s allocated share of resources relative to their historical usage; heavy recent usage lowers it.
- Age Factor: This gives higher priority to jobs that have been waiting in the queue longer.
- Job Size Factor: This considers the number of nodes or processors a job requests.
- Partition Factor: This takes into account the partition (or queue) the job is submitted to, which may have different priorities.
2. Priority Formula
SLURM uses the following formula to calculate the priority for each job:
Job_priority =
(PriorityWeightAge) * (age_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightPartition) * (partition_factor) +
(PriorityWeightQOS) * (QOS_factor) +
(possibly some more advanced factors that are not relevant for Andromeda)
All the factors in this formula are floating-point numbers between 0.0 and 1.0, while the weights are integer values that determine how much influence each factor has on the final priority.
3. Job Priority Calculation on Andromeda
On Andromeda, we use the following weights in our SLURM configuration:
[johnchris@l001 ~]$ sprio -w
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION
Weights 1 500000 1000000 250000 1000
PriorityWeightAge = 500000
PriorityWeightFairShare = 1000000
PriorityWeightJobSize = 250000
PriorityWeightPartition = 1000
This means that a job’s priority is determined mainly by fair-share, waiting age, and job size, with a smaller influence from the partition it was submitted to.
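The same weights can also be read directly from the scheduler configuration; on a standard Slurm installation, something like the following works (output omitted here):

[johnchris@l001 ~]$ scontrol show config | grep -i ^Priority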
SLURM job priorities can be queried using the sprio utility. Below, the -S ‘-Y’ option sorts by priority in descending order:
[johnchris@l001 ~]$ sprio -S '-Y'
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION
2458888 shared 1045915 0 500000 544348 568 1000
2486756 exclusive 536587 0 500000 22609 12979 1000
2486757 exclusive 536587 0 500000 22609 12979 1000
2486759 exclusive 536587 0 500000 22609 12979 1000
2486760 exclusive 536587 0 500000 22609 12979 1000
2486761 exclusive 536587 0 500000 22609 12979 1000
2486818 exclusive 534228 0 500000 22609 10619 1000
We can see that Job ID 2458888 has the highest priority: 500000 (AGE) + 544348 (FAIRSHARE) + 568 (JOBSIZE) + 1000 (PARTITION) ≈ 1045915. (The displayed total can differ from the component sum by one because the individual factors are rounded.)
4. Backfill
In addition to the main scheduling cycle, in which jobs are started in order of priority as resources become available, all jobs are also considered for “backfill.” Backfill allows lower-priority jobs to start before higher-priority jobs, provided this does not delay the expected start of any higher-priority job. For example, if the highest-priority job needs 20 nodes with 48 cores each and those resources will not free up for another 26 hours, SLURM can meanwhile run a low-priority job that needs only a couple of cores for an hour, since it will finish well before the large job is due to start.
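To see when the scheduler currently expects your pending jobs to start (the estimate that backfill is not allowed to push back), you can use the --start option of squeue; note that the START_TIME column may show N/A until the backfill scheduler has evaluated the job:

[johnchris@l001 ~]$ squeue -u $USER --start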
5. Fairshare
Fairshare is a scheduling policy designed to ensure equitable access to computing resources among users or groups. It balances resource allocation based on historical usage and predefined priorities.
Our cluster is not a FIFO (first in, first out) system. Instead, it ensures equitable access to computing resources for all groups and users over time.
So be careful when running jobs and remember to request only what you need (i.e., CPU cores, memory, GPUs, and time). Requesting excess cluster resources will lower the priority of your subsequent jobs.
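A practical way to right-size future requests is to check what a finished job actually consumed, for example with sacct (the job ID is a placeholder):

[johnchris@l001 ~]$ sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,MaxRSS,ReqMem,State

Comparing Elapsed against the requested wall time and MaxRSS against the requested memory shows how much slack you can safely trim.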
For more information, please visit the following links.
https://slurm.schedmd.com/classic_fair_share.html
https://slurm.schedmd.com/SLUG19/Priority_and_Fair_Trees.pdf