The Slurm job scheduler uses a multi-factor system to calculate each job’s priority in the queue, ensuring equitable and fair access to the cluster’s shared resources. It is not a FIFO (first in, first out) system. Instead, the job owner’s recent activity on the cluster (fair-share value), the resources the job requests, and how long the job needs to run are all weighed proportionately to produce an overall priority value that determines the job’s position in the queue. This prioritized queue is then augmented by the backfill plugin, which looks for gaps in the resource allocations where smaller, shorter jobs can squeeze in without delaying the expected start time of the other jobs in the queue.
To give your job the best chance of being scheduled as soon as possible, request only slightly more time and resources than it actually needs. A tightly sized request makes it far more likely that the backfill plugin can find a gap where your job fits and run it early.
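As a sketch, a submission script for a roughly one-hour, four-core task might request resources like this (the job name, partition, and numbers are illustrative, not Andromeda defaults):

#!/bin/bash
#SBATCH --job-name=postproc        # illustrative name
#SBATCH --partition=shared         # illustrative partition
#SBATCH --nodes=1
#SBATCH --ntasks=4                 # request only the cores you need
#SBATCH --mem=8G                   # illustrative memory request
#SBATCH --time=01:10:00            # ~1 hour of work plus a small buffer

srun ./my_analysis                 # hypothetical executable

The tight --time value is what lets backfill slot this job into a short gap that a 24-hour request could never fit.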
The more jobs you submit within a one-to-two-week window, the lower your fair-share value drops, in proportion to the “billing” weight of the submitted jobs. If you submit few or no jobs, your fair-share value recovers over time. Each pending job’s priority is recalculated a few times per hour to reflect changes in the owner’s fair-share value, and any job waiting in the queue accrues an age factor that raises its priority the longer it remains pending.
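You can check your current fair-share factor with the sshare utility; the FairShare column holds the 0.0–1.0 value that feeds into the priority formula below (values near 1.0 indicate little recent usage relative to your share, values near 0.0 heavy usage):

[johnchris@l001 ~]$ sshare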
Job Priority and Fairshare Breakdown
1. Priority Key Factors
The priority of a job is determined by several sub-factors, including:
- Fair-Share Factor: This reflects the user’s allocated share of resources relative to their historical usage; heavy recent usage lowers it.
- Age Factor: This gives higher priority to jobs that have been waiting in the queue longer.
- Job Size Factor: This considers the number of nodes or processors a job requests.
- Partition Factor: This takes into account the partition (or queue) the job is submitted to, which may have different priorities.
2. Priority Formula
SLURM uses the following formula to calculate the priority for each job:
Job_priority =
(PriorityWeightAge) * (age_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightPartition) * (partition_factor) +
(PriorityWeightQOS) * (QOS_factor) +
(possibly some more advanced factors that are not relevant for Andromeda)
All the factors in this formula are floating-point numbers between 0.0 and 1.0, while the weights are integer values that determine how much influence each factor has on the final priority.
3. Job Priority Calculation on Andromeda
On Andromeda, we use the following weights in our SLURM configuration:
[johnchris@l001 ~]$ sprio -w
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION
Weights 1 500000 1000000 250000 1000
PriorityWeightAge = 500000
PriorityWeightFairShare = 1000000
PriorityWeightJobSize = 250000
PriorityWeightPartition = 1000
This means that a job’s priority is determined mainly by fair-share, waiting age, and job size, with a smaller influence from the partition it was submitted to.
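The same weights can also be read directly from the scheduler configuration; on a standard Slurm installation, something like the following works (output omitted here):

[johnchris@l001 ~]$ scontrol show config | grep -i ^Priority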
SLURM job priorities can be queried using the sprio utility. Below, the -S ‘-Y’ option sorts by priority in descending order:
[johnchris@l001 ~]$ sprio -S '-Y'
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION
2458888 shared 1045915 0 500000 544348 568 1000
2486756 exclusive 536587 0 500000 22609 12979 1000
2486757 exclusive 536587 0 500000 22609 12979 1000
2486759 exclusive 536587 0 500000 22609 12979 1000
2486760 exclusive 536587 0 500000 22609 12979 1000
2486761 exclusive 536587 0 500000 22609 12979 1000
2486818 exclusive 534228 0 500000 22609 10619 1000
We can see that Job ID 2458888 has the highest priority: 500000 (AGE) + 544348 (FAIRSHARE) + 568 (JOBSIZE) + 1000 (PARTITION) ≈ 1045915. (The displayed total can differ from the component sum by one because the individual factors are rounded.)
4. Backfill
In addition to the main scheduling cycle, in which jobs are started in order of priority as resources become available, all jobs are also considered for “backfill.” Backfill allows lower-priority jobs to start before higher-priority jobs, provided this does not delay the expected start of any higher-priority job. For example, if the highest-priority job needs 20 nodes with 48 cores each and those resources will not free up for another 26 hours, SLURM can meanwhile run a low-priority job that needs only a couple of cores for an hour, since it will finish well before the large job is due to start.
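To see when the scheduler currently expects your pending jobs to start (the estimate that backfill is not allowed to push back), you can use the --start option of squeue; note that the START_TIME column may show N/A until the backfill scheduler has evaluated the job:

[johnchris@l001 ~]$ squeue -u $USER --start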
5. Fairshare
Fairshare is a scheduling policy designed to ensure equitable access to computing resources among users or groups. It balances resource allocation based on historical usage and predefined priorities.
Our cluster is not a FIFO (first in, first out) system. Instead, it ensures equitable access to computing resources for all groups and users over time.
So be careful when running jobs and remember to request only what you need (i.e., CPU cores, memory, GPUs, and time). Requesting excess cluster resources will lower the priority of your subsequent jobs.
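A practical way to right-size future requests is to check what a finished job actually consumed, for example with sacct (the job ID is a placeholder):

[johnchris@l001 ~]$ sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,MaxRSS,ReqMem,State

Comparing Elapsed against the requested wall time and MaxRSS against the requested memory shows how much slack you can safely trim.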
For more information, please visit the following links.
https://slurm.schedmd.com/classic_fair_share.html
https://slurm.schedmd.com/SLUG19/Priority_and_Fair_Trees.pdf