The Slurm job scheduler used on the Andromeda HPC Cluster has a “preemption” feature that allows a running job to be preempted by a newly submitted job when that new job meets certain criteria. Andromeda now has a special partition where preemption has been enabled. This allows groups with dedicated resources to share those resources with the community when they’re idle, while still maintaining priority access to those resources for themselves.
*As of June 2025, only Slurm jobs submitted to the “preempt” partition can be preempted, and only by jobs submitted to the dedicated partition of the PI or group that owns the dedicated hardware.
How Preemption Works:
Let’s say our test user “johnchris” wants to submit a job to the cluster when there are no nodes available in the time-based partitions, but there are idle resources in the “preempt” partition. Or our test user might specifically want to test a job on special hardware that was purchased by a PI and isn’t available on the community nodes in the time-based partitions. In either case, our test user can submit the job to the “preempt” partition:
sbatch -p preempt -t 0-04:00:00 test.sl
NOTE: The maximum time that can be requested is 4 hours.
Assuming the cores, memory, and/or GPUs specified in test.sl are available, the job will start on the first available node in the “preempt” partition with those resources.
What happens if the PI or their group submits a job to their dedicated partition?
This is where the preemption happens. The running job that johnchris submitted would be “requeued” to wait for resources to become available again, allowing the newly submitted job in the dedicated partition to start quickly.
NOTE: There is a 5-minute grace time. It allows the running job in the “preempt” partition to see that it is about to be preempted and to perform any necessary checkpointing, so that it can restart as close to where it left off as possible when idle resources are available again.
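When a requeued job restarts, its batch script runs again from the top, so it can be useful for the script to know whether it is a first run or a restart. The sketch below is one possible way to check this, assuming Slurm sets the SLURM_RESTART_COUNT environment variable for requeued jobs (it is unset on a first run). The function name describe_run is hypothetical and only for illustration.

```shell
#!/bin/bash
# describe_run is a hypothetical helper: it takes a restart count and
# prints whether this looks like a first run or a restart after requeue.
describe_run() {
    local count="${1:-0}"
    if [ "$count" -gt 0 ]; then
        echo "requeued $count time(s)"
    else
        echo "first run"
    fi
}

# Assumption: SLURM_RESTART_COUNT is set by Slurm only after a requeue,
# so fall back to 0 when it is missing.
describe_run "${SLURM_RESTART_COUNT:-0}"
```

A job script could use the same check to decide whether to start fresh or to look for previously saved state.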
What is checkpointing?
Checkpointing is any process used to save the runtime state of a job so it can be restarted as close to where it left off as possible.
Some software will automatically restart where it left off. Some software needs to receive a certain “signal” to know when to checkpoint, or may need to be launched a certain way to enable the feature. Some software cannot use checkpointing directly and requires custom wrapper scripts to perform the necessary tasks. In other cases, checkpointing may not be possible at all. For assistance, please submit a request via https://bc.edu/ressearchhelp.
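For software that has no built-in checkpointing, a wrapper in the batch script can save and restore progress itself. The following is a minimal sketch, not a definitive implementation. The file name checkpoint.txt, the variables CKPT_FILE and TOTAL_STEPS, and the functions save_checkpoint and load_checkpoint are all hypothetical, and the loop body stands in for real work. It also assumes that the batch shell receives SIGTERM at the start of the grace time and that --requeue is permitted; whether that holds depends on how the cluster is configured, so the script saves after every step as well.

```shell
#!/bin/bash
#SBATCH -p preempt
#SBATCH -t 0-04:00:00
#SBATCH --requeue

# Hypothetical names: replace with your application's own state and size.
CKPT_FILE="${CKPT_FILE:-checkpoint.txt}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"

# Write the last completed step number to the checkpoint file.
save_checkpoint() {
    echo "$1" > "$CKPT_FILE"
}

# Print the last completed step, or 0 if no checkpoint exists yet.
load_checkpoint() {
    if [ -f "$CKPT_FILE" ]; then
        cat "$CKPT_FILE"
    else
        echo 0
    fi
}

# Resume from the saved step (0 on a first run).
step=$(load_checkpoint)

# Assumption: SIGTERM arrives when preemption begins; save and exit cleanly.
trap 'save_checkpoint "$step"; exit 0' TERM

while [ "$step" -lt "$TOTAL_STEPS" ]; do
    # Placeholder for one unit of real work.
    step=$((step + 1))
    # Save after every step so progress survives even without the signal.
    save_checkpoint "$step"
done

echo "finished at step $step"
```

Saving after each step means the job loses at most one unit of work when it is requeued, regardless of whether the signal reaches the script.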
How to Submit Jobs to the “preempt” Partition:
srun -p preempt -N1 -n 1 -c 4 -t 0-04:00:00 --pty $SHELL
OR:
sbatch -p preempt myjob.sl
OR add the following at the top of your batch script:
#SBATCH -p preempt
OR via the interactive command:
interactive -p preempt