Andromeda Cluster Q&A

1. How do I check my usage quota on the cluster?

To check your usage quota, log into the cluster and run the “acct-chk” script. This will display your login details, Slurm account information, and disk usage for your personal directories (/home/$userid and /scratch/$userid) as well as your group directory (/projects/$sharedproject). You can view a sample output at https://rs.bc.edu/checking-storage-quotas/.

For Faculty (PIs):
To identify large files in your project folder, use the following command:

find /projects/$sharedproject -size +20G -ls 

This will list all files larger than 20 GB in the /projects/$sharedproject directory.
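To see which subdirectories account for most of the usage, a du-based summary can also help. This is a general sketch that assumes the GNU du and sort utilities are available on the cluster (they normally are):

# Summarize space used by each top-level subdirectory, smallest to largest
du -h --max-depth=1 /projects/$sharedproject | sort -h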

2. What does it mean if I receive an email about exceeding the storage quota?

If you receive an email like the one below:

------------------------

IMPORTANT: Jobs submitted to Andromeda 2 will likely fail and access to OOD.bc.edu may be rejected if this situation is not resolved immediately.

PATH                  USED      SOFT LIMIT  HARD LIMIT  USAGE %  OWNER  GRACE PERIOD  TIME OVER SOFT LIMIT  STATUS
/projects:/johnchris  75.73 TB  75 TB       85 TB       100             7d 0:00:00h   8:57:21h              ACTIVE

------------------------

This means your storage usage has exceeded the soft quota in one or more directories.

The example above shows that /projects/johnchris has used 75.73 TB, which is 0.73 TB over the soft quota of 75 TB. The hard quota is set at 85 TB.

You are allowed to use up to 85 TB during a 2-week grace period, during which you will receive daily notifications from us.

Action Required:

The /home directory has a fixed quota of 50 GB, intended for personal data storage. This limit cannot be increased. To manage overages, please delete unnecessary files or relocate them to /projects/$sharedproject for shared access, or to /scratch/$userid, which is temporary and not backed up.
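For example, a large directory can be relocated from /home to project space with rsync; the directory name below is only a placeholder, and you should verify the copy before deleting the original:

# Copy the directory to project space, then remove the original once verified
rsync -av /home/$userid/large_dataset/ /projects/$sharedproject/large_dataset/
rm -rf /home/$userid/large_dataset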

For storage overages in /projects or /scratch, please submit a request through the RS TD ticket system. Before submitting, ensure that any unnecessary data has been removed.

/projects: After reviewing your data requirements, we will either increase the soft storage limit to 80% of the hard limit or adjust the hard limit directly.

/scratch: This space is meant for temporary files generated by jobs (often there is not enough space to store these files in home directories). It’s especially important to clean up unused files in this directory first, as this space is not backed up and files will not be cleared on your behalf. If additional space is still needed, submit a quota increase request via the RS TD ticket system. Based on your needs, we will either raise the soft limit to 80% of the hard limit or adjust the hard limit accordingly.
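As a starting point for scratch cleanup, you can list files that have not been accessed recently; the 30-day threshold below is only an example:

# List files under /scratch/$userid not accessed in the last 30 days
find /scratch/$userid -type f -atime +30 -ls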

3. What if I receive an email saying my home directory quota prevents login?

This means you’ve exceeded the 50 GB quota in your /home directory, and login access is blocked.

To resolve the issue, submit a TD ticket to request restoration of cluster access. Once access is restored, delete unnecessary data or move excess data to /projects/$sharedproject or /scratch/$userid.

Tip: If you’ve installed large packages (e.g., Python libraries) in /home, consider reinstalling them in /projects/$sharedproject or /scratch/$userid to prevent future quota issues.
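For example, assuming a standard Python 3 installation is available on the cluster (the environment name and package below are placeholders), a virtual environment created in project space keeps packages out of /home:

# Create and activate a virtual environment in project space instead of /home
python3 -m venv /projects/$sharedproject/envs/myenv
source /projects/$sharedproject/envs/myenv/bin/activate
# Packages now install under /projects, not /home
pip install numpy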

4. What does “Invalid account or account/partition combination specified” mean?

The error message “Invalid account or account/partition combination specified” typically means that the account or partition referenced in your Slurm job submission is not valid. 

In most cases (about 99%), this is due to an incorrect partition name. Check your Slurm script for the correct partition declaration, such as #SBATCH --partition=somepartition, and ensure you’re using one of the time-based partitions provided by Andromeda.
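As an illustration, a minimal batch script with a valid partition declaration might look like the following; the job name, resource values, and the short partition are examples only, so substitute a partition you actually have access to:

#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --partition=short
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

hostname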

In rarer cases, the issue may stem from your account setup. If your account is not properly configured or lacks access to certain partitions, you’ll need to contact us through a TD ticket to resolve it.

Another possible reason for this error is that your account may not have permission to access the specified partition. Some partitions are private and consist of dedicated nodes owned by individual faculty members or research groups. 

To find useful information about the available partitions on the cluster, you can use the sinfo command, which displays a list of partitions along with their status and accessibility.
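For example, a compact view of each partition’s availability, time limit, and node count can be printed with sinfo’s format options (the exact output will vary):

# Show partition name, availability, time limit, and node count
sinfo -o "%P %a %l %D"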

5. How many cores can I submit per user?

Each user can submit up to 3600 cores total. If your needs exceed this limit, please contact us through the RS TD ticket system. We’re happy to work with users to find solutions.
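To check how close you are to the limit, you can total the CPUs requested by your current jobs using squeue’s format options; this is a general sketch using standard Slurm options:

# Sum the CPUs requested by all of your running and pending jobs
squeue -u $USER -h -o "%C" | awk '{total += $1} END {print total " cores requested"}'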

6. Why is my job in the PD (Pending) state?

Jobs may remain in the PD (pending) state for several reasons.

When you run:

squeue | grep PD

You might see output like:

…..
574223          interacti  interact  johnchris  PD  0:00  1  (QOSMaxMemoryPerUser)
750253_[1-100]  medium     dg_int    gaoqc      PD  0:00  1  (QOSMaxCpuPerUserLimit)
….
….

This indicates your job is pending and waiting to be scheduled.

Common reasons:

  • ReqNodeNotAvail: Requested Node Not Available

Explanation: The node you requested is currently unavailable, often because it is undergoing scheduled maintenance or is part of a reservation on Andromeda.

Details: If your job’s requested walltime exceeds the time remaining before maintenance begins, SLURM will hold it to prevent interruption. It will resume once the maintenance window ends.

  • Resources – Insufficient CPUs, Memory, or GPUs

Explanation: Andromeda doesn’t currently have enough available resources to meet your job’s requirements.

Details: This includes not enough cores, memory, or GPUs across all nodes. Jobs will wait until resources are freed up by other jobs completing.

  • Priority – Waiting for Higher-Priority Jobs

Explanation: Our SLURM scheduler uses a fair share algorithm to determine job priority. This means your job might be queued behind others with higher priority.

Details: On Andromeda, job priority can be influenced by several factors, including job size, fairshare, partition, job age in the queue, and other scheduling policies.

  • Dependency – Waiting on Another Job

Explanation: Your job has a dependency on another job that hasn’t finished yet.

Details: This is common in workflows where jobs are chained. For example, you can manage dependencies using flags like --dependency=afterok:<job_id> to ensure jobs run in the desired order (see the example after this list).

For more information, please refer to the following guide: https://rs.bc.edu/job-pipelines-with-dependencies/

  • Node Status – Nodes Are DOWN, DRAINED, or Reserved

Explanation: The nodes required by your job are in a non-usable state.

Details: Nodes may be marked as DOWN (hardware issues), DRAINED (manually taken offline), or reserved for specific users or jobs.

  • QOSMaxCpuPerUserLimit – CPU Limit Exceeded

Explanation: Your job is requesting more CPUs than allowed under the current quota of 3600 cores per user.

Details: You can either reduce the number of requested CPUs or wait until currently running jobs complete and resources become available. Once within the limit, your job will be scheduled and run automatically.

  • QOSMaxJobsPerUserLimit – Concurrent Job Limit Exceeded

Explanation: You’ve reached the maximum number of concurrent interactive jobs allowed on the Andromeda cluster.

Details: Each user is limited to 2 interactive terminal jobs at a time. Please wait for one or more of your current jobs to finish before starting a new one.
If you need to run more jobs, consider using batch job submissions instead of interactive sessions.

  • QOSMaxMemoryPerUser – Memory Usage Limit Exceeded

Explanation: Your job’s memory request exceeds the per-user memory limit on interactive partition nodes.

Details: Each user is allocated a maximum of 64 GB of memory across up to two jobs running on interactive partitions. This limitation often causes issues when submitting interactive jobs with high memory requirements. To avoid this, consider reducing the requested memory, switching to batch job submissions (which offer more flexible resource allocation), or submitting through one of the time-based partitions (short, medium, or long), which may better accommodate higher memory needs.
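As an example of the job-dependency feature mentioned in the list above, two batch jobs can be chained so the second starts only if the first completes successfully; the script names here are placeholders:

# Submit the first job and capture its job ID
jobid=$(sbatch --parsable preprocess.sh)

# Submit the second job so it starts only after the first finishes successfully
sbatch --dependency=afterok:$jobid analysis.sh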

7. When will my PD (Pending) job start?

To find out when your PD job will start, use the following command:

squeue -j <job_id> --start

For example:

[johnchris@c014 2025.01.002]$ squeue -j 752867 --start
JOBID   PARTITION  NAME     USER       ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)
752867  short      ci-freq  johnchris  PD  2025-08-22T22:04:30  1      c223        (Priority)

This means: Your job is pending (PD) and is scheduled to start at 22:04:30 on August 22, 2025. The reason it’s pending is Priority, which usually means it’s waiting for higher-priority jobs to finish or for resources to become available.

8. I can SSH into the cluster using my BC Eagle ID via the terminal, but I’m unable to log in to ood.bc.edu (Open OnDemand). I keep getting an ‘incorrect password’ error. How can I resolve this?

There are two ways to access Andromeda. The first is through the terminal using SSH to connect to andromeda.bc.edu. The second is through a web-based interface using Open OnDemand at ood.bc.edu.

When logging in for the first time, your initial step should be to connect to andromeda.bc.edu via the terminal using your BC Eagle ID. Once logged in, you must change your password. This step is required before attempting to access Open OnDemand at ood.bc.edu.

Access to Open OnDemand will not work until your password has been changed, as initial passwords are set to expire immediately and must be updated on the login node first.
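In practice, the first login typically looks like the following; replace the placeholder with your own Eagle ID, and note that the password change may be handled either by a prompt at login or by the standard passwd command:

# Connect to the login node with your BC Eagle ID
ssh your_eagle_id@andromeda.bc.edu

# Change your password before trying ood.bc.edu
passwd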

9. Why and when should I use tmux on a Linux cluster?

You should use tmux whenever you’re running long-running or interactive jobs on a Linux cluster, especially over SSH. It helps keep your session alive even if your network connection drops, ensuring that your processes continue running uninterrupted. With tmux, you can detach from a session and reattach later, allowing you to resume work exactly where you left off. It also supports multiple terminal windows and panes within a single session, making it easier to multitask and manage complex workflows remotely. For example, if you’re transferring a large amount of data using rsync, tmux ensures the transfer continues even if your SSH session is interrupted. For more details, please take a look at A beginner’s guide to tmux.
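A typical tmux workflow for a long transfer might look like this; the session name and paths are placeholders:

# Start a named tmux session on the login node
tmux new -s transfer

# Inside the session, start the long-running transfer
rsync -av /scratch/$userid/results/ /projects/$sharedproject/results/

# Detach with Ctrl-b then d; the transfer keeps running
# Reattach later with:
tmux attach -t transfer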
