Mod:Hunt Research Group/Running jobs on the HPC
Introduction
The aim of this wiki is to help new users get started running jobs on Victoria's HPC Raapoi and to take you through:
- What changes you need in your .com files
- Setting up your directories and naming files
- Creating a run script
- Running your first job
An important resource is the Raapoi wiki
Your com file
- A file created on your Mac will not run on the HPC as it stands; it needs some additional information
- You need to add the following at the top of your .com file
- - %mem= for how much memory is required
- - %nprocs= for how many processors are required
- as a default you should use %mem=30GB and %nprocs=16
- the following is the first part of a test.com file set up for the HPC
%nprocs=16
%mem=30GB
%chk=test.chk
# hf/3-21g geom=connectivity

Title Card Required

0 1
C
- See the page on memory and disk for further information on non-default usage of CPUs and memory
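The header lines described above can also be added from the command line. The following is a minimal sketch, not a Raapoi or Gaussian tool: add_link0 is a hypothetical helper name, and the defaults (16 processors, 30GB) are simply the values suggested on this page. It only adds a line if the file does not already set that option.

```shell
#!/bin/bash
# add_link0 (hypothetical helper): prepend %nprocs and %mem to a .com file
# when they are not already present. Defaults follow this page's suggestion.
add_link0() {
    local com="$1" nprocs="${2:-16}" mem="${3:-30GB}"
    local tmp
    tmp="$(mktemp)"
    # Write each header line only if the file does not set it already.
    grep -qi '^%nprocs' "$com" || echo "%nprocs=${nprocs}" >> "$tmp"
    grep -qi '^%mem' "$com" || echo "%mem=${mem}" >> "$tmp"
    # Append the original input after the new header lines.
    cat "$com" >> "$tmp"
    mv "$tmp" "$com"
}

# Demonstration on a throwaway file that mimics a .com written on a Mac.
demo="$(mktemp -d)/test.com"
printf '%%chk=test.chk\n# hf/3-21g geom=connectivity\n\nTitle Card Required\n\n0 1\n' > "$demo"
add_link0 "$demo"
head -n 3 "$demo"
```

Running the helper a second time on the same file leaves it unchanged, so it should be safe to apply to a whole folder of inputs.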
Set up your directories
- You have 2 directories on Raapoi: one is on /nfs/home and one is on /nfs/scratch
- - nfs stands for network file system
- your "home" directory is on the login node
- - use your home directory to stage your jobs and to back up completed jobs
- - the home directory is backed up
- your "scratch" directory is on a huge shared disk
- - use your scratch directory to run your jobs
- - the scratch directory is NOT BACKED UP
- - this means you should assume that any file on scratch could be lost
- organise your files (YES DO IT NOW)
- - each molecule should have its own folder
- - inside each folder have ALL the different conformers; use the filename to differentiate them, e.g. add _xx.com etc. where xx designates the conformer
- - NEVER delete a *.log unless Tricia says it's OK
- - I add _1, _2, _3 etc. for each run of a particular conformer
- - in the file name indicate if the job is
- - - opt (optimisation)
- - - fopt (frequency and optimisation)
- - - freq (frequency only)
- - - pop (some form of population analysis)
- - - nmr (NMR computation)
- some examples for a water dimer: dimer A took 3 runs to fully optimise, then a frequency analysis was carried out (and a minimum confirmed), followed by a population analysis. Dimer B has only just been started and has run once.
water_dimer_A_opt_1.com
water_dimer_A_opt_2.com
water_dimer_A_opt_3.com
water_dimer_B_opt_1.com
water_dimer_A_freq.com
water_dimer_A_pop.com
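The run-numbering convention above can be automated. The sketch below assumes the names follow the pattern stem_N.com inside the molecule's folder; next_run_name is a hypothetical helper name, not something installed on Raapoi.

```shell
#!/bin/bash
# next_run_name (hypothetical helper): print the next unused file name of the
# form <stem>_<N>.com in the given folder, counting up from 1.
next_run_name() {
    local dir="$1" stem="$2" n=1
    while [ -e "${dir}/${stem}_${n}.com" ]; do
        n=$((n + 1))
    done
    echo "${stem}_${n}.com"
}

# Demonstration in a throwaway folder: dimer A has already been run twice.
demo="$(mktemp -d)"
mkdir -p "${demo}/water_dimer"
touch "${demo}/water_dimer/water_dimer_A_opt_1.com" \
      "${demo}/water_dimer/water_dimer_A_opt_2.com"

next_run_name "${demo}/water_dimer" water_dimer_A_opt   # prints water_dimer_A_opt_3.com
next_run_name "${demo}/water_dimer" water_dimer_B_opt   # prints water_dimer_B_opt_1.com
```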
Runscript
- You need to submit your job through a batch queuing system or scheduler
- copy the runscript given below into a file called rung16.sh
- place rung16.sh file in the same directory as your com file
- change the following
- - abc=your job name
- - username=your user name
- you can use vi global replace to change all abc and all username
- - in command mode type
:%s/search_string/replacement_string/g
- then you will need to run the script, which will submit your job to the batch processing facility
- - type
sbatch rung16.sh
- - if successful you will see something like
Submitted batch job 295704
- runscript
#!/bin/bash
#SBATCH --job-name=abc
#SBATCH --cpus-per-task=16
#SBATCH --mem=32GB
#SBATCH --partition=quicktest
#SBATCH --time=01:00:00
#SBATCH -o /nfs/scratch/username/abc.log
#SBATCH -e /nfs/scratch/username/abc.err

cp /nfs/home/username/abc.com /nfs/scratch/username/abc.com
test -r abc.chk
if [ $? -eq 0 ]
then
    cp /nfs/home/username/abc.chk /nfs/scratch/username/abc.chk
fi

cd /nfs/scratch/username/
module --quiet purge
module load gaussian/g16
g16 abc.com

test -r abc.log
if [ $? -eq 0 ]
then
    cp /nfs/scratch/username/abc.log /nfs/home/username/abc.log
fi
test -r abc.chk
if [ $? -eq 0 ]
then
    cp /nfs/scratch/username/abc.chk /nfs/home/username/abc.chk
fi
- running the job on a specific node
- - add the following line in your runscript
#SBATCH --nodelist=node_name
- - node_name can refer to any node, such as itl02n01, itl02n02, etc.
- - you can specify more than one node in the list, separating them with commas
- - the job scheduler will allocate the nodes explicitly listed in the option; listing several nodes requests all of them together, not whichever one happens to be free
- - if the specified nodes are not available, the job will wait in the queue until those nodes become available
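The abc and username substitutions described earlier in this section can also be made without opening vi. The following is a minimal sketch using sed: make_runscript is a hypothetical helper name, smithj is a made-up user name, and the three-line template is only a cut-down stand-in for the full runscript. The result is written to a new file so the template stays reusable.

```shell
#!/bin/bash
# make_runscript (hypothetical helper): fill the abc and username placeholders
# of a runscript template and print the result to standard output.
make_runscript() {
    local template="$1" job="$2" user="$3"
    sed -e "s/abc/${job}/g" -e "s/username/${user}/g" "$template"
}

# Demonstration with a cut-down template in a throwaway folder.
demo="$(mktemp -d)"
cat > "${demo}/rung16.sh" <<'EOF'
#SBATCH --job-name=abc
#SBATCH -o /nfs/scratch/username/abc.log
cp /nfs/home/username/abc.com /nfs/scratch/username/abc.com
EOF

make_runscript "${demo}/rung16.sh" water_dimer_A_opt_1 smithj \
    > "${demo}/run_water_dimer_A_opt_1.sh"
cat "${demo}/run_water_dimer_A_opt_1.sh"
```

As with the vi command, every occurrence of abc is replaced, so this only works cleanly if the template contains abc solely as the placeholder and the job name itself does not contain the word username.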
Monitoring your Job
Now that your job has been submitted, you can monitor it with the command vuw-myjobs, which gives you the status of your jobs in the queues. Useful commands include:
- vuw-myjobs to get your jobs that are running
- vuw-alljobs to get a list of all the jobs that are running
- vuw-job-history to get a quick view of all the jobs completed within the last 5 days
To delete a job from the queue you can use the command:
scancel [jobID]
The jobID is the first number on each line when you view the running jobs; for example, the jobIDs of the two jobs below are 472471 and 472470
QUICK TEST PARTITION QUEUE: quicktest (Default partition)
JOBID   NAME      USER      CPUS  MIN_MEM  TIME   TIME_LEFT  STATE    NODELIST(REASON)
472471  DOS_Full  holmeswi  64    100G     17:03  42:57      RUNNING  itl02n02
472470  DOS_Full  holmeswi  64    100G     17:07  42:53      RUNNING  itl02n01
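If you need the jobIDs on their own, for example to pass to scancel, they can be pulled out of a listing like the one above. This is a rough sketch that assumes the live output has the same layout as the sample (jobID as the first whitespace-separated field); job_ids is a hypothetical helper name.

```shell
#!/bin/bash
# job_ids (hypothetical helper): read a job listing on standard input and
# print the first field of every line whose first field is purely numeric.
job_ids() {
    awk '$1 ~ /^[0-9]+$/ { print $1 }'
}

# The sample listing from this page, used here instead of a live queue.
listing='QUICK TEST PARTITION QUEUE: quicktest (Default partition)
JOBID NAME USER CPUS MIN_MEM TIME TIME_LEFT STATE NODELIST(REASON)
472471 DOS_Full holmeswi 64 100G 17:03 42:57 RUNNING itl02n02
472470 DOS_Full holmeswi 64 100G 17:07 42:53 RUNNING itl02n01'

printf '%s\n' "$listing" | job_ids
```

On Raapoi the same function could be fed from the real command, as in vuw-myjobs | job_ids, provided the output format matches the sample.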
To delete all your jobs from the queue, where [user] is your username:
scancel -u [user]
Keep checking your job until it has run.
If your job has run and "completed", the .log file should be copied back to your working directory; check it to see whether your job was successful. A job can finish and NOT complete successfully, so it is up to you to check each "completed" job to make sure it is OK. The job can "complete" with an error message; see the "checking your job" section if there has been a problem.
You will also find a file with the extension .err. If there has been an error it will be detailed in this file, along with the resources requested and used by your job.
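A quick first check of a returned .log file can be scripted. Gaussian ends a successful run with a line beginning "Normal termination of Gaussian" and reports failures with a line containing "Error termination". The sketch below relies only on those two phrases; check_log is a hypothetical helper name, and the demonstration logs are one-line stand-ins, not real output. It does not replace reading the log properly.

```shell
#!/bin/bash
# check_log (hypothetical helper): classify a Gaussian log file as
# ok, error, incomplete or missing, based on its termination message.
check_log() {
    local log="$1"
    if [ ! -r "$log" ]; then
        echo "missing"
    elif tail -n 5 "$log" | grep -q "Normal termination"; then
        # The last few lines report a normal finish.
        echo "ok"
    elif grep -q "Error termination" "$log"; then
        echo "error"
    else
        # No termination message: still running, or killed (e.g. out of time).
        echo "incomplete"
    fi
}

# Demonstration with shortened stand-in logs in a throwaway folder.
demo="$(mktemp -d)"
echo " Normal termination of Gaussian 16." > "${demo}/good.log"
echo " Error termination via Lnk1e." > "${demo}/bad.log"

check_log "${demo}/good.log"   # prints ok
check_log "${demo}/bad.log"    # prints error
check_log "${demo}/none.log"   # prints missing
```

Only the end of the file is checked for the normal message because a multi-step job writes one termination line per step, and it is the final one that matters.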