Mod:Hunt Research Group/Running jobs on Raapoi sub

Introduction

The aim of this wiki is to help new users get started running Gaussian 16, ORCA 6.1.1, and CREST jobs on VUW's HPC Raapoi in an automated manner using a script.

An important resource is the Raapoi wiki

Your input file

  • To use the sub script, you need an input file for one of the supported programs (Gaussian 16, ORCA, or CREST)
  • The extension of the input file does not matter. The script can handle any extension (e.g., .gjf, .com, .inp, .toml, etc.)
  • You need to be in the same directory as your input file to run the job
  • See the page on memory and disk for further information on non-default usage of CPUs and memory
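For a Gaussian input, the sub script listed further down this page replaces any existing %NProcShared and %mem lines with values derived from the command-line options. The sketch below imitates that rewrite on a throwaway copy so you can see the effect before submitting anything. The function name patch_link0, the file name, the geometry, and the values 8 and 13GB are illustrative assumptions. Unlike sub, this sketch matches case-sensitively and does not handle Link1 sections.

```shell
#!/usr/bin/env bash
# Sketch of the Link 0 rewrite that sub applies to a Gaussian input.
# patch_link0 is a hypothetical name; file name and values are examples only.

patch_link0() {
  # $1 = input file, $2 = number of CPUs, $3 = memory string such as 13GB
  awk -v proc="%NProcShared=$2" -v memv="%mem=$3" '
    # drop existing CPU and memory lines (case-sensitive in this sketch)
    /^[ \t]*%NProcShared/ || /^[ \t]*%mem/ { next }
    # print the new settings once, before the first remaining line
    !printed { print proc; print memv; printed = 1 }
    { print }
  ' "$1"
}

workdir="$(mktemp -d)"
cat > "$workdir/water_opt.com" <<'EOF'
%chk=water_opt.chk
%mem=1GB
%NProcShared=2
# opt b3lyp/6-31g(d)

water optimisation (illustrative geometry)

0 1
O  0.000  0.000  0.117
H  0.000  0.757 -0.471
H  0.000 -0.757 -0.471

EOF

patched="$(patch_link0 "$workdir/water_opt.com" 8 13GB)"
printf '%s\n' "$patched" | head -n 4
rm -rf "$workdir"
```

The first four lines printed are the new %NProcShared=8 and %mem=13GB lines, followed by the untouched %chk line and the route line. Note that sub overwrites your input file in place, so any CPU or memory settings you typed by hand are lost.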

Set up your directories

  • You have two directories on Raapoi: one on /nfs/home and one on /nfs/scratch
- nfs stands for network file system
  • your "home" directory is on the login node
- use your home directory to stage your jobs and to back up completed jobs
- the home directory is backed-up
  • your "scratch" directory is on a huge shared disk
- use your scratch directory to run your jobs
- the scratch directory is NOT BACKED-UP
- this means you should assume that all the files on scratch could potentially be lost
  • Create the following directory in your home directory
    • scripts/jobs_info (the full path should be /nfs/home/${USER}/scripts/jobs_info)
  • Create the following directory in your scratch directory
    • Gaussian_Scratch (the full path should be /nfs/scratch/${USER}/Gaussian_Scratch)
  • organise your files (YES DO IT NOW)
- each molecule should have its own folder
- inside each folder keep ALL the different conformers and use the filename to differentiate them, e.g., add _xx.com, where xx designates the conformer
- I add an _1 _2 _3 etc for each run of a particular conformer
- in the file name indicate if the job is
opt (optimisation)
fopt (frequency and optimisation)
freq (frequency only)
pop (some form of population analysis)
nmr (nmr computation)
  • some examples for a water dimer: dimer A took 3 runs to fully optimise, then a frequency analysis was carried out (confirming a minimum), followed by a population analysis. Dimer B has only just been started and has run once.
water_dimer_A_opt_1.com
water_dimer_A_opt_2.com
water_dimer_A_opt_3.com
water_dimer_B_opt_1.com
water_dimer_A_freq.com
water_dimer_A_pop.com
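The directory setup and the run-numbering convention above can be sketched as two small shell functions. Both make_sub_dirs and next_run_name are hypothetical helpers written for this page, not part of the sub script. The roots are passed as arguments so you can try the functions in a test folder first.

```shell
#!/usr/bin/env bash
# Sketch only: make_sub_dirs and next_run_name are hypothetical helpers.

# Create the two directories that sub expects.
# $1 = home root, $2 = scratch root
make_sub_dirs() {
  mkdir -p "$1/scripts/jobs_info" "$2/Gaussian_Scratch"
}

# Print the next unused run name for a conformer and job type,
# looking at the files in the current directory.
# $1 = base name (e.g. water_dimer_A), $2 = job type (e.g. opt), $3 = extension
next_run_name() {
  local n=1
  while [ -e "${1}_${2}_${n}.${3}" ]; do
    n=$((n + 1))
  done
  printf '%s_%s_%d.%s\n' "$1" "$2" "$n" "$3"
}

# On Raapoi the calls would be:
#   make_sub_dirs "/nfs/home/${USER}" "/nfs/scratch/${USER}"
#   next_run_name water_dimer_A opt com
```

In a folder that already holds water_dimer_A_opt_1.com to water_dimer_A_opt_3.com, next_run_name water_dimer_A opt com prints water_dimer_A_opt_4.com.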

sub Script

  • The sub script creates the files for the batch queueing system in the background (in scripts/jobs_info) and removes the job's scratch directory after the job completes
  • save this script to the scripts directory in your home
cd scripts
nano sub
  • Copy the script below and save the file
#!/usr/bin/env bash
#
# Usage: sub [-t <time>] [-P <CPUs>] [-p <prog>] [-m <memory>] [-d <disk>] [-q <partition>] <filename.ext>
# Make the script executable with "chmod u+x sub" and type sub to see the help
#
set -euo pipefail

usage()
{
 cat <<EOF
#######################################################################
#  sub - M.A.Hashmi's general purpose script for SLURM job submission #
#######################################################################

 Syntax:
  sub [-t <time>] [-P <CPUs>] [-p <prog>] [-m <memory>] [-d <disk>] [-q <partition>] <filename.ext>

    <time>      Job time in hh:mm:ss (Default $time).
    <cpunr>     Request a parallel job running on <cpunr> CPUs (Default $cpunr).
    <prog>      Program to use (Default $prog)
		Possible programs:
		g16 orca (ORCA 6.1.1 by default; orca_6.1.0 selects ORCA 6.1.0) crest
		(any Python script can be run with "-p python filename.py")
		crest needs a toml file (https://crest-lab.github.io/crest-docs/page/documentation/inputfiles.html)
    <memory>    Requests RAM in MB per processor (no suffix, Default $mem).
    <disk>      Requests diskspace per processor (MB or GB suffix, Default $disk).
    <partition> Requests the job for a specific partition, e.g., quicktest, parallel, etc. The default is $partition.
              
    Limitation: You have to be in the same directory as your input file when submitting the job.
    Important:  Create the following directories once before using the script:
                "/nfs/home/${USER}/scripts/jobs_info"
                "/nfs/scratch/${USER}/Gaussian_Scratch"
#######################################################################
	
EOF
exit 0
}

set_defaults()
{
# Set defaults for the queueing system
time="05:00:00"	# Time limit for the job
cpunr=8			# Number of CPUs
prog="g16"		# Program to use
mem=2000		# MB per CPU
disk="10gb"		# Storage limit
partition="quicktest"

# Set some default directories for this script
stddir="/nfs/home/${USER}/scripts/jobs_info"
SCRATCHDIR_TEMPLATE='/nfs/scratch/'"${USER}"'/$SLURM_JOB_ID'
G16SCRATCHDIR="/nfs/scratch/${USER}/Gaussian_Scratch"
}

# Recalculate total memory for SLURM after parsing arguments
recalculate_memory()
{
  # mem is in MB per CPU. total_mem in GB (integer)
  total_mem=$(( (cpunr * mem) / 1000 + 2 )) # Add 2GB buffer for SLURM request
}

#########################
# start
#########################
set_defaults

[ -z "${1:-}" ] && usage

# Argument parsing using getopts
while getopts ":t:P:p:m:d:q:h" opt; do
  case $opt in
    t) time="$OPTARG" ;;
    P) cpunr="$OPTARG" ;;
    p) prog="$OPTARG" ;;
    m) mem="$OPTARG" ;;
    d) disk="$OPTARG" ;;
    q) partition="$OPTARG" ;;
    h) usage ;;
    :) echo "Option -$OPTARG requires an argument." >&2; exit 1 ;;
    \?) echo "Invalid option: -$OPTARG" >&2; usage ;;
  esac
done
shift $((OPTIND-1))

# Check if an input file was provided
if [ -z "${1:-}" ]; then
  echo "Error: No input file provided." >&2
  exit 1
fi

inputfile=$1
if [ ! -f "$inputfile" ]; then
  echo "Error: input file '$inputfile' not found." >&2
  exit 1
fi

jobname="${inputfile%.*}"

# ensure stddir exists
mkdir -p "$stddir"

# Recalculate memory
recalculate_memory

# safer mktemp usage
submitfile="$(mktemp "$stddir/${jobname}.${USER}.XXXXXX")"
tmpfile="$(mktemp "$stddir/${jobname}.tmp.XXXXXX")"

# Echo summary
#echo "Preparing job for: program=$prog ; input=$inputfile ; jobname=$jobname"
#echo "CPUs=$cpunr ; per-CPU mem=${mem}MB ; Slurm total memory=${total_mem}GB"

################################################################################
# Functions to patch inputs for Gaussian and ORCA
################################################################################

# Patch Gaussian: set %NProcShared and %mem (insert at file top and before Link1 blocks)
patch_gaussian()
{
  local infile="$1"
  local outfile="$2"
  local nproc="$3"
  local memstr="$4"   # e.g., "14GB"

  awk -v proc="%NProcShared=${nproc}" -v memv="%mem=${memstr}" '
    BEGIN { IGNORECASE=1; header_printed=0 }
    # skip existing NProcShared / NProcLinda / %mem lines
    /^[ \t]*%NProcShared/ { next }
    /^[ \t]*%NProcLinda/ { next }
    /^[ \t]*%mem/ { next }
    # before printing first real line, print header once
    { if (!header_printed) { print proc; print memv; header_printed=1 } }
    # when hitting a Link1 separator, print the separator and the header lines again
    /^ *-+ *Link1 *-*/ { print; print proc; print memv; next }
    { print }
  ' "$infile" > "$outfile"
}

# Patch ORCA: remove existing %MaxCore and nprocs and insert new ones
# NOTE: this uses a safe two-step approach to avoid complicated in-script quoting issues
patch_orca()
{
  local infile="$1"
  local outfile="$2"
  local nproc="$3"
  local maxcore_mb="$4"

  local pal_line="%pal nprocs ${nproc} end"
  local maxcore_line="%MaxCore ${maxcore_mb}"

  # Step 1: remove old lines that mention %MaxCore or 'nprocs' (case-insensitive)
  # write to a temp file
  local t1
  t1="$(mktemp "$stddir/patch_orca.XXXXXX")"
  grep -vi "%MaxCore" "$infile" | grep -vi "nprocs" > "$t1" || true

  # Step 2: insert pal_line and maxcore_line at the top of the file and again
  # after every "$new_job" line, so that the first job block and each
  # subsequent one get the new settings.
  # Use awk with -v variables (safe from shell quoting issues).
  awk -v pal="$pal_line" -v maxc="$maxcore_line" '
    BEGIN { print maxc; print pal }
    /^\$new_job/ { print; print pal; print maxc; next }
    { print }
  ' "$t1" > "$outfile"

  rm -f "$t1"
}

################################################################################
# Patch requested input file depending on program
################################################################################

case "$prog" in
  g03|g09|g16)
    # Prepare Gaussian memory string: gaussian uses 70% of total_mem (GB)
    GAUSS_MEM_GB=$(awk "BEGIN { printf(\"%d\", ${total_mem} * 0.7 + 0.5) }")
    GAUSS_MEM_STR="${GAUSS_MEM_GB}GB"

    # remove old %mem and %NProcShared etc, then insert new
    cp "$inputfile" "$tmpfile"
    patch_gaussian "$tmpfile" "$submitfile" "$cpunr" "$GAUSS_MEM_STR"
    mv -f "$submitfile" "$inputfile"
    ;;
  orca|orca5|orca_6.1.0|orca_6.1.1)
    # ORCA patching: set %pal nprocs and %MaxCore
    ORCA_MAXCORE_MB="$mem"   # mem is MB per CPU
    cp "$inputfile" "$tmpfile"
    patch_orca "$tmpfile" "$submitfile" "$cpunr" "$ORCA_MAXCORE_MB"
    mv -f "$submitfile" "$inputfile"
    ;;
  crest|script|python)
    # nothing to patch for these
    ;;
  *)
    echo "Warning: unknown program '$prog' — script will continue and create a SLURM script."
    ;;
esac

################################################################################
# Create SLURM submitfile content
################################################################################

cat > "$submitfile" <<EOF
#!/bin/bash
# SLURM submission script generated by sub
#SBATCH --job-name=${jobname}
#SBATCH --output=${PWD}/${jobname}.stdout
#SBATCH --error=${PWD}/${jobname}.stderr
#SBATCH --ntasks=${cpunr}
#SBATCH --nodes=1
#SBATCH --time=${time}
#SBATCH --mem=${total_mem}G
#SBATCH --partition=${partition}
#SBATCH --kill-on-invalid-dep=yes

echo "The Job ID is: " \$SLURM_JOB_ID
echo "---- The Job is executed NOW ----"
date
echo "---------------------------------"

EOF

################################################################################
# Program-specific runtime sections
################################################################################

SCRATCHDIR="$SCRATCHDIR_TEMPLATE"

#################################################################
################# Specifics for running g16 #####################
#################################################################
if [ "$prog" = "g16" ]; then

  # Extract referenced .chk files (case-insensitive)
  mapfile -t chk_files_array < <(
      grep -Ei '^\s*%(chk|oldchk)\s*=' "$PWD/$inputfile" \
      | sed -E 's/^\s*%(chk|oldchk)\s*=//I' \
      | sed 's/\r//g' \
      | sed 's/^\s*//; s/\s*$//'
  )

  # Convert array to space-separated string for safe here-doc
  chk_files_string="${chk_files_array[*]}"

cat >> "$submitfile" <<EOF

mkdir -p "${SCRATCHDIR}"
cd "${SCRATCHDIR}"

# Copy input file
cp "$PWD/$inputfile" .

# Copy referenced checkpoint files
for cf in $chk_files_string; do
    [ -z "\$cf" ] && continue
    if [ -f "\$cf" ]; then
        echo "Copying chk file: \$cf"
        cp "\$cf" .
    elif [ -f "$PWD/\$cf" ]; then
        echo "Copying chk file: $PWD/\$cf"
        cp "$PWD/\$cf" .
    else
        echo "Warning: Referenced checkpoint file \$cf not found."
    fi
done

# Load Gaussian 16
module load gaussian/g16
export GAUSS_SCRDIR="${SCRATCHDIR}"

# Run Gaussian in scratch
nice g16 < "$inputfile" > "${jobname}.log" 2>&1

# Copy back relevant files from scratch to PWD
for ext in out log chk wfn wfx fchk; do
    if compgen -G "*.\$ext" > /dev/null; then
        cp -f *.\$ext "$PWD" 2>/dev/null || true
    fi
done

# Clean up scratch
rm -rf "${SCRATCHDIR}"

EOF
fi

#################################################################
################# Specifics for running ORCA ####################
#################################################################
if [[ "$prog" == "orca" || "$prog" == "orca_6.1.0" || "$prog" == "orca_6.1.1" ]]; then

    # Determine ORCA binary path before writing SLURM script
    if [[ "$prog" == "orca_6.1.0" ]]; then
        ORCA_BIN="/home/software/EasyBuild/software/ORCA/6.1.0-foss-2023b-avx2-xtb/bin/orca"
        MODULES="module load GCC/13.2.0
module load xtb/6.7.1
module load OpenMPI/4.1.6
module load ORCA/6.1.0-avx2-xtb"
    else
        ORCA_BIN="/home/software/EasyBuild/software/ORCA/6.1.1-gompi-2023b-avx2/bin/orca"
        MODULES="module load gompi/2023b
module load ORCA/6.1.1-avx2"
    fi

    # Write SLURM submission script
    cat >> "$submitfile" <<EOF
mkdir -p "$SCRATCHDIR"
cd "$SCRATCHDIR"

# Copy input file
cp "$PWD/$inputfile" .

# === Extract referenced .gbw files from %Moinp (case-insensitive) ===
mapfile -t moinp_files < <(
    grep -Ei '^\s*%moinp\s*["'\'']?.+["'\'']?' "$PWD/$inputfile" \
    | sed -E 's/^\s*%moinp\s*["'\'']?//I' \
    | sed -E 's/["'\'']?\s*$//' \
    | sed 's/\r//g'
)

for gbw in "\${moinp_files[@]}"; do
    [ -z "\$gbw" ] && continue
    if [ -f "\$gbw" ]; then
        cp "\$gbw" .
    elif [ -f "$PWD/\$gbw" ]; then
        cp "$PWD/\$gbw" .
    else
        echo "Warning: ORCA MO read file '\$gbw' not found"
    fi
done

# Load ORCA modules
$MODULES

export OMPI_MCA_btl='^uct,ofi'
export OMPI_MCA_pml='ucx'
export OMPI_MCA_mtl='^ofi'

# Run ORCA in scratch
nice "$ORCA_BIN" "$(basename "$inputfile")" > "${jobname}.out" 2>&1

# Copy back relevant files from scratch to PWD
for ext in out plt cub wfn wfx xyz gbw bibtex engrad densities txt cpcm smd hess; do
    if compgen -G "*.\$ext" > /dev/null; then
        cp -f *.\$ext "$PWD" 2>/dev/null || true
    fi
done

# Clean up scratch
rm -rf "$SCRATCHDIR"

EOF
fi

#################################################################
################# Specifics for running CREST ###################
#################################################################
if [[ "$prog" == "crest" ]]; then

    # Extract XYZ filename from TOML input file (case-insensitive)
    xyz_file=$(
        grep -Ei "input\s*=\s*['\"]([^'\"]+)['\"]" "$inputfile" \
        | sed -E "s/.*input\s*=\s*['\"]([^'\"]+)['\"].*/\1/I"
    )

    if [[ -n "$xyz_file" ]]; then
        echo "Detected XYZ file for CREST: $xyz_file"
    else
        echo "Warning: No 'input=...xyz' found inside $inputfile"
    fi

cat >> "$submitfile" <<EOF

mkdir -p "$SCRATCHDIR"
cd "$SCRATCHDIR"

# Copy TOML input
cp "$PWD/$inputfile" .

# Copy XYZ structure if found
if [[ -n "$xyz_file" ]]; then
    if [ -f "$xyz_file" ]; then
        cp "$xyz_file" .
    elif [ -f "$PWD/$xyz_file" ]; then
        cp "$PWD/$xyz_file" .
    else
        echo "Warning: CREST structure file '$xyz_file' not found"
    fi
fi

# Load CREST and XTB
module purge
module load GCC/12.2.0
module load CREST/3.0.1
module load xtb/6.6.1

# === Recommended stack settings for CREST ===
ulimit -s unlimited
export OMP_STACKSIZE=2G
export OMP_NUM_THREADS=${cpunr}
export OMP_MAX_ACTIVE_LEVELS=1

echo "OMP settings:"
echo "  OMP_NUM_THREADS=\$OMP_NUM_THREADS"
echo "  OMP_STACKSIZE=\$OMP_STACKSIZE"
echo "  OMP_MAX_ACTIVE_LEVELS=\$OMP_MAX_ACTIVE_LEVELS"

# Run CREST
crest "$inputfile" > "${jobname}.out" 2>&1

# Copy all output back
rm -f *.tmp*
cp -f * "$PWD" 2>/dev/null || true

# Clean up scratch
rm -rf "$SCRATCHDIR"

EOF
fi

#################################################################
############# Specifics for running a Python Script #############
#################################################################
if [ "$prog" = "python" ]; then
cat >> "$submitfile" <<EOF

mkdir -p "${SCRATCHDIR}"
cd "${SCRATCHDIR}"

cp "$PWD/$inputfile" .

nice python3 "$inputfile" > "${jobname}.log" 2>&1 || true

cp -f * "$PWD" || true
rm -rf "${SCRATCHDIR}"

EOF
fi

##################################################################
# Footer and submit
##################################################################

cat >> "$submitfile" <<'EOF'

echo ""
echo "---- The Job has finished NOW ----"
echo ""

date
echo ""

# Wait a bit to ensure Slurm updates job accounting
sleep 5

# Append SLURM accounting information to the same stdout file
if command -v sacct >/dev/null 2>&1; then
  sacct -j "${SLURM_JOB_ID}" --format=JobID,Elapsed,TotalCPU || true
fi

echo ""
EOF

echo "Job Time: $time, CPUs: $cpunr, Program: $prog, Memory (per-CPU): ${mem}MB, SLURM total: ${total_mem}GB, Disk: $disk, Partition: $partition"

# submit
sbatch "$submitfile"

# cleanup temp file(s)
#rm -f "$tmpfile"

exit 0

  • set permissions on the script
    chmod u+x sub
  • Define the sub script as an alias in your .bashrc file so that you can use it from anywhere. To do so, edit your .bashrc file in your home directory on Raapoi.
nano ~/.bashrc
  • Add the following line at the end (assuming that your sub script is in /nfs/home/your_username/scripts). Replace your_username with your real user name.
alias sub='/nfs/home/your_username/scripts/sub'
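If you prefer not to edit .bashrc by hand, the alias can be appended from the command line. The sketch below is one way to do it without adding the line twice. The function add_sub_alias is a hypothetical helper and the paths are assumptions taken from the layout described above.

```shell
#!/usr/bin/env bash
# Sketch only: add_sub_alias is a hypothetical helper, not part of sub.
# $1 = rc file to edit (e.g. ~/.bashrc), $2 = full path to the sub script
add_sub_alias() {
  local rcfile="$1"
  local line="alias sub='$2'"
  # -x matches whole lines only, -F treats the line as a fixed string
  if ! grep -qxF -- "$line" "$rcfile" 2>/dev/null; then
    printf '%s\n' "$line" >> "$rcfile"
  fi
}

# On Raapoi the calls would be:
#   add_sub_alias "$HOME/.bashrc" "/nfs/home/${USER}/scripts/sub"
#   source "$HOME/.bashrc"
```

The alias only becomes available in shells started after the change, which is why the usage comment reloads .bashrc with source.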

How to use the sub Script

  • Using the sub script is simple. If you type sub and press Enter, it prints the usage syntax, which is also given below.
  • cd into the directory where your calculation input file is. For example I have a Gaussian input file called water_opt.com that I will run with it.
  • The usage syntax is as follows:
sub [-t <time>] [-P <CPUs>] [-p <prog>] [-m <memory>] [-d <disk>] [-q <partition>] <filename.ext>
  • where the arguments (-t, -P, etc.) have the following meaning
    • -t sets the time limit in the format hh:mm:ss, e.g. 24:00:00 (the default is 05:00:00)
    • -P sets the number of processors, e.g. 16 (the default is 8)
    • -p sets the program to use (e.g., g16, orca, crest). The default is g16
    • -m sets the memory per processor in MB (the default is 2000)
    • -d sets the disk space with an MB or GB suffix (the default is 10gb)
    • -q sets the partition where you want to run the job (e.g., quicktest, parallel, bigmem, longrun). The default is quicktest
    • at the end, give the name of the input file with its extension (e.g., water_opt.com)
  • So if I want to run my file water_opt.com with 16 processors and 2GB memory per processor on the parallel partition, I would run this command.
sub -t 24:00:00 -P 16 -m 2000 -q parallel water_opt.com
  • Similarly, if I want to run the ORCA calculation water_opt.inp with 16 processors and 2GB memory per processor on the parallel partition, I would run this command.
sub -t 24:00:00 -P 16 -m 2000 -p orca -q parallel water_opt.inp
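It can help to know what these options turn into. The sketch below repeats the memory arithmetic used inside the sub script: the SLURM request is CPUs times MB per CPU, converted to GB with a 2 GB buffer, Gaussian receives 70% of that total as %mem, and ORCA receives the per-CPU value as %MaxCore. The function name sub_memory is a hypothetical wrapper added for illustration.

```shell
#!/usr/bin/env bash
# Reproduces the memory arithmetic from the sub script for a given request.
# sub_memory is a hypothetical wrapper; the formulas are those used in sub.
# $1 = CPUs (-P), $2 = memory per CPU in MB (-m)
sub_memory() {
  local cpus="$1" mem_mb="$2"
  # Integer GB requested from SLURM, including the 2 GB buffer
  local total_gb=$(( (cpus * mem_mb) / 1000 + 2 ))
  # Gaussian %mem: 70% of the SLURM total, rounded as in sub
  local gauss_gb
  gauss_gb=$(awk "BEGIN { printf(\"%d\", ${total_gb} * 0.7 + 0.5) }")
  echo "SLURM --mem=${total_gb}G, Gaussian %mem=${gauss_gb}GB, ORCA %MaxCore ${mem_mb}"
}

sub_memory 16 2000   # the two example commands above
sub_memory 8 2000    # the defaults
```

For the example commands above (-P 16 -m 2000) this gives a SLURM request of 34G and a Gaussian %mem of 24GB. With the defaults it gives 18G and 13GB.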

Monitoring your Job

Now that your job has been submitted, you can monitor it with the command vuw-myjobs, which gives you the status of your jobs in the queues. Useful commands include:

vuw-myjobs to get your jobs that are running
vuw-alljobs to get a list of all the jobs that are running
vuw-job-history to get a quick view of all the jobs completed within the last 5 days

To delete a job from the queue you can use the command:

scancel [jobID]

The jobID is the first number shown when you view the running jobs. For example, the jobIDs of the two jobs below are 472471 and 472470

QUICK TEST PARTITION QUEUE: quicktest (Default partition) 
          JOBID             NAME       USER    CPUS MIN_MEM         TIME    TIME_LEFT      STATE NODELIST(REASON)
         472471         DOS_Full   holmeswi      64    100G        17:03        42:57    RUNNING itl02n02
         472470         DOS_Full   holmeswi      64    100G        17:07        42:53    RUNNING itl02n01
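If you need the jobIDs on their own, for example to pass them to another command, they can be pulled out of a listing like the one above. This is a minimal sketch that assumes the vuw-myjobs output keeps the column layout shown here, with the jobID as the first field. The function extract_job_ids is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: extract_job_ids is a hypothetical helper.
# Reads a vuw-myjobs style listing on stdin and prints one jobID per line.
extract_job_ids() {
  # keep rows whose first column is purely numeric (skips titles and headers)
  awk '$1 ~ /^[0-9]+$/ { print $1 }'
}

# On Raapoi:
#   vuw-myjobs | extract_job_ids
```

Applied to the listing above, it prints 472471 and 472470 on separate lines.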

To delete all your jobs from the queue, where [user] is your username:

scancel -u [user]

Keep checking your job until it has run.

If your job has run and "completed", the .log file should have been copied back to your working directory. Check this file to see if your job was successful. A job can finish and NOT complete successfully, so it is up to you to check each "completed" job to make sure it is ok. The job can "complete" with an error message; see the "checking your job" section if there has been a problem.
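A quick first check can be scripted. Gaussian ends a successful run with a line starting "Normal termination of Gaussian", and ORCA prints "ORCA TERMINATED NORMALLY" near the end of its output. The sketch below looks for either message in the last five lines of a file. The function check_job is a hypothetical helper, and the five-line window is an assumption. This is a first filter only and does not replace reading the output.

```shell
#!/usr/bin/env bash
# Sketch only: check_job is a hypothetical helper.
# $1 = output file (Gaussian .log or ORCA .out)
# Prints "ok", "check" (no normal-termination message found) or "missing".
check_job() {
  local f="$1"
  if [ ! -f "$f" ]; then
    echo "missing"
  elif tail -n 5 "$f" |
       grep -Eq "Normal termination of Gaussian|ORCA TERMINATED NORMALLY"; then
    echo "ok"
  else
    echo "check"
  fi
}

# Example: report on every log file in the current directory
#   for f in *.log; do printf '%s: %s\n' "$f" "$(check_job "$f")"; done
```

Only the end of the file is inspected because a multi-step job can print a normal-termination line for an early step and still fail later.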

You will also find a file with the extension *.stderr. If there has been an error, it will be detailed in this file, along with the resources requested and used by your job. The *.stdout file contains some useful information about your completed job.