Mod:Hunt Research Group/Running jobs on NeSI


Using NeSI

getting started

  • contact Tricia to get an account
  • log in online at my.nesi.org.nz using your VUW credentials
  • configure your 2-factor authentication as described here
  • set up your .config file as described here
  • ssh mahuika and provide the required passwords and you are in!
[tricia@]/Users/tricia $ ssh mahuika
(huntpa@lander.nesi.org.nz) Login Password (First Factor):
(huntpa@lander.nesi.org.nz) Authenticator Code (Second Factor):
(huntpa@login.mahuika.nesi.org.nz) Login Password:
  • you should only need to authenticate once (each time you log in); thereafter, ssh mahuika from new terminals will go straight to your home directory
  • you will want to edit either the .bashrc or .bash_profile to your liking
  • take a look at the slides here
  • general top level support page here
  • information on the partitions here
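The behaviour described above (one authenticated login through lander, later terminals connecting straight away) is what OpenSSH connection sharing provides, and is presumably what the configuration step sets up. The sketch below writes an example client configuration to a scratch file so it can be compared with the NeSI instructions linked above. The two host names are taken from the login prompts shown earlier; the file name nesi_ssh_config.example, the username placeholder, the socket path and the ControlPersist duration are assumptions, and NeSI's own instructions take precedence.

```shell
#!/bin/bash
# Write an EXAMPLE ssh client config to a scratch file (not to ~/.ssh/config).
# Host names come from the login prompts above; everything else is an assumption.
nesi_user="your_nesi_username"          # placeholder, replace with your own
example_file="nesi_ssh_config.example"  # hypothetical file name

cat > "${example_file}" <<EOF
# Jump host: first and second factor are entered here
Host lander
    HostName lander.nesi.org.nz
    User ${nesi_user}

# Login node, reached through the jump host
Host mahuika
    HostName login.mahuika.nesi.org.nz
    User ${nesi_user}
    ProxyJump lander

# Reuse an authenticated connection for later terminals
# (the socket directory must exist: mkdir -p ~/.ssh/sockets)
Host lander mahuika
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 8h
EOF

echo "wrote ${example_file}; review it before merging into ~/.ssh/config"
```

Nothing in ~/.ssh is touched by this script; it only produces a file to read and adapt.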

directories and moving files to and from mahuika

  • directories are
/home/username 20GB
/nesi/project/vuw04056 100GB
/nesi/nobackup/vuw04056 10TB
  • on the local computer
scp <path/filename> mahuika:<path/filename>
  • for example
[tricia@]/Volumes/Tricia_Home/work/jobs/sangoro $ scp water.inp mahuika:.
water.inp        100%  218    13.8KB/s   00:00    
[tricia@]/Volumes/Tricia_Home/work/jobs/sangoro
  • further information (including for Windows) is here
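Copying results back works the same way with the arguments reversed, again run on the local computer. A minimal sketch, assuming the ssh alias mahuika used above; the helper name fetch_from_mahuika and the remote path in the usage line are hypothetical, and the DRY_RUN switch exists only so the command can be inspected before it is run.

```shell
#!/bin/bash
# Copy a file or directory from mahuika to a local destination (default: here).
# With DRY_RUN=1 the scp command is printed instead of executed.
fetch_from_mahuika() {
    local remote_path="$1"
    local local_dest="${2:-.}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "scp -r mahuika:${remote_path} ${local_dest}"
    else
        scp -r "mahuika:${remote_path}" "${local_dest}"
    fi
}

# Example (dry run): show the command for fetching one output file
DRY_RUN=1 fetch_from_mahuika /nesi/nobackup/vuw04056/your_name/water.out
```

Dropping DRY_RUN=1 performs the copy; the -r flag lets the same helper fetch a whole job directory.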

TEST creating a batch script and checking you can run jobs

  • NeSI uses Slurm, like Raapoi, so you will need a batch script
  • test that you can submit with this script
  • copy into run.sh and type sbatch run.sh
#!/bin/bash -e
#SBATCH --job-name=SerialJob # job name (shows up in the queue)
#SBATCH --time=00:01:00      # Walltime (HH:MM:SS)
#SBATCH --mem=512MB          # Memory in MB
#SBATCH --qos=debug          # debug QOS for high priority job tests

pwd # Prints working directory
  • you should see a slurm-*.out file which contains your pwd
huntpa@mahuika01 /home/huntpa $ cat slurm-52598016.out
/home/huntpa
  • check you can run a parallel job
#!/bin/bash -e
#SBATCH --job-name=MPIJob    # job name (shows up in the queue)
#SBATCH --time=00:01:00      # Walltime (HH:MM:SS)
#SBATCH --mem-per-cpu=512MB  # Memory per CPU
#SBATCH --cpus-per-task=4    # 4 CPUs = 2 physical cores per task
#SBATCH --ntasks=2              # number of tasks (e.g. MPI)

srun pwd # Prints working directory
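The submit-and-check steps above can be scripted. A sketch, assuming sbatch's --parsable option (which prints only the job ID) and Slurm's default output file name slurm-<jobid>.out as seen above; the helper names slurm_outfile and submit_and_show are hypothetical.

```shell
#!/bin/bash
# Name of the default Slurm output file for a given job ID.
slurm_outfile() {
    echo "slurm-$1.out"
}

# Submit a script, wait until the job has left the queue, then show its output.
submit_and_show() {
    local script="$1" jobid
    jobid=$(sbatch --parsable "${script}") || return 1
    jobid=${jobid%%;*}                    # strip ";cluster" if present
    echo "submitted job ${jobid}"
    # squeue -h prints nothing once the job is no longer pending or running
    while [ -n "$(squeue -h -j "${jobid}" 2>/dev/null)" ]; do
        sleep 10
    done
    cat "$(slurm_outfile "${jobid}")"
}

# Only try to submit when Slurm is available (i.e. on mahuika)
if command -v sbatch >/dev/null 2>&1 && [ -f run.sh ]; then
    submit_and_show run.sh
fi
```

On a machine without Slurm the script only defines the helpers and does nothing else.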

REAL batch script

  • modify this script to run jobs
  • I call mine runorcaP.sh
  • you should be running jobs in our nobackup directory /nesi/nobackup/vuw04056
  • create your own directory in this folder and run your jobs there
e.g. mine is /nesi/nobackup/vuw04056/tricia
  • copy the completed job files back into our shared project directory /nesi/project/vuw04056/your_name
  • make sure you set maxcore (in the ORCA input file) at about (2/3)*(mem/ntasks)
in the example below there is 1G per task (8G over 8 tasks), so 2/3 is roughly 680MB; I set %maxcore 800 because it's a small job
  • a node has 128 cores and each core has 2 CPUs; in the script we ask for cores
we do NOT want to work across nodes, hence nodes=1
we want ORCA to use 8 or 16 cores, hence ntasks = number of cores
in the sacct analysis you might see 16 or 32 CPUs (because there are 2 CPUs per core)
  • partition selection set on advice from NeSI support
  • if you want to submit more than 50 linked jobs (i.e. same script, different geometry) use an array job
#!/bin/bash -e
#SBATCH --job-name=XX      
#SBATCH --time=05:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem=8G
#SBATCH --error=./workdir_%j/slurm_%j.err
#SBATCH --output=./workdir_%j/XX.out
#SBATCH --partition=milan,large,long,bigmem

echo "slurm job ID: ${SLURM_JOBID}" > ./workdir_${SLURM_JOB_ID}/XX.minfo
echo "start time " >> ./workdir_${SLURM_JOB_ID}/XX.minfo
date >> ./workdir_${SLURM_JOB_ID}/XX.minfo

cd ./workdir_${SLURM_JOB_ID}
cp ${SLURM_SUBMIT_DIR}/XX.inp .

module --quiet purge
module load ORCA/5.0.4-OpenMPI-4.1.5

# ORCA under MPI requires that it be called via its full absolute path
orca_exe=$(which orca)

# Don't use "srun" as ORCA does that itself when launching its MPI process.
${orca_exe} XX.inp

# We are now inside workdir_${SLURM_JOB_ID} (see the cd above), so the
# .minfo file is in the current directory
echo "finish time " >> ./XX.minfo
date >> ./XX.minfo
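The maxcore guideline of about (2/3)*(mem/ntasks) is simple arithmetic. As a sketch (the function name suggest_maxcore is hypothetical), using integer division in MB:

```shell
#!/bin/bash
# Suggest an ORCA %maxcore value (MB per process) as about 2/3 of the
# Slurm memory available to each task.
#   $1 = total job memory in MB (e.g. --mem=8G is 8192)
#   $2 = number of tasks (--ntasks)
suggest_maxcore() {
    local mem_mb="$1" ntasks="$2"
    echo $(( mem_mb * 2 / (3 * ntasks) ))
}

# Values from the script above: --mem=8G, --ntasks=8
suggest_maxcore 8192 8   # prints 682
```

The %maxcore 800 quoted for this example is a little more generous than the guideline, which the notes justify by the job being small.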

checking jobs

  • key commands
squeue            all jobs
squeue --me       my jobs
scancel jobID     kill named job
sacct -X          my jobs run today (since midnight), one line per job
  • we "pay" for NeSI usage so it is important to make sure you are using the system effectively
nn_seff jobID     summary of CPU and memory efficiency
e.g. water input file
huntpa@mahuika01 /home/huntpa $ cat water.inp
!PBE opt numfreq def2-SVP def2/J smallprint TightSCF NoPop xyzfile
%maxcore 2000
%pal nprocs 8 end
%elprop Polar 1 end
* xyz 0 1
O   0.0000   0.0000   0.0626
H  -0.7920   0.0000  -0.4973
H   0.7920   0.0000  -0.4973
*
the water job submit script is below; it was given 10 min and timed out, only just short of completing
huntpa@mahuika01 /home/huntpa $ nn_seff 52598705
Cluster: mahuika
Job ID: 52598705
State: TIMEOUT
Cores: 8
Tasks: 8
Nodes: 2
Job Wall-time:  103.2%  00:10:19 of 00:10:00 time limit
CPU Efficiency:  34.4%  00:28:25 of 01:22:32 core-walltime
Mem Efficiency:   1.8%  590.79 MB (0.00 MB to 100.31 MB / task) of 32.00 GB (4.00 GB/task)
  • check the memory usage after an example job
huntpa@mahuika01 /home/huntpa/workdir_52598705 $ grep -i 'Memory' water.out
   Shared memory     :  Shared parallel matrices
Maximum memory used throughout the entire GTOINT-calculation: 8.7 MB
Maximum memory used throughout the entire SCF-calculation: 6.2 MB
Maximum memory used throughout the entire SCFGRAD-calculation: 4.9 MB
Maximum memory used throughout the entire GTOINT-calculation: 8.7 MB
Maximum memory used throughout the entire SCF-calculation: 6.2 MB
Maximum memory used throughout the entire SCFGRAD-calculation: 5.0 MB
Maximum memory used throughout the entire GTOINT-calculation: 8.7 MB
Maximum memory used throughout the entire SCF-calculation: 6.2 MB
Maximum memory used throughout the entire SCFGRAD-calculation: 4.9 MB
Maximum memory used throughout the entire GTOINT-calculation: 8.7 MB
Maximum memory used throughout the entire SCF-calculation: 6.2 MB
Maximum memory used throughout the entire SCFGRAD-calculation: 5.1 MB
Maximum memory used throughout the entire GTOINT-calculation: 8.7 MB
Maximum memory used throughout the entire SCF-calculation: 6.2 MB
Memory available               ... 1996.8 MB
Memory needed per perturbation ...   0.0 MB
  • for a breakdown by job step use

sacct --format="JobID,JobName,Elapsed,AveCPU,MinCPU,TotalCPU,Alloc,NTask,MaxRSS,State" -j <jobid>

huntpa@mahuika01 /home/huntpa $ sacct --format="JobID,JobName,Elapsed,AveCPU,MinCPU,TotalCPU,Alloc,NTask,MaxRSS,State" -j 52598705
JobID           JobName    Elapsed     AveCPU     MinCPU   TotalCPU  AllocCPUS   NTasks     MaxRSS      State 
------------ ---------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- 
52598705          water   00:10:19                        28:24.740         16                        TIMEOUT 
52598705.ba+      batch   00:10:20   00:00:03   00:00:03  00:02.601         12        1     87444K  CANCELLED 
52598705.ex+     extern   00:10:19   00:00:00   00:00:00  00:00.001         16        2          0  COMPLETED 
52598705.0   orca_gtoi+   00:00:18   00:00:06   00:00:04  00:56.724         16        8    101592K  COMPLETED 
52598705.1   orca_scf_+   00:02:27   00:00:43   00:00:26  05:51.947         16        8    100872K  COMPLETED 
52598705.2   orca_scfg+   00:00:06   00:00:04   00:00:03  00:35.225         16        8     98876K  COMPLETED 

... and more
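The CPU efficiency that nn_seff reports is the CPU time actually used divided by the core-walltime (cores × elapsed time). A sketch that reproduces the figure above with integer arithmetic; the helper names to_seconds and cpu_efficiency are hypothetical, and times are assumed to be in plain HH:MM:SS form (no days field).

```shell
#!/bin/bash
# Convert HH:MM:SS to seconds (10# forces base 10 so "08" is not read as octal).
to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# CPU efficiency in percent, truncated to one decimal place.
#   $1 = CPU time used, $2 = elapsed wall time, $3 = number of cores
cpu_efficiency() {
    local used wall cores tenths
    used=$(to_seconds "$1")
    wall=$(to_seconds "$2")
    cores="$3"
    tenths=$(( used * 1000 / (wall * cores) ))
    echo "$(( tenths / 10 )).$(( tenths % 10 ))%"
}

# Values from the nn_seff output above: 00:28:25 used, 00:10:19 elapsed, 8 cores
cpu_efficiency 00:28:25 00:10:19 8   # prints 34.4%
```

Here 8 cores × 619 s gives the 01:22:32 core-walltime shown by nn_seff, of which only about a third was spent computing.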

check our usage

  • check core usage by the group

nn_corehour_usage vuw04056

huntpa@mahuika01 /home/huntpa $ nn_corehour_usage vuw04056

Note: Fair Share rankings will only be shown for the current cluster, mahuika.

Project vuw04056
================

Project vuw04056 on the mahuika cluster
---------------------------------------
Fair share score on mahuika: 0.998855 out of 1.0
Ranked 158th of 622 active projects (behind 25.24% of active projects)

Usage period                               CPU core hours P100 GPU device hours A100 GPU device hours GB-hours of RAM Compute units
------------                               -------------- --------------------- --------------------- --------------- -------------
2025-01-14T15:00:00 to 2025-01-15T15:00:00              0                     0                     0               0             0

running an orca job

  • note you cannot use Gaussian or Gaussview on NeSI
  • we will be using ORCA
  • check that a code is installed and available using

module spider "code"

  • for example module spider orca
----------------------------------------------------------------------------------
  ORCA:
----------------------------------------------------------------------------------
    Description:
      ORCA is a flexible, efficient and easy-to-use general purpose tool for quantum chemistry with specific
      emphasis on spectroscopic properties of open-shell molecules. It features a wide variety of standard quantum
      chemical methods ranging from semiempirical methods to DFT to single- and multireference correlated ab initio
      methods. It can also treat environmental and relativistic effects. 

     Versions:
        ORCA/4.0.1-OpenMPI-2.0.2
        ORCA/4.2.1-OpenMPI-3.1.4
        ORCA/5.0.1-OpenMPI-4.1.1
        ORCA/5.0.3-OpenMPI-4.1.1
        ORCA/5.0.4-OpenMPI-4.1.5

----------------------------------------------------------------------------------
  For detailed information about a specific "ORCA" module (including how to load the modules) use the module's full name.
  For example:

     $ module spider ORCA/5.0.4-OpenMPI-4.1.5
----------------------------------------------------------------------------------
  • here is a very basic submit script
#!/bin/bash -e
#SBATCH --job-name=water      
#SBATCH --time=00:10:00
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=2G
#SBATCH --error=./workdir_%j/slurm_%j.err
#SBATCH --output=./workdir_%j/water.out

echo "slurm job ID: ${SLURM_JOBID}" > ./workdir_${SLURM_JOB_ID}/water.minfo
echo "start time " >> ./workdir_${SLURM_JOB_ID}/water.minfo
date >> ./workdir_${SLURM_JOB_ID}/water.minfo

cd ./workdir_${SLURM_JOB_ID}
cp ${SLURM_SUBMIT_DIR}/water.inp .

module --quiet purge
module load ORCA/5.0.4-OpenMPI-4.1.5

# ORCA under MPI requires that it be called via its full absolute path
orca_exe=$(which orca)

# Don't use "srun" as ORCA does that itself when launching its MPI process.
${orca_exe} water.inp

# We are now inside workdir_${SLURM_JOB_ID} (see the cd above), so the
# .minfo file is in the current directory
echo "finish time " >> ./water.minfo
date >> ./water.minfo
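For many linked jobs (same script, different geometry) the batch-script section recommends an array job. A minimal sketch of how one array task could pick its own input file, assuming the inputs are the *.inp files in the submit directory and that the job is submitted with sbatch's --array option (for example --array=1-60). The helper name pick_input is hypothetical, and the ORCA call itself would stay as in the script above.

```shell
#!/bin/bash
# Print the N-th (1-based) .inp file in a directory, in sorted order.
pick_input() {
    local dir="$1" index="$2"
    local files=("${dir}"/*.inp)          # glob results are sorted by name
    [ -e "${files[0]}" ] || return 1      # no .inp files found
    [ "${index}" -ge 1 ] && [ "${index}" -le "${#files[@]}" ] || return 1
    basename "${files[$(( index - 1 ))]}"
}

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 1 elsewhere.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
submit_dir="${SLURM_SUBMIT_DIR:-.}"

if input=$(pick_input "${submit_dir}" "${task_id}"); then
    echo "array task ${task_id} would run ORCA on ${input}"
else
    echo "no input file for array task ${task_id}"
fi
```

Each array task gets its own job ID, so the workdir_%j pattern used above still gives one working directory per input.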