GPAW on CCI (AiMOS)

First, you will need to get a CCI account. Most of the information you need to get your account set up on AiMOS is on the CCI Wiki: https://secure.cci.rpi.edu/wiki/

Notice that access currently requires 2FA through Google Authenticator.

The purpose of this document is to help you set up GPAW on the cluster once you have an account, since you are bound to run into numerous issues that you likely have not encountered on other clusters at RPI. It assumes you are already familiar with ssh, Linux, and Slurm.

(Important: the next few items are all taken from the CCI Wiki (https://secure.cci.rpi.edu/wiki/) and are copied here mainly to point out which specific bits of information matter. I will link to the specific pages where each item is found; please check those pages to verify that this information is up to date.)

DCS Cluster

You will be using the DCS (AiMOS) cluster. https://secure.cci.rpi.edu/wiki/clusters/DCS_Supercomputer/

# on one of the landing pads...
ssh dcsfen01  # or dcsfen02

This cluster uses a PowerPC64 Little Endian (ppc64le) architecture, unlike the landing pads and most other clusters, which are x86-64. This means binaries and libraries that work on the landing pad cannot run on DCS, and vice versa, which is a tad annoying since all of your data (including your home directory and .bashrc) is shared between all machines.

You don't have to do anything about this (though you can if you want to; e.g. I installed conda to an architecture-specific directory). Just keep this in mind and try to avoid doing things while on the landing pad (i.e. get on dcsfen01 ASAP).
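
For example, here is a minimal sketch of one way to keep per-architecture installs separate in your .bashrc (the directory layout is just my illustration, not anything CCI prescribes):

# Pick an architecture-specific prefix so ppc64le (DCS) and x86-64 (landing pad)
# installs never mix.  The "pkg" layout here is an arbitrary choice.
ARCH=$(uname -m)                        # "ppc64le" on DCS, "x86_64" on the landing pads
export MY_PREFIX="$HOME/barn/pkg/$ARCH"
export PATH="$MY_PREFIX/bin:$PATH"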

HTTP Proxy

You will need an HTTP proxy to access a couple of URLs for conda and Python packages.

Main page: https://secure.cci.rpi.edu/wiki/landingpads/Proxy/

# Put this in your .bashrc

# Enables access to github, gitlab, pypi, many other whitelisted servers
export http_proxy=http://proxy:8888
export https_proxy=$http_proxy
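
Once that is in place (and you've re-sourced your .bashrc), a quick sanity check could look like the following; this is just standard wget behavior, nothing CCI-specific:

source ~/.bashrc
# wget and pip both honor http_proxy/https_proxy; if this hangs or fails,
# the proxy variables are probably not set in your current shell.
wget -q --spider https://pypi.org && echo "proxy OK"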

Modules

You should generally just ignore the compilers and MPI implementations available through the module system on DCS (i.e. don't bother with module load).

This is because you will be using conda, which tries to operate like a little walled garden; things can go awry when it is not in full control of your compiler suite.
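
If you want to be sure nothing from the module system is interfering, something like this should do it (standard Environment Modules commands, assuming module is available in your shell):

module list     # see what is currently loaded
module purge    # unload everything, so conda's compilers/MPI are the only ones on PATH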

Install GPAW

Installing GPAW on AiMOS is actually quite an exercise, thanks to the PowerPC64 architecture. However, you can make this easier by using conda, which provides pre-built packages for things like image libraries and libxc (which gpaw depends on).

Install Conda

As recommended by https://secure.cci.rpi.edu/wiki/software/Conda/, make absolutely sure to install it somewhere in ~/barn/!!! (otherwise you will run afoul of draconian space limits on /home)

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-ppc64le.sh
bash Miniconda3-latest-Linux-ppc64le.sh -p ~/barn/miniconda3
# accept the license
# answer 'yes' when it asks to run conda init
source ~/.bashrc

(if the wget fails, see the "HTTP proxy" step above)

NOTE: The bottom of the linked CCI page mentions a package cache for conda. Do not bother with it. The packages in that cache are hellishly outdated.

Create a conda environment

Create these two files in a directory where you'd like to work (e.g. somewhere in ~/scratch):

requirements.txt

gpaw==21.6.0
ase==3.22.0

environment.yml

name: gpaw-env
channels:
  - defaults
  - anaconda
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.1
  - pillow=8.3
  - matplotlib=3.4
  - conda-forge::openmpi=4.1
  - conda-forge::c-compiler=1.2
  - conda-forge::compilers=1.2
  - conda-forge::libblas=3.9=*openblas
  - conda-forge::liblapack=3.9=*openblas
  - conda-forge::libxc=5.1
  - pip:
    - -r requirements.txt

Now use these files to initialize a new conda environment:

conda env create --file environment.yml

If this succeeds, you can then do

conda activate gpaw-env

to enable this version of python, along with openmpi.
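
Before going further, it is worth a quick check that python and mpirun actually resolve to the environment rather than to something else earlier on your PATH (the versions below are the ones pinned above):

conda activate gpaw-env
which python3 mpirun        # both should point into .../envs/gpaw-env/bin
python3 -c "import gpaw, ase; print(gpaw.__version__, ase.__version__)"   # expect 21.6.0 / 3.22.0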

GPAW Datasets

The HTTP proxy does not enable access to the gpaw datasets.

(you can't even use gpaw install-data --register; it still tries to access the internet for something)

I've given the whole group read permission on my copy, so anybody in the CMND project can just add this to ~/.bashrc:

LAMPAM_ROOT=/gpfs/u/barn/CMND/shared/lampam
export GPAW_SETUP_PATH=$LAMPAM_ROOT/share/gpaw/datasets/gpaw-setups-0.9.20000
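
If you are not in the CMND project (or want your own copy), one workaround is to download the setups tarball on a machine that does have internet access and copy it over yourself. Roughly like this, where the username and landing-pad hostname are placeholders you must fill in:

# On a machine with internet access: grab gpaw-setups-0.9.20000.tar.gz from the
# GPAW website, then copy it to the cluster (placeholders below).
scp gpaw-setups-0.9.20000.tar.gz YOUR_USER@LANDING_PAD:~/barn/

# On the cluster:
mkdir -p ~/barn/gpaw-datasets
tar -xzf ~/barn/gpaw-setups-0.9.20000.tar.gz -C ~/barn/gpaw-datasets
export GPAW_SETUP_PATH=~/barn/gpaw-datasets/gpaw-setups-0.9.20000   # put this in ~/.bashrc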

Other deps

If you later find that you are missing some additional Python packages, you can install them with the following (note: the environment must be activated):

python3 -m pip install some-package-name

Running

Key notes:

  • The conda installation of openmpi does not work with the gpaw wrapper script (e.g. gpaw -P4). You must use mpirun python instead.
  • Each node of DCS has 160 logical threads. You do not need to use all of them. (multiple jobs can share a node)
  • DCS requires the --mem-per-cpu and --gres args to sbatch. The --gres=gpu:1 argument is because you are required to claim at least one GPU even if you are not using it.

Here is a quick test script.

script.py

from ase import Atoms
from gpaw import GPAW, PW
from ase.parallel import parprint
h2 = Atoms('H2', [(0, 0, 0), (0, 0, 0.74)])
h2.center(vacuum=2.5)
h2.calc = GPAW(xc='PBE', mode=PW(300), txt='h2.txt')

parprint(h2.get_potential_energy())
parprint(h2.get_forces())

job.sbatch

#!/bin/bash

#SBATCH --job-name=h2test
#SBATCH --time=120
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --gres=gpu:1

mpirun -n $SLURM_NTASKS python3 script.py

Verify that this script works:

$ sbatch job.sbatch
Submitted batch job 1320681
$ cat slurm-1320681.out
-6.631111781412142
[[ 0.          0.         -0.63920606]
 [ 0.          0.          0.63920606]]
$ cat h2.txt

  ___ ___ ___ _ _ _
 |   |   |_  | | | |
 | | | | | . | | | |
 |__ |  _|___|_____|  21.6.0
 |___|_|

User:   CMNDlmpm@dcs223
Date:   Fri Oct 22 16:57:56 2021
Arch:   ppc64le
Pid:    2864505
Python: 3.8.12
gpaw:   /gpfs/u/barn/CMND/shared/lampam/pkg/ppc64le/conda/4.10.3/envs/aaaaa/lib/python3.8/site-packages/gpaw
_gpaw:  /gpfs/u/barn/CMND/shared/lampam/pkg/ppc64le/conda/4.10.3/envs/aaaaa/lib/python3.8/site-packages/
        _gpaw.cpython-38-powerpc64le-linux-gnu.so
ase:    /gpfs/u/barn/CMND/shared/lampam/pkg/ppc64le/conda/4.10.3/envs/aaaaa/lib/python3.8/site-packages/ase (version 3.22.0)
numpy:  /gpfs/u/barn/CMND/shared/lampam/pkg/ppc64le/conda/4.10.3/envs/aaaaa/lib/python3.8/site-packages/numpy (version 1.20.3)
scipy:  /gpfs/u/barn/CMND/shared/lampam/pkg/ppc64le/conda/4.10.3/envs/aaaaa/lib/python3.8/site-packages/scipy (version 1.7.1)
libxc:  5.1.5
units:  Angstrom and eV
cores: 16
OpenMP: False
OMP_NUM_THREADS: 1

Input parameters:
  mode: {ecut: 300.0,
         gammacentered: False,
         name: pw}
  xc: PBE

System changes: positions, numbers, cell, pbc, initial_charges, initial_magmoms

Initialize ...

 (...snip...)

Notes on parallelism

The conda environment above uses a version of gpaw that was built without OpenMP and without ScaLAPACK. ScaLAPACK in particular is greatly desirable for large LCAO computations, so someone may want to look into making this work...
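
If you (or whoever picks this up) want to check what a given build supports, the gpaw CLI has an info subcommand; run it inside the activated environment and look for the ScaLAPACK/OpenMP entries:

conda activate gpaw-env
gpaw info    # lists what this gpaw build was compiled with (scalapack, OpenMP, MPI, libxc, ...)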
