sean-smith/hpcg.md

## hpcg.md

      
    AWS ParallelCluster + AWS Batch

Today I'm going to demonstrate running the High Performance Conjugate Gradients (HPCG) benchmark as a containerized workload. This takes advantage of AWS ParallelCluster, AWS Batch, and OpenMPI.
First install aws-parallelcluster:
$ pip install aws-parallelcluster
Edit the ParallelCluster config file to include the awsbatch cluster configuration:
$ vim ~/.parallelcluster/config
Add the following to this file. You'll need a public and a private subnet; see Public Private Networking for instructions on how to set that up.
[global]
update_check = true
sanity_check = true
cluster_template = awsbatch

[aws]
aws_region_name = us-east-1

[cluster awsbatch]
scheduler = awsbatch
key_name = [your key]
min_vcpus = 72
desired_vcpus = 72
max_vcpus = 288
vpc_settings = public-private
master_instance_type = c5.xlarge
compute_instance_type = c5n.18xlarge

[vpc public-private]
vpc_id = vpc-00d2e489741609bc2
master_subnet_id = subnet-0152608e422c75189
compute_subnet_id = subnet-0baadf9781f59a6a1
Now, create the cluster:
$ pcluster create hpcg
Creating stack named: parallelcluster-hpcg
Status: parallelcluster-hpcg - CREATE_COMPLETE
ClusterUser: ec2-user
MasterPublicIP: 54.35.249.0
MasterPrivateIP: 10.0.0.35
Once that's completed, SSH in. You may have to specify the key path with the -i flag if you're not using a default key.
$ pcluster ssh hpcg -i ~/.ssh/id_rsa
Running awsbhosts shows you the hosts that are running:
[ec2-user@ip-10-0-0-182 ~]$ awsbhosts
ec2InstanceId        instanceType    privateIpAddress    publicIpAddress      runningJobs
-------------------  --------------  ------------------  -----------------  -------------
i-07148c539c09ae9b8  c5n.18xlarge    10.0.1.171          -                              0
You can see there's one c5n.18xlarge instance running. This is because we set min_vcpus = 72; had we set min_vcpus = 0, there would be no hosts running.
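The relationship between the vCPU settings and the number of hosts can be sketched as a quick calculation. This is only an illustration: it assumes a c5n.18xlarge exposes 72 vCPUs and that the compute environment uses that single instance type, and the helper name `instances_for_vcpus` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: how many instances does a vCPU count imply?
# Assumption: 72 vCPUs per c5n.18xlarge.
VCPUS_PER_INSTANCE=72

instances_for_vcpus() {
    # Round up to whole instances
    echo $(( ($1 + VCPUS_PER_INSTANCE - 1) / VCPUS_PER_INSTANCE ))
}

echo "min_vcpus=72  -> $(instances_for_vcpus 72) instance(s)"
echo "max_vcpus=288 -> $(instances_for_vcpus 288) instance(s)"
```

With the config above, this suggests the cluster idles at one instance and can grow to four.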
Now let's run through a basic hello world example to demonstrate how it works:
https://aws-parallelcluster.readthedocs.io/en/latest/tutorials/03_batch_mpi.html
Now, on the master instance, clone the parallelcluster repo:
$ git clone https://github.com/aws/aws-parallelcluster.git
$ cd aws-parallelcluster/cli/pcluster/resources/batch/docker/
Create a Makefile with the following contents:
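# Makefile
# Note: recipe lines must be indented with a tab character.
distro=alinux
uri=[URI from ECR console]

.PHONY: build tag push

# Build the ParallelCluster base image, then the HPCG image on top of it
build:
	docker build -f $(distro)/Dockerfile -t pcluster-$(distro) .
	docker build -t $(uri) .

tag:
	docker tag $(uri) $(uri):$(distro)

push: build tag
	docker push $(uri):$(distro)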
To get that URI, go to the ECR Console and find a repository with a name similar to paral-docke-t6ayh0ia49nm (you can sort by creation time).

Grab that URI. It should look like: 112850485306.dkr.ecr.us-east-1.amazonaws.com/paral-docke-t6ajh0ia39nm
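To make the structure of that URI concrete, here is a small sketch that splits it into its parts. It assumes the usual `<account>.dkr.ecr.<region>.amazonaws.com/<repository>` layout; the variable names are chosen for this example only.

```shell
#!/bin/sh
# Split an ECR repository URI of the assumed form
#   <account>.dkr.ecr.<region>.amazonaws.com/<repository>
uri="112850485306.dkr.ecr.us-east-1.amazonaws.com/paral-docke-t6ajh0ia39nm"

registry=${uri%%/*}                          # everything before the first '/'
repository=${uri#*/}                         # everything after the first '/'
account=${registry%%.*}                      # first dot-separated field
region=$(echo "$registry" | cut -d. -f4)     # fourth dot-separated field

echo "registry:   $registry"
echo "repository: $repository"
echo "account:    $account"
echo "region:     $region"
```

The region embedded in the URI should match the region you use when logging in to ECR.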
Install Docker:
$ sudo yum install -y docker
$ sudo service docker start
Add the AmazonEC2ContainerRegistryFullAccess IAM policy to the IAM role attached to the master EC2 instance.
Now, create a Dockerfile with the following contents:
FROM pcluster-alinux:latest

# Set the working directory to /work
WORKDIR /work

# Copy the current directory contents into the container at /work
COPY . /work
ENV PATH=$PATH:/usr/lib64/openmpi/bin/

# Install the build tools and OpenMPI
RUN yum -y install awscli wget unzip gzip tar gcc gcc-c++ make
RUN yum -y install openmpi openmpi-devel
RUN yum groupinstall "Development Tools" -y

# Download HPCG and build it in /work (the binary ends up in /work/bin/xhpcg)
RUN wget https://github.com/hpcg-benchmark/hpcg/archive/master.zip

RUN unzip master.zip
RUN hpcg-master/configure Linux_MPI
RUN make
RUN chmod 755 /work/run.s

# Define default environment variables for the benchmark case
ENV INSTANCETYPE c5n.18xlarge
ENV CASE_CORES 36
ENV CASE_NAME run1
ENV CASE_SIZE 16
ENV CASE_TIME 20


ENTRYPOINT ["/parallelcluster/bin/entrypoint.sh"]
Then create a file run.s in the same directory with the following contents:
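#!/bin/sh

# Print the case parameters
echo "case time, size and cores"
echo "CASE_NAME, $CASE_NAME"
echo "CASE_TIME, $CASE_TIME"
echo "CASE_SIZE, $CASE_SIZE"
echo "CASE_CORES, $CASE_CORES"

export PATH=.:$PATH
# Disable the vader single-copy mechanism, a common workaround inside Docker containers
export OMPI_MCA_btl_vader_single_copy_mechanism=none

# Run HPCG with a local grid of CASE_SIZE^3 per rank for CASE_TIME seconds
/usr/lib64/openmpi/bin/mpirun --allow-run-as-root -np "$CASE_CORES" -hostfile "${HOME}/hostfile" /work/bin/xhpcg --nx="$CASE_SIZE" --ny="$CASE_SIZE" --nz="$CASE_SIZE" --rt="$CASE_TIME"

# Find the summary line in the HPCG output file
rating_string=$(grep "with a GFLOP/s rating" HPCG*)

# The rating value starts at character 62 of that line
length=${#rating_string}
rating=$(echo "$rating_string" | cut -c62-"$length")

echo "rating=, $rating"
middle="_"
filename=$CASE_NAME$middle$CASE_CORES$middle$CASE_SIZE
# Write a one-line CSV summary of the run and print it
echo "$CASE_NAME, $CASE_CORES, $CASE_SIZE, $CASE_TIME, $rating" > "$filename"
echo "$filename"
cat "$filename"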
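The `cut -c62-` step in run.s relies on the rating starting at a fixed column, which breaks if the summary line ever gains a prefix (for example a file name added by grep). A slightly more tolerant sketch is shown here. It assumes the summary line ends with `rating of=<value>`, which is inferred from the 62-character offset rather than taken from HPCG documentation; the helper name `extract_rating` and the sample value are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read HPCG output on stdin and print whatever
# follows the last '=' on the rating line.
extract_rating() {
    grep "with a GFLOP/s rating" | sed 's/.*=//'
}

# Made-up sample line in the assumed format; the value is illustrative only.
sample="Final Summary::HPCG result is VALID with a GFLOP/s rating of=12.3456"

echo "$sample" | extract_rating
```

In run.s this could stand in for the `length`/`cut` pair, e.g. `rating=$(cat HPCG* | extract_rating)`.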
Build and push the image with:
$ $(aws ecr get-login --no-include-email --region us-east-1) # login w/ ecr
$ make push
Now you can submit an HPCG run like:
$ awsbsub -e CASE_CORES=36 -n 2 -jn hpcg /work/run.s
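HPCG's --nx, --ny and --nz options set the local grid owned by each MPI rank, so the global problem grows with the number of ranks. The arithmetic can be sketched as follows; the helper name `global_rows` is hypothetical and only serves this example.

```shell
#!/bin/sh
# Hypothetical helper: global number of grid points (matrix rows) for a run
# with <cores> MPI ranks, each owning a <size> x <size> x <size> local grid.
global_rows() {
    cores=$1
    size=$2
    echo $(( cores * size * size * size ))
}

echo "CASE_CORES=36, CASE_SIZE=16 -> $(global_rows 36 16) rows"
# prints: CASE_CORES=36, CASE_SIZE=16 -> 147456 rows
```

The default CASE_SIZE of 16 therefore gives a small problem that is mainly useful for checking that the setup works.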
Watch the job to see when it transitions into running:
$ watch awsbstat
...
jobId                                 jobName    status    startedAt    stoppedAt    exitCode
------------------------------------  ---------  --------  -----------  -----------  ----------
222e21bb-a955-42c8-a45a-6d195db740b6  hpcg       RUNNABLE  -            -            -
Once it transitions to RUNNING, get the output with:
$ awsbout 222e21bb-a955-42c8-a45a-6d195db740b6
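If you would rather script the wait than keep `watch` open, the status column can be pulled out of the awsbstat table. This is a minimal sketch that assumes the column layout shown in the sample output above, which may differ between ParallelCluster versions; the helper name `job_status` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read an awsbstat-style table on stdin and print the
# status (third column) of the row whose first column is the given job ID.
job_status() {
    awk -v id="$1" '$1 == id { print $3 }'
}

# Sample table copied from the awsbstat output above
table='jobId                                 jobName    status    startedAt    stoppedAt    exitCode
------------------------------------  ---------  --------  -----------  -----------  ----------
222e21bb-a955-42c8-a45a-6d195db740b6  hpcg       RUNNABLE  -            -            -'

echo "$table" | job_status 222e21bb-a955-42c8-a45a-6d195db740b6
```

On the master instance you could pipe the live output instead, e.g. `awsbstat | job_status <jobId>`, and poll in a loop until the status is no longer RUNNABLE.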