Intel Colfax Cluster - How to run an application in High Bandwidth Memory (HBM) mode on a Xeon Phi (Knights Landing) enabled Cluster Node

Intel Colfax Cluster - Notes - Index Page


Introduction

The Xeon Phi (Knights Landing / KNL) host processor gives us a way to take advantage of its High Bandwidth Memory (HBM) architecture to speed up applications. This post illustrates an example of taking advantage of the Multi-Channel Dynamic RAM (MCDRAM), which has a bandwidth of up to 480 GB/s, roughly 5 times higher than the DDR4RAM bandwidth of up to 90 GB/s. One thing to be aware of: the (faster) MCDRAM offers only up to 16 GiB of storage, whereas the slower DDR4RAM offers up to 384 GiB. If our application fits within 16 GiB of memory, however, it may run faster on the (small but high-bandwidth) MCDRAM than on the (big but lower-bandwidth) DDR4RAM.

This post illustrates an example "Hello world" C++ application that runs on MCDRAM, with the aid of hbwmalloc (the high bandwidth memory interface), part of the memkind library.

Xeon Phi (Knights Landing / KNL) Architecture Revisited

Let's revisit the KNL architecture. Borrowing the slides from the Colfax Research How-to Deep Dive Series, the following diagram shows the bootable KNL processor memory organization:

knl-numa-1.png

Notice a couple of things:

  • DDR4RAM: up to 384 GiB storage, and up to 90 GB/s bandwidth (Stream)
  • MCDRAM: up to 16 GiB storage, and 480 GB/s bandwidth (Stream)

This gives us an idea: if our application fits within 16 GiB of memory, we could potentially use MCDRAM, which has ~5 times the bandwidth of DDR4RAM.


Important notes:

  • 1 GB = 10^9 bytes = 1,000,000,000 bytes
  • 1 GiB = 2^30 bytes = 1,073,741,824 bytes

The diagram uses GB/s (gigabytes per second) for memory bandwidth and GiB for storage. Though the actual numeric values turn out to be somewhat similar, there is a difference between GB and GiB. See GB to GiB conversion for more on GB vs GiB.
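To see why the two sets of numbers are "somewhat similar" but not identical, here is a throwaway C++ snippet (illustrative only) converting MCDRAM's 16 GiB into decimal GB:

#include <cstdio>

int main() {
  // Unit sanity check: express MCDRAM's 16 GiB in decimal GB.
  const double bytes_per_GiB = 1024.0 * 1024.0 * 1024.0;  // 2^30 bytes
  const double bytes_per_GB  = 1.0e9;                     // 10^9 bytes
  printf("16 GiB = %.2f GB\n", 16.0 * bytes_per_GiB / bytes_per_GB);
  return 0;
}

It prints 16 GiB = 17.18 GB: the binary unit is about 7% larger than the decimal one.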


The following diagram shows the High Bandwidth Memory Modes (Flat / Cache / Hybrid):

knl-numa-2.png

At the time of writing, all the KNL nodes that I could "see" are configured in "Flat" mode. This post focuses on Flat mode, in which the total resource available is made up of NUMA node 0 (CPU + DDR4RAM) and NUMA node 1 (no CPU, MCDRAM only).
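Before committing to Flat-mode MCDRAM, we can ask memkind whether high-bandwidth memory is actually visible on the node we landed on. The hbwmalloc interface provides hbw_check_available() for this. A minimal sketch (compile it as in Step 2 below, with -lmemkind):

#include <cstdio>
#include <hbwmalloc.h>

int main() {
  // hbw_check_available() returns 0 when high-bandwidth memory
  // (MCDRAM in flat or hybrid mode) is detected on this node.
  if (hbw_check_available() == 0)
    printf("MCDRAM detected: hbw_malloc() can allocate from NUMA node 1.\n");
  else
    printf("No MCDRAM detected: hbw_malloc() would fall back to DDR4RAM.\n");
  return 0;
}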

Step 1 - Create C++ Code

SSH to the Colfax Cluster login node.

Navigate to a working directory of your choice. In my case: /home/u4443/deepdive/lec-02 (tweak as you wish). We call this our "working directory" from now on.

Create a C++ source file hello-mcdram.cc:

#include <cstdlib>
#include <cstdio>
#include <hbwmalloc.h>

int main() {
  printf("hello-mcdram starts.\n");
  const long N = 1L << 33;         // 2^33 bytes = 8 GiB
  char *A = (char*)hbw_malloc(N);  // allocate on MCDRAM (NUMA node 1)
  if (A == NULL) {
    printf("hbw_malloc failed.\n");
    return 1;
  }
  for (long i = 0; i < N; i++)     // touch every byte so pages are committed
    A[i] = 1;
  hbw_free(A);                     // release the MCDRAM allocation
  printf("hello-mcdram ends.\n");
  return 0;
}

Notice what we do in this C++ code to enable the program to run on MCDRAM:

  • we include the hbwmalloc header with #include <hbwmalloc.h>
  • we use hbw_malloc() to allocate 8 GiB of memory on the MCDRAM, with the handle (pointer) A. The buffer holds 2 to the power of 33 = 8,589,934,592 characters (i.e. 8 Gi); as each character takes up 1 byte, this comes to 8,589,934,592 bytes = 8 GiB. The bitwise shift 1L << 33 is just a compact way to compute 2^33 (see the quick check after this list).
  • we use a for loop to assign a value to each index of A, which forces every page to actually be committed
  • we release the allocation with hbw_free() once done
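To double-check the shift arithmetic, a throwaway snippet:

#include <cstdio>

int main() {
  // 1L << 33 shifts a 64-bit 1 left by 33 bits, i.e. 2^33.
  const long N = 1L << 33;
  printf("N = %ld bytes = %ld GiB\n", N, N >> 30);  // 8589934592 bytes = 8 GiB
  return 0;
}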

If we instead wanted to use the default DDR4RAM, we could (as sketched below):

  • omit the line #include <hbwmalloc.h>
  • replace hbw_malloc(N) with the standard malloc(N) (and hbw_free() with free())
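For comparison, the plain-DDR4RAM version might look like this (a sketch; the hello-ddr4 name is just for illustration):

#include <cstdlib>
#include <cstdio>

int main() {
  printf("hello-ddr4 starts.\n");
  const long N = 1L << 33;       // 8 GiB, as before
  char *A = (char*)malloc(N);    // standard allocator: lands on DDR4RAM (NUMA node 0)
  if (A == NULL) {
    printf("malloc failed.\n");
    return 1;
  }
  for (long i = 0; i < N; i++)
    A[i] = 1;
  free(A);
  printf("hello-ddr4 ends.\n");
  return 0;
}

It also compiles without -lmemkind, since nothing from memkind is used.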

Step 2 - Compile C++ Code with memkind (on KNL Node)

Now comes the compilation of the code. To compile it we need memkind already installed on the node. The login node doesn't come with memkind pre-installed, so we need to compile on a (Knights Landing) cluster node instead.

Lets "get into" a cluster node with interactive qsub. Use the -I for the interactive option, and -l to select 1 KNL node with flat topology:

[u4443@c001 lec-02]$ qsub -I -l nodes=1:knl:flat
qsub: waiting for job 21134.c001 to start
qsub: job 21134.c001 ready


########################################################################
# Colfax Cluster - https://colfaxresearch.com/
#      Date:           Tue Aug  8 02:34:49 PDT 2017
#    Job ID:           21134.c001
#      User:           u4443
# Resources:           neednodes=1:knl:flat,nodes=1:knl:flat,walltime=24:00:00
########################################################################

[u4443@c001-n029 ~]$

Now that we are on the cluster node, navigate back to the working directory where the C++ code is stored:

[u4443@c001-n029 ~]$ cd deepdive/lec-02/

[u4443@c001-n029 lec-02]$ pwd
/home/u4443/deepdive/lec-02

We compile the C++ code hello-mcdram.cc with the Intel C++ compiler (icpc), along with the -lmemkind option (required for MCDRAM-related code; otherwise it may be omitted). This outputs a binary executable hello-mcdram:

[u4443@c001-n029 lec-02]$ icpc -o hello-mcdram hello-mcdram.cc -lmemkind

We have now created an executable binary hello-mcdram. We can now log out and get back to the login node:

[u4443@c001-n029 lec-02]$ logout

qsub: job 21134.c001 completed
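Incidentally, if we later want a single source file that can build either the MCDRAM or the DDR4RAM variant, one option is a preprocessor guard. This is just a sketch; the USE_HBW macro and the MY_MALLOC / MY_FREE names are mine, not part of memkind:

#include <cstdlib>
#include <cstdio>
#ifdef USE_HBW
#include <hbwmalloc.h>
#define MY_MALLOC hbw_malloc  // MCDRAM allocator
#define MY_FREE   hbw_free
#else
#define MY_MALLOC malloc      // default DDR4RAM allocator
#define MY_FREE   free
#endif

int main() {
  printf("hello starts.\n");
  const long N = 1L << 33;  // 8 GiB
  char *A = (char*)MY_MALLOC(N);
  for (long i = 0; i < N; i++)
    A[i] = 1;
  MY_FREE(A);
  printf("hello ends.\n");
  return 0;
}

With this, icpc -DUSE_HBW -o hello hello.cc -lmemkind builds the MCDRAM variant, while plain icpc -o hello hello.cc builds the DDR4RAM one.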

Step 3 - Create a Shell Script

Now write a shell script that will invoke this binary. Call it hello-mcdram.sh (tweak the working directory in the script to wherever the newly compiled binary is stored). A disclaimer: this script is likely not the most elegant, but it helps us visualize MCDRAM usage easily.

#PBS -l nodes=1:flat:knl

echo "Yo. This job is running on compute node "`hostname`
cd /home/u4443/deepdive/lec-02
echo "run hello-mcdram now and do numastat."
./hello-mcdram &
# take 20 numastat snapshots of the background job; $! is its PID
for i in $(seq 1 20); do
  numastat -p $!
done
echo "sleep for 5 seconds"
sleep 5s
numastat -p $!
numastat -p $!

A bit of explanation regarding this shell script:

  • If you recall from the Colfax getting-started documentation, the first line #PBS -l nodes=1:flat:knl asks the Colfax Cluster to run our shell script on 1 node, and that node needs to be a KNL node with flat topology.
  • The first echo prints out the KNL node that our script actually runs on. It can be handy for understanding and debugging.
  • We navigate to the directory where the binary hello-mcdram is stored.
  • We execute the binary hello-mcdram in the background, hence the & at the end.
  • We then repeatedly run numastat -p $!. Here, $! expands to the PID of the most recently started background job, i.e. our hello-mcdram process.
  • We loop numastat -p $! twenty times to give ourselves a better chance of capturing and visualizing NUMA node 0 (DDR4RAM) and node 1 (MCDRAM) memory usage while the job runs. We hope to see an increase in NUMA node 1 (MCDRAM) memory usage.
  • We issue a sleep command to give the background job hello-mcdram a better chance to finish. (If we don't sleep, we may not see any output from the printf calls in the code.) There are probably better ways to handle this than abusing sleep (the shell's wait builtin, for instance), but for this illustration the result suggests sleep is fit for purpose.

Step 4 - Run Shell Script on KNL Node via QSUB

Ensure we are in our working directory. We can now submit the shell script to run on a cluster node via qsub:

[u4443@c001 lec-02]$ qsub hello-mcdram.sh -l nodes=1:knl:flat
21137.c001

Note: if we were not in our working directory, we would have to submit the shell script with its full path, like this (your working directory may differ):

[u4443@c001 lec-02]$ qsub /home/u4443/deepdive/lec-02/hello-mcdram.sh -l nodes=1:knl:flat

Back in our working directory, view the output file hello-mcdram.sh.o21137 (change the job ID to the one you get). Note the increase of the Node 1 (MCDRAM) "Private" row over time. Towards the end we also see our C++ code start and finish.

[u4443@c001 lec-02]$ cat hello-mcdram.sh.o21137

########################################################################
# Colfax Cluster - https://colfaxresearch.com/
#      Date:           Tue Aug  8 03:00:23 PDT 2017
#    Job ID:           21137.c001
#      User:           u4443
# Resources:           neednodes=1:knl:flat,nodes=1:knl:flat,walltime=24:00:00
########################################################################

Yo. This job is running on compute node c001-n029
run hello-mcdram now and do numastat.

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            0.00            0.00
Stack                        0.02            0.00            0.02
Private                      0.20            0.00            0.20
----------------  --------------- --------------- ---------------
Total                        0.22            0.00            0.22

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            0.00            0.00
Stack                        0.02            0.00            0.02
Private                      1.15            0.00            1.15
----------------  --------------- --------------- ---------------
Total                        1.17            0.00            1.17

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                      5.40            0.00            5.40
----------------  --------------- --------------- ---------------
Total                        5.44            0.00            5.44

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                      9.44            0.00            9.44
----------------  --------------- --------------- ---------------
Total                        9.48            0.00            9.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45           14.00           31.45
----------------  --------------- --------------- ---------------
Total                       17.48           14.00           31.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45           34.00           51.45
----------------  --------------- --------------- ---------------
Total                       17.48           34.00           51.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45           54.00           71.45
----------------  --------------- --------------- ---------------
Total                       17.48           54.00           71.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45           80.00           97.45
----------------  --------------- --------------- ---------------
Total                       17.48           80.00           97.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45          112.00          129.45
----------------  --------------- --------------- ---------------
Total                       17.48          112.00          129.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45          142.00          159.45
----------------  --------------- --------------- ---------------
Total                       17.48          142.00          159.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45          176.00          193.45
----------------  --------------- --------------- ---------------
Total                       17.48          176.00          193.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45          206.00          223.45
----------------  --------------- --------------- ---------------
Total                       17.48          206.00          223.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45          236.00          253.45
----------------  --------------- --------------- ---------------
Total                       17.48          236.00          253.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45          268.00          285.45
----------------  --------------- --------------- ---------------
Total                       17.48          268.00          285.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45          288.00          305.45
----------------  --------------- --------------- ---------------
Total                       17.48          288.00          305.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45          310.00          327.45
----------------  --------------- --------------- ---------------
Total                       17.48          310.00          327.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45          330.00          347.45
----------------  --------------- --------------- ---------------
Total                       17.48          330.00          347.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45          352.00          369.45
----------------  --------------- --------------- ---------------
Total                       17.48          352.00          369.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45          372.00          389.45
----------------  --------------- --------------- ---------------
Total                       17.48          372.00          389.48

Per-node process memory usage (in MBs) for PID 98761 (hello-mcdram)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.02            0.00            0.02
Stack                        0.02            0.00            0.02
Private                     17.45          394.00          411.45
----------------  --------------- --------------- ---------------
Total                       17.48          394.00          411.48
sleep for 5 seconds
hello-mcdram starts.
hello-mcdram ends.

Per-node process memory usage (in MBs) for PID 98761 ((null))

Per-node process memory usage (in MBs) for PID 98761 ((null))

########################################################################
# Colfax Cluster
# End of output for job 21137.c001
# Date: Tue Aug  8 03:00:29 PDT 2017
########################################################################

[u4443@c001 lec-02]$

As noted above, observe the increase of the Node 1 (MCDRAM) "Private" row over time. Towards the end we also see our C++ code start and finish.


Note to self: I was hoping to see NUMA node 1 (MCDRAM) memory usage increase all the way to 8 GiB before the job finishes. (I wonder how to...)


Note the last two lines in the output:

Per-node process memory usage (in MBs) for PID 98761 ((null))

This is nothing to be alarmed about. Since the C++ code (PID 98761) has already finished running, numastat can, of course, no longer query that PID. In fact, this helps give us a pretty complete picture of the job's lifecycle.

Let's check the error file.

[u4443@c001 lec-02]$ cat hello-mcdram.sh.e21137
Can't read /proc/98761/numa_maps: No such file or directory
Can't read /proc/98761/numa_maps: No such file or directory
[u4443@c001 lec-02]$

The two errors correspond to what we've just discussed regarding the last two lines of the output: PID 98761 has finished running, hence numastat against that PID won't work. Nothing to be alarmed about; this is expected.

This wraps up our MCDRAM programming and visualization!

To MCDRAM or not to MCDRAM

When should we use MCDRAM to run our job, and when should we not? This Colfax slide gives us a rule of thumb:

numa-3.png

Conclusion

In this post we illustrated a simple example of executing a C++ "hello world" binary that takes advantage of the High Bandwidth Memory (HBM) architecture. The Multi-Channel Dynamic RAM (MCDRAM) was used to back the allocation, and a visualization showed NUMA node 1 (MCDRAM) usage increasing during the job run. A separate post comparing performance against the DDR4RAM equivalent code would make the justification for using MCDRAM even more convincing.
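As a teaser for that follow-up post, here is a rough probe. It is a sketch only: a single cold memset pass is dominated by first-touch page faults, so a real benchmark would warm up the buffers and average several passes:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <chrono>
#include <hbwmalloc.h>

// Time one memset sweep over a buffer; a crude proxy for write bandwidth.
static double time_memset(char *buf, long n) {
  const auto t0 = std::chrono::steady_clock::now();
  std::memset(buf, 1, n);
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
  const long N = 1L << 30;  // 1 GiB: fits comfortably in both memories
  char *ddr = (char*)malloc(N);      // DDR4RAM (NUMA node 0)
  char *hbm = (char*)hbw_malloc(N);  // MCDRAM (NUMA node 1)
  if (ddr == NULL || hbm == NULL) {
    printf("allocation failed\n");
    return 1;
  }
  printf("DDR4RAM memset: %.3f s\n", time_memset(ddr, N));
  printf("MCDRAM  memset: %.3f s\n", time_memset(hbm, N));
  free(ddr);
  hbw_free(hbm);
  return 0;
}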

References


Intel Colfax Cluster - Notes - Index Page
