project notes

System

Linux bvhp 5.14.0-1042-oem #47-Ubuntu SMP Fri Jun 3 18:17:11 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

❯ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

JAX

Install

Base env: conda create -n jax python=3.10

Docs: https://github.com/google/jax#installation

  • (CPU) pip install --upgrade "jax[cpu]"

  • (GPU) pip install --upgrade "jax[cuda]"

Both CPU and GPU pip installs were very quick and worked on the first try without any issues.

Code experiments

Quickstart

JAX has a quickstart page here:

https://jax.readthedocs.io/en/latest/notebooks/quickstart.html

All examples from the quickstart (covering jit, grad, and vmap) executed properly and showed the expected results on the first try using the GPU package.
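
For reference, the flavor of the quickstart snippets, condensed into one runnable sketch (my own condensation, not the notebook's exact code):

import jax
import jax.numpy as jnp

# A tiny smoke test in the spirit of the quickstart: jit, grad, and vmap.
def f(x):
    return jnp.sum(jnp.tanh(x) ** 2)

x = jnp.arange(4.0)
print(jax.jit(f)(x))                                  # compiled evaluation
print(jax.grad(f)(x))                                 # gradient with respect to x
print(jax.vmap(jax.grad(f))(jnp.stack([x, x + 1])))   # batched gradients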

Convolutions

A more advanced tutorial covers convolutions:

https://jax.readthedocs.io/en/latest/notebooks/convolutions.html

In this chapter I did run into some issues running the examples out of the box. Running the first example resulted in the error Unimplemented: DNN library is not found. Some searching led to this issue. The solution (conda install cudnn) was simple, but the cuDNN requirement was not mentioned on the page.

After this hiccup, all the examples ran without issue and produced expected output (including plots) on both Linux (GPU) and MacOS (CPU-only).
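
For context, the tutorial's first example is basic one-dimensional convolution; something along these lines (my paraphrase, not the notebook's exact code) is what triggered the cuDNN error on GPU:

import jax
import jax.numpy as jnp

# Smooth a noisy signal with a small box kernel via 1D convolution.
key = jax.random.PRNGKey(1701)
signal = jnp.linspace(0, 10, 500) + jax.random.normal(key, (500,))
kernel = jnp.ones(10) / 10.0
print(jnp.convolve(signal, kernel, mode='same')[:5])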

Multi-GPU

Another, more advanced tutorial for jax.Array is located here:

https://jax.readthedocs.io/en/latest/notebooks/jax_Array.html

These examples assume 8 devices, but my local machine only has 2. However, with minimal updates to function parameters to account for this, all of the examples worked on both Linux (GPU) and MacOS (CPU-only).
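
For reference, the kind of adjustment involved is sizing the device mesh from jax.devices() instead of hard-coding 8 (a sketch assuming the PositionalSharding / mesh_utils API that notebook uses):

import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import PositionalSharding

# Build a 1D mesh over however many devices are actually present (2 here).
ndev = len(jax.devices())
sharding = PositionalSharding(mesh_utils.create_device_mesh((ndev,)))

x = jnp.arange(8 * 1024.0).reshape(8, 1024)
y = jax.device_put(x, sharding.reshape(ndev, 1))   # shard rows across the devices
jax.debug.visualize_array_sharding(y)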

Multi-node

https://jax.readthedocs.io/en/latest/multi_process.html

The docs note:

Furthermore, you must manually run your JAX program on each host! JAX doesn’t automatically start multiple processes from a single program invocation.

I was not able to successfully run jax.distributed.initialize; it seemed to hang no matter what parameters were given, and eventually errored with:

XlaRuntimeError: DEADLINE_EXCEEDED: Barrier timed out. Barrier_id: PjRT_Client_Connect

After some consultation, it was suggested that jax.distributed.initialize does not work in the standard REPL, despite that being what the docs page above demonstrates. The small example below was successful:

#  mpirun -np 2 python script.py

import jax
import platform

import jax.numpy as jnp
from mpi4py import MPI

# Determine this process's rank and the world size from MPI.
rank = MPI.COMM_WORLD.Get_rank()
nproc = MPI.COMM_WORLD.Get_size()

# Broadcast rank 0's hostname so every process agrees on the coordinator address.
if rank == 0:
    hostname = platform.node()
    MPI.COMM_WORLD.bcast(hostname, root=0)
else:
    hostname = MPI.COMM_WORLD.bcast(None, root=0)

addr = f"{hostname}:1234"

# Each process claims one local device and joins the distributed runtime.
jax.distributed.initialize(addr, nproc, rank, local_device_ids=rank)

print(jax.device_count())
print(jax.local_device_count())

xs = jax.numpy.ones(jax.local_device_count())
print(jax.pmap(lambda x: jax.lax.psum(x, 'i'), axis_name='i')(xs))

Profiling

Docs: https://jax.readthedocs.io/en/latest/profiling.html

Built-in trace context manager

with jax.profiler.trace("/tmp/jax-trace", create_perfetto_link=True):
  # Run the operations to be profiled
  key = jax.random.PRNGKey(0)
  x = jax.random.normal(key, (5000, 5000))
  y = x @ x
  y.block_until_ready()

This method pops up a UI to display the trace, but it blocks program execution until the UI is used to continue.

NOTE: I was not able to get this method working due to this issue google/jax#13009

Tensorboard

A more explicit start/stop API. There were some issues with warnings about missing dependencies, and the TensorBoard UI also seemed a bit flaky.
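
For reference, the start/stop calls look roughly like this (a minimal sketch based on the profiling docs; the log directory and workload are arbitrary, and viewing the result requires TensorBoard's profiling plugin):

import jax

# Explicit start/stop tracing; output is written in TensorBoard's format.
jax.profiler.start_trace("/tmp/tensorboard")

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (5000, 5000))
(x @ x).block_until_ready()

jax.profiler.stop_trace()
# Then view with: tensorboard --logdir /tmp/tensorboard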

Nsight

The docs mention Nsight is also an option, but only link to the Nsight docs (no examples, etc.).

Observations

Time to code

For single-process use, time to code was more or less immediate after the install completed. This was true on MacOS (CPU-only) as well as Linux (GPU).

The one hiccup was for multi-node execution where some research was required to find a complete working example.

Docs

The JAX docs are very nice. While there is a plain API reference, the bulk of the docs are narrative and example-based topical user guide chapters. The prevalence of easily copy-pastable code blocks with expected outputs following helps users develop confidence that things are working as they progress through a chapter.

The docs also directly address valuable practical questions:

How can we be sure it’s actually running in parallel? We can do a simple timing experiment:

and, in general, provide discussions of applying JAX to practical use-cases.
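
For example, the timing experiment that quote refers to boils down to something like this (a sketch, not the docs' exact code):

import time
import jax
import jax.numpy as jnp

ndev = jax.local_device_count()
xs = jnp.ones((ndev, 4000, 4000))
f = jax.pmap(lambda x: x @ x)

f(xs).block_until_ready()           # warm-up run (includes compilation)
start = time.time()
f(xs).block_until_ready()           # timed run across all local devices
print(f"{ndev} devices: {time.time() - start:.3f}s")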

The docs also make experimentation accessible by providing Google Colab links to notebooks for each chapter (sign-in required).

Where applicable, the docs also include nice plots.

UX features

It's also worth noting that JAX uses rich to afford very nice TUI output for visualizing things like device shardings:

z = jnp.sin(y)
jax.debug.visualize_array_sharding(z)
┌──────────┬──────────┐
│  TPU 0   │  TPU 1   │
├──────────┼──────────┤
│  TPU 2   │  TPU 3   │
├──────────┼──────────┤
│  TPU 6   │  TPU 7   │
├──────────┼──────────┤
│  TPU 4   │  TPU 5   │
└──────────┴──────────┘

Cunumeric

Install

Base env: conda create -n cn python=3.10

Docs: https://nv-legate.github.io/cunumeric/22.03/README.html#installation

  • conda install -c nvidia -c conda-forge -c legate cunumeric

This worked on the first try, though it was a somewhat long/large install.

"Read on for general instructions on building cuNumeric from source" is small and easy to miss, it's not completely obvious the dependencies are for source installs.

Code experiments

It is not clear from the docs where to go to see a working example.
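
Since cuNumeric aims to be a drop-in NumPy replacement, a minimal smoke test would look something like the sketch below (my own, not from the docs); it is typically launched through the legate driver, e.g. legate --gpus 1 smoke.py:

import cunumeric as np

# Ordinary NumPy code, executed by cuNumeric; unimplemented pieces fall back to NumPy.
a = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)
b = np.ones((1000, 1000))
print((a @ b).sum())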

Profiling

I could not find any docs or examples demonstrating how to profile at https://nv-legate.github.io/cunumeric/22.03/

There could be a demonstration of running a script with --profile enabled, the need to run python -m http.server on the output directory, and then a screenshot showing the expected output in a browser.

Observations

There is no quickstart [1] directly in the docs, nor any pointer to example scripts in the repo. There is one link to a CFD notebook in the Overview.

[1] There is a "quickstart repo", but it seems to be poorly named, as "Scripts for building Docker images and libraries from source" does not really seem like quickstart material.

There is no explicit discussion of how to run multi-node or multi-GPU, and almost no discussion of how to run legate. There is one note:

We encourage all users to familiarize themselves with these resource flags as described in the Legate Core documentation

but that note does not actually link to the Legate docs.

Rapids

Install

Base env: conda create -n rapids python=3.10

Docs: https://rapids.ai/start.html

Tried pip install first:

  • pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com

Did not work:

rapids ❯ pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
ERROR: Could not find a version that satisfies the requirement cudf-cu11 (from versions: 0.0.1)
ERROR: No matching distribution found for cudf-cu11

Went back to conda (new env to replace base env):

  • conda create -n rapids-22.10 -c rapidsai -c conda-forge -c nvidia rapids=22.10 python=3.9 cudatoolkit=11.5

This worked but was an enormous and long (several minutes) install.

Code experiments

Docs: https://docs.rapids.ai/start and https://docs.rapids.ai/api/cudf/stable/user_guide/10min.html

"10 Minutes to cuDF and Dask-cuDF" had many example codes that could be copy-pasted directly in to the REPL. All of the examples worked without issue, and cover many Pandas features by way of direct comparison. A RHS navigation bar made it easy to jump to particular topics of interest.

There was a specific section covering the Dask local CUDA cluster and observing GPU usage:

https://docs.rapids.ai/api/cudf/stable/user_guide/10min.html#dask-performance-tips
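
A "local CUDA cluster" here means dask_cuda.LocalCUDACluster (one worker per GPU); a minimal sketch of that setup (mine, not the guide's exact code), with the dashboard link being what you watch to observe GPU usage:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per local GPU; the dashboard shows per-GPU utilization.
cluster = LocalCUDACluster()
client = Client(cluster)
print(client.dashboard_link)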

Profiling

The recommendation is to use Nsight. There is no information in the main docs, but there is a blog post with detailed information:

https://developer.nvidia.com/blog/nvidia-tools-extension-api-nvtx-annotation-tool-for-profiling-code-in-python-and-c-c/

Usage is explicit, via nvtx.annotate decorators and context managers that users add themselves:

import time
import nvtx

@nvtx.annotate("f()", color="purple")
def f():
    for i in range(5):
        with nvtx.annotate("loop", color="red"):
            time.sleep(i)

Observations

After some initial install hiccups, cudf and dask-dataframe code was immediately runnable in the IPython REPL.

Cupy

Install

Base env: conda create -n cupy python=3.10

Docs: https://docs.cupy.dev/en/stable/install.html

  • pip install cupy-cuda11x

Quick/simple install worked first try.

Code experiments

Lots of inline code samples in several topical sections:

  • Basics of CuPy
  • User-Defined Kernels
  • Accessing CUDA Functionalities
  • Fast Fourier Transform with CuPy
  • Memory Management
  • Performance Best Practices
  • Interoperability
  • Differences between CuPy and NumPy

Code from these sections could generally be run immediately after install without issue. One exception was "Multi-GPU FFT", which is labeled experimental and seems to have a compatibility issue with the latest SciPy.
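
For a sense of what those sections contain, a condensed example in the spirit of "Basics of CuPy" (my sketch):

import cupy as cp
import numpy as np

# Arrays live on the GPU; the API otherwise mirrors NumPy.
x_gpu = cp.arange(6, dtype=cp.float32).reshape(2, 3)
print(x_gpu.sum(axis=1))            # computed on the device
x_cpu = cp.asnumpy(x_gpu)           # explicit copy back to the host
print(np.allclose(x_cpu, x_gpu.get()))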

Profiling

Docs: https://docs.cupy.dev/en/stable/reference/generated/cupyx.profiler.profile.html#cupyx.profiler.profile

Context-manager based:

with cupyx.profiler.profile():
   # do something you want to measure
   pass

I assume this outputs traces that can be used by nvprof / nsys (both are mentioned in passing), but I could not find any detailed or step-by-step demonstration of profiling.
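
A minimal usage sketch under that assumption (the workload is arbitrary; as I understand it, the context manager marks the region of interest for an externally attached profiler such as nsys or nvprof):

import cupy as cp
from cupyx.profiler import profile

x = cp.random.random((4096, 4096)).astype(cp.float32)

# Only the work inside the context manager should be attributed to the marked region.
with profile():
    y = x @ x
    cp.cuda.Stream.null.synchronize()   # make sure the GPU work finishes inside the region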

Observations

Time to code

Pretty much immediate; examples from the docs page could be run out of the box.

Multi-node (computelab) experience

ref: https://gitlab-master.nvidia.com/groups/legate/-/wikis/Cluster%20access%20information#computelab-nvidia

  • went to create scratch space, learned I did not yet have a unix account, submitted a request to create one

CLLR

Didn't get very far trying to use the quickstart before running into various issues.

  • trying https://gitlab-master.nvidia.com/legate/quickstart.internal#prometheus-nvidia

    • error with tcsh (default)

      computelab-frontend-1:~/quickstart.internal> module load PrgEnv/GCC+OpenMPI/2021-05-27 cuda gcc openmpi
      CORRECT>modules load PrgEnv/GCC+OpenMPI/2021-05-27 cuda gcc openmpi (y|n|e|a)? yes
      /etc/modules: Permission denied.

    • error with bash

      bvandeven@computelab-frontend-1:~/quickstart.internal$ module load PrgEnv/GCC+OpenMPI/2021-05-27 cuda gcc openmpi
      module: command not found

  • trying https://gitlab-master.nvidia.com/legate/quickstart.internal#building-and-using-docker-images

    • error with make_image.sh

      failed to dial gRPC: cannot connect to the Docker daemon. Is 'docker daemon' running on this host?: dial unix /var/run/docker.sock: connect: permission denied

  • discovered need to follow "bare metal" instructions further down, not container instructions that are at the top

    • issue here is that several steps are split over several different pages
    • need one top to bottom set of steps for users to follow in order
    • e.g. link to full docs for "creating conda envs" for context and further info, but give a reasonable specific conda command to execute right there inline to avoid disrupting flow.
  • was not running on scratch space, had to restart

  • Would be nice to advise how to check these versions

    Make sure you use an environment file with a --ctk version matching the system-wide CUDA version (i.e. the version provided by the CUDA module you load). Most commonly on clusters you will want to use the system-provided compilers and MPI implementation, therefore you will likely want to use an environment generated with --no-compilers and --no-openmpi.

  • Had originally used the conda env from the separate link above, but that specifies --compilers and --openmpi. Based on later instructions, backed that out and used ./scripts/generate-conda-envs.py --python 3.10 --ctk 11.7 --os linux --no-compilers --no-openmpi

  • LEGION_DIR=../legion/ ../quickstart.internal/build.sh fails:

    CMake Error at /home/scratch.bvandeven_research_1/miniconda3/envs/legate/share/cmake-3.25/Modules/CMakeDetermineCUDACompiler.cmake:180 (message):
        Failed to find nvcc.
    
    Compiler requires the CUDA toolkit.  Please set the CUDAToolkit_ROOT
    
  • was missing https://gitlab-master.nvidia.com/legate/quickstart.internal#computelab-nvidia

    • at least for my "new user trying things" state of mind, having this information out of band after the main "getting started" steps was counterproductive. I got stuck before I ever saw this and therefore did not know it was the reason.
  • using run.sh on cholesky example

    ORTE does not know how to route a message to the specified daemon located on the indicated node:

    • manual ssh key fingerprint issue, this needs docs support if not fixed on admin site
  • finally running

Jax

Tried a very simple example:

import jax

jax.distributed.initialize()

xs = jax.numpy.ones(jax.local_device_count())

result = jax.pmap(lambda x: jax.lax.psum(x, 'i'), axis_name='i')(xs)

According to the JAX docs, no parameters need to be supplied in SLURM environments. Invoked as:

srun -N 2 --exclusive -p dgx-1v-multinode --pty python jax-test.py

However, this failed in obscure ways:

Traceback (most recent call last):
  File "/home/scratch.bvandeven_research_1/foo.py", line 3, in <module>
    jax.distributed.initialize()
  File "/home/scratch.bvandeven_research_1/miniconda3/envs/legate/lib/python3.10/site-packages/jax/_src/distributed.py", line 160, in initialize
    global_state.initialize(coordinator_address, num_processes, process_id, local_device_ids)
  File "/home/scratch.bvandeven_research_1/miniconda3/envs/legate/lib/python3.10/site-packages/jax/_src/distributed.py", line 80, in initialize
    self.client.connect()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Barrier timed out. Barrier_id: PjRT_Client_Connect
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1672961505.985774662","description":"Error received from peer ipv4:127.0.1.1:62724","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Barrier timed out. Barrier_id: PjRT_Client_Connect","grpc_status":4} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-01-05 15:31:46.066509: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service_agent.cc:479] Failed to disconnect from coordination service with status: UNAVAILABLE: failed to connect to all addresses
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1672961506.066464202","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3940,"referenced_errors":[{"created":"@1672961506.066461968","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":392,"grpc_status":14}]}. Proceeding with agent shutdown anyway.
srun: error: prm-dgx-02: task 0: Exited with exit code 1
srun: error: prm-dgx-03: task 1: Aborted (core dumped)

Dask

ref: https://docs.dask.org/en/stable/deploying.html

  • tried

    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster
    cluster = SLURMCluster(cores=8,
                           processes=4,
                           memory="16GB",
                           account="bvandeven",
                           walltime="01:00:00",
                           queue="all")
    cluster.scale(2)
    

got

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Dask (RAPIDS)

ref: https://rapids.ai/hpc.html

More successful here; I got the example on that page running after only ~10 minutes (there was an error in the code, which I reported to the RAPIDS folks). Details:

scheduler

(rapids-22.10) bvandeven@computelab-frontend-1:/home/scratch.bvandeven_research_1/rapids$ cat scheduler.sh
#!/usr/bin/env bash

#srun -N 1 --exclusive -p dgx-1v-multinode --pty -J scheduler.sh -t 00:10:00 bash scheduler.sh

#module load cuda/11.0.3
CONDA_ROOT=/home/scratch.bvandeven_research_1/miniconda3
source $CONDA_ROOT/etc/profile.d/conda.sh
conda activate rapids-22.10

LOCAL_DIRECTORY=/home/scratch.bvandeven_research_1/rapids/tmp
mkdir $LOCAL_DIRECTORY
CUDA_VISIBLE_DEVICES=0 dask-scheduler \
    --protocol tcp \
    --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json" &

dask-cuda-worker \
    --rmm-pool-size 14GB \
    --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json"

workers

(base) bvandeven@computelab-frontend-1:/home/scratch.bvandeven_research_1/rapids$ cat worker.sh
#!/usr/bin/env bash

#srun -N 2 --exclusive -p dgx-1v-multinode --pty -J worker.sh -t 00:10:00 bash worker.sh

#module load cuda/11.0.3
CONDA_ROOT=/home/scratch.bvandeven_research_1/miniconda3
source $CONDA_ROOT/etc/profile.d/conda.sh
conda activate rapids-22.10

LOCAL_DIRECTORY=/home/scratch.bvandeven_research_1/rapids/tmp
mkdir $LOCAL_DIRECTORY
dask-cuda-worker \
    --rmm-pool-size 14GB \
    --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json"

example code

bvandeven@computelab-frontend-1:/home/scratch.bvandeven_research_1/rapids$ cat example.sh
#!/usr/bin/env bash

#srun -N 1 --exclusive -p dgx-1v-multinode --pty -J example.sh -t 00:10:00 bash example.sh

#module load cuda/11.0.3
CONDA_ROOT=/home/scratch.bvandeven_research_1/miniconda3
source $CONDA_ROOT/etc/profile.d/conda.sh
conda activate rapids-22.10

LOCAL_DIRECTORY=/home/scratch.bvandeven_research_1/rapids/tmp

cat <<EOF >>/tmp/dask-cudf-example2.py
import cudf
import dask.dataframe as dd
from dask.distributed import Client

client = Client(scheduler_file="$LOCAL_DIRECTORY/dask-scheduler.json")
cdf = cudf.datasets.timeseries()

ddf = dd.from_pandas(cdf, npartitions=10)
res = ddf.groupby(['id', 'name']).agg(['mean', 'sum', 'count']).compute()
print(res)
EOF

python /tmp/dask-cudf-example2.py