Run WasmEdge WASI-NN plugin on AWS EC2 Instance

Setup CUDA on AWS EC2 Instance

This tutorial is based on the AWS documentation for installing Nvidia drivers on Linux instances. We will install the Nvidia driver on an AWS EC2 instance, then compile and run llama.cpp on it.

Create an AWS EC2 instance

Here we use a g5.4xlarge instance with the Ubuntu 22.04 AMI, which provides an Nvidia A10G GPU.
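
If you prefer to launch the instance from the command line instead of the console, a minimal sketch with the AWS CLI looks like the following. The AMI ID, key pair name, and security group ID are placeholders you must replace with your own values, and the 100 GB root volume size is an assumption to leave room for the CUDA toolkit and the model files.

# Placeholders: replace the AMI ID, key pair, and security group with your own.
# The 100 GB root volume is an assumption, not a value from the original tutorial.
aws ec2 run-instances \
    --image-id ami-xxxxxxxxxxxxxxxxx \
    --instance-type g5.4xlarge \
    --key-name my-key-pair \
    --security-group-ids sg-xxxxxxxxxxxxxxxxx \
    --count 1 \
    --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":100}}]'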

Install Nvidia driver

First, update the package list, upgrade the linux-aws kernel package, and reboot.

sudo apt-get update -y
sudo apt-get upgrade -y linux-aws
sudo reboot

Install the build tools, blacklist the nouveau driver, and rebuild the GRUB configuration:

sudo apt-get install -y gcc make linux-headers-$(uname -r)
cat << EOF | sudo tee --append /etc/modprobe.d/blacklist.conf
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
EOF
sudo sed -i 's/GRUB_CMDLINE_LINUX=""/GRUB_CMDLINE_LINUX="rdblacklist=nouveau"/' /etc/default/grub
sudo update-grub
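
As an optional sanity check after the next reboot, confirm that nouveau is no longer loaded and that the blacklist option made it onto the kernel command line:

lsmod | grep nouveau     # should produce no output
cat /proc/cmdline        # should include rdblacklist=nouveau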

Download Nvidia driver from AWS S3

sudo apt install awscli
aws configure
aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/latest/ .
sudo sh NVIDIA-Linux-x86_64-535.104.05-grid-aws.run
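
The driver filename above reflects the release that was current when this was written; the latest/ prefix in the bucket may now contain a newer version. If the .run file you downloaded has a different name, list the bucket contents and adjust the command accordingly:

aws s3 ls s3://ec2-linux-nvidia-drivers/latest/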

Confirm the driver is installed successfully

nvidia-smi -q | head

Disable GSP firmware and reboot. (See Nvidia's documentation on GSP firmware for more information.)

sudo touch /etc/modprobe.d/nvidia.conf
echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee --append /etc/modprobe.d/nvidia.conf
sudo reboot
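
After the reboot you can optionally confirm that GSP is disabled. On recent drivers, nvidia-smi -q reports a GSP firmware field, which should read N/A when the firmware is turned off; treat the exact field name as an assumption, since it varies across driver versions.

nvidia-smi -q | grep -i gsp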

Install CUDA

wget https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux.run
sudo sh cuda_12.2.2_535.104.05_linux.run --silent --override --toolkit --samples --toolkitpath=/usr/local/cuda-12 --samplespath=/usr/local/cuda --no-opengl-libs
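
The toolkit installer does not modify your shell environment, so add nvcc and the CUDA libraries to your PATH and library path, then verify the installation. The /usr/local/cuda symlink is normally created by the installer; if it is missing on your system, point the paths at /usr/local/cuda-12 instead.

# If /usr/local/cuda does not exist, substitute /usr/local/cuda-12 below.
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version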

Build and Run llama.cpp

Install required packages

sudo apt install build-essential cmake ninja-build

Checkout and build llama.cpp

Here we use the same version of llama.cpp (tag b1309) that WasmEdge uses.

git clone https://github.com/ggerganov/llama.cpp.git -b b1309
cd llama.cpp
cmake -Bbuild -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
cmake --build build

Download models

Here we download two Llama 2 chat models (7B and 13B). The 70B model is too large (about 46 GB at this quantization) to fit into the 24 GB of memory on the A10G GPU.

curl -LO https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf
curl -LO https://huggingface.co/TheBloke/Llama-2-13b-Chat-GGUF/resolve/main/llama-2-13b-chat.Q5_K_M.gguf

Run llama.cpp

./build/bin/main -m llama-2-7b-chat.Q5_K_M.gguf -ngl 99 -n 512 -p 'Hello'
./build/bin/main -m llama-2-13b-chat.Q5_K_M.gguf -ngl 99 -n 512 -p 'Hello'
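
The -ngl 99 flag offloads all model layers to the GPU, -n 512 caps the number of generated tokens, and -p sets the prompt. To confirm the model is actually running on the GPU, you can watch GPU memory and utilization from a second shell while the command runs:

watch -n 1 nvidia-smi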

Performance

  • 7b model
llama_print_timings:        load time =  1212.33 ms
llama_print_timings:      sample time =   394.57 ms /    76 runs   (    5.19 ms per token,   192.62 tokens per second)
llama_print_timings: prompt eval time =    93.69 ms /     2 tokens (   46.84 ms per token,    21.35 tokens per second)
llama_print_timings:        eval time =  1028.71 ms /    75 runs   (   13.72 ms per token,    72.91 tokens per second)
llama_print_timings:       total time =  1584.08 ms
  • 13b model
llama_print_timings:        load time =  2175.20 ms
llama_print_timings:      sample time =  1137.54 ms /   220 runs   (    5.17 ms per token,   193.40 tokens per second)
llama_print_timings: prompt eval time =   145.10 ms /     2 tokens (   72.55 ms per token,    13.78 tokens per second)
llama_print_timings:        eval time =  5282.73 ms /   219 runs   (   24.12 ms per token,    41.46 tokens per second)
llama_print_timings:       total time =  6765.25 ms

Build and Run WasmEdge WASI-NN Plugin with GGML Backend

Install required packages

sudo apt install build-essential cmake ninja-build

Install WasmEdge with WASI-NN plugin

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml
source /home/ubuntu/.bashrc
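
You can verify that the installer put wasmedge on your PATH:

wasmedge --version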

Checkout and build WasmEdge WASI-NN plugin

NOTICE: We use an unmerged branch of WasmEdge (hydai/enable_cublas) to enable cuBLAS support.

git clone https://github.com/WasmEdge/WasmEdge.git -b hydai/enable_cublas
cd WasmEdge
cmake -Bbuild -GNinja -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_BUILD_TYPE=Release -DWASMEDGE_BUILD_AOT_RUNTIME=OFF -DWASMEDGE_BUILD_TOOLS=OFF -DWASMEDGE_PLUGIN_WASI_NN_BACKEND=GGML -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_BLAS=OFF -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_CUBLAS=ON
cmake --build build
cp build/plugins/wasi_nn/libwasmedgePluginWasiNN.so ~/.wasmedge/plugin/libwasmedgePluginWasiNN.so
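
A quick check that the rebuilt plugin is in place where WasmEdge looks for it:

ls -l ~/.wasmedge/plugin/libwasmedgePluginWasiNN.so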

Download models

Here we download two Llama 2 chat models (7B and 13B). The 70B model is too large (about 46 GB at this quantization) to fit into the 24 GB of memory on the A10G GPU.

curl -LO https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf
curl -LO https://huggingface.co/TheBloke/Llama-2-13b-Chat-GGUF/resolve/main/llama-2-13b-chat.Q5_K_M.gguf

Download WasmEdge WASI-NN example WASM

curl -LO https://github.com/second-state/WasmEdge-WASINN-examples/raw/master/wasmedge-ggml-llama-interactive/wasmedge-ggml-llama-interactive.wasm

Run WasmEdge WASI-NN example WASM

LLAMA_N_GL=99 LLAMA_LOG=1 \
    wasmedge --dir .:. \
    --nn-preload default:GGML:CPU:llama-2-7b-chat.Q5_K_M.gguf \
    wasmedge-ggml-llama-interactive.wasm default
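
The --nn-preload option maps the model file to the alias default, which the WASM program passes to WASI-NN. LLAMA_N_GL=99 appears to set the number of GPU layers (mirroring llama.cpp's -ngl flag) and LLAMA_LOG=1 enables the timing output shown below. To reproduce the 13B numbers, point the preload at the 13B model instead:

LLAMA_N_GL=99 LLAMA_LOG=1 \
    wasmedge --dir .:. \
    --nn-preload default:GGML:CPU:llama-2-13b-chat.Q5_K_M.gguf \
    wasmedge-ggml-llama-interactive.wasm default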

Performance

  • 7b model
llama_print_timings:        load time =  3325.66 ms
llama_print_timings:      sample time =     0.48 ms /    21 runs   (    0.02 ms per token, 44210.53 tokens per second)
llama_print_timings: prompt eval time =   106.76 ms /    49 tokens (    2.18 ms per token,   458.96 tokens per second)
llama_print_timings:        eval time =   273.70 ms /    20 runs   (   13.68 ms per token,    73.07 tokens per second)
llama_print_timings:       total time =  3600.52 ms
  • 13b model
llama_print_timings:        load time =  5607.37 ms
llama_print_timings:      sample time =     0.48 ms /    21 runs   (    0.02 ms per token, 43388.43 tokens per second)
llama_print_timings: prompt eval time =   165.15 ms /    49 tokens (    3.37 ms per token,   296.71 tokens per second)
llama_print_timings:        eval time =   477.82 ms /    20 runs   (   23.89 ms per token,    41.86 tokens per second)
llama_print_timings:       total time =  6085.66 ms