Run WasmEdge WASI-NN plugin on AWS EC2 Instance

Setup CUDA on AWS EC2 Instance

This tutorial is based on the AWS documentation for installing Nvidia drivers on Linux instances. We will install the Nvidia driver on an AWS EC2 instance, then compile and run llama.cpp on it.

Create an AWS EC2 instance

Here we use a g5.4xlarge instance with the Ubuntu 22.04 AMI, which provides an Nvidia A10G GPU.
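
If you prefer to launch the instance from the command line instead of the console, a minimal sketch with the AWS CLI looks like the following. The AMI ID, key pair name, and security group ID are placeholders you must replace with your own values, and the 100 GB root volume size is an assumption to leave room for the CUDA toolkit and the model files.

# Placeholders: replace the AMI ID, key pair, and security group with your own.
# The 100 GB root volume is an assumption, not a value from the original tutorial.
aws ec2 run-instances \
    --image-id ami-xxxxxxxxxxxxxxxxx \
    --instance-type g5.4xlarge \
    --key-name my-key-pair \
    --security-group-ids sg-xxxxxxxxxxxxxxxxx \
    --count 1 \
    --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":100}}]'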

Install Nvidia driver

First, update the package list, upgrade the linux-aws kernel package, and reboot.

sudo apt-get update -y
sudo apt-get upgrade -y linux-aws
sudo reboot

Install the build tools, blacklist the nouveau driver, and rebuild the GRUB configuration:

sudo apt-get install -y gcc make linux-headers-$(uname -r)
cat << EOF | sudo tee --append /etc/modprobe.d/blacklist.conf
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
EOF
sudo sed -i 's/GRUB_CMDLINE_LINUX=""/GRUB_CMDLINE_LINUX="rdblacklist=nouveau"/' /etc/default/grub
sudo update-grub
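
As an optional sanity check after the next reboot, confirm that nouveau is no longer loaded and that the blacklist option made it onto the kernel command line:

lsmod | grep nouveau     # should produce no output
cat /proc/cmdline        # should include rdblacklist=nouveau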

Download Nvidia driver from AWS S3

sudo apt install awscli
aws configure
aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/latest/ .
sudo sh NVIDIA-Linux-x86_64-535.104.05-grid-aws.run
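
The driver filename above reflects the release that was current when this was written; the latest/ prefix in the bucket may now contain a newer version. If the .run file you downloaded has a different name, list the bucket contents and adjust the command accordingly:

aws s3 ls s3://ec2-linux-nvidia-drivers/latest/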

Confirm the driver is installed successfully

nvidia-smi -q | head

Disable GSP firmware and reboot. (See Nvidia's documentation on GSP firmware for more information.)

sudo touch /etc/modprobe.d/nvidia.conf
echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee --append /etc/modprobe.d/nvidia.conf
sudo reboot
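
After the reboot you can optionally confirm that GSP is disabled. On recent drivers, nvidia-smi -q reports a GSP firmware field, which should read N/A when the firmware is turned off; treat the exact field name as an assumption, since it varies across driver versions.

nvidia-smi -q | grep -i gsp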

Install CUDA

wget https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux.run
sudo sh cuda_12.2.2_535.104.05_linux.run --silent --override --toolkit --samples --toolkitpath=/usr/local/cuda-12 --samplespath=/usr/local/cuda --no-opengl-libs
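
The toolkit installer does not modify your shell environment, so add nvcc and the CUDA libraries to your PATH and library path, then verify the installation. The /usr/local/cuda symlink is normally created by the installer; if it is missing on your system, point the paths at /usr/local/cuda-12 instead.

# If /usr/local/cuda does not exist, substitute /usr/local/cuda-12 below.
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version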

Build and Run llama.cpp

Install required packages

sudo apt install build-essential cmake ninja-build

Checkout and build llama.cpp

Here we use the same version of llama.cpp (tag b1309) that WasmEdge uses.

git clone https://github.com/ggerganov/llama.cpp.git -b b1309
cd llama.cpp
cmake -Bbuild -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
cmake --build build

Download models

Here we download two Llama 2 chat models (7B and 13B). The 70B model is too large (about 46 GB at this quantization) to fit into the 24 GB of memory on the A10G GPU.

curl -LO https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf
curl -LO https://huggingface.co/TheBloke/Llama-2-13b-Chat-GGUF/resolve/main/llama-2-13b-chat.Q5_K_M.gguf

Run llama.cpp

./build/bin/main -m llama-2-7b-chat.Q5_K_M.gguf -ngl 99 -n 512 -p 'Hello'
./build/bin/main -m llama-2-13b-chat.Q5_K_M.gguf -ngl 99 -n 512 -p 'Hello'
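
The -ngl 99 flag offloads all model layers to the GPU, -n 512 caps the number of generated tokens, and -p sets the prompt. To confirm the model is actually running on the GPU, you can watch GPU memory and utilization from a second shell while the command runs:

watch -n 1 nvidia-smi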

Performance

  • 7b model
llama_print_timings:        load time =  1212.33 ms
llama_print_timings:      sample time =   394.57 ms /    76 runs   (    5.19 ms per token,   192.62 tokens per second)
llama_print_timings: prompt eval time =    93.69 ms /     2 tokens (   46.84 ms per token,    21.35 tokens per second)
llama_print_timings:        eval time =  1028.71 ms /    75 runs   (   13.72 ms per token,    72.91 tokens per second)
llama_print_timings:       total time =  1584.08 ms
  • 13b model
llama_print_timings:        load time =  2175.20 ms
llama_print_timings:      sample time =  1137.54 ms /   220 runs   (    5.17 ms per token,   193.40 tokens per second)
llama_print_timings: prompt eval time =   145.10 ms /     2 tokens (   72.55 ms per token,    13.78 tokens per second)
llama_print_timings:        eval time =  5282.73 ms /   219 runs   (   24.12 ms per token,    41.46 tokens per second)
llama_print_timings:       total time =  6765.25 ms

Build and Run WasmEdge WASI-NN Plugin with GGML Backend

Install required packages

sudo apt install build-essential cmake ninja-build

Install WasmEdge with WASI-NN plugin

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml
source /home/ubuntu/.bashrc
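
You can verify that the installer put wasmedge on your PATH:

wasmedge --version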

Checkout and build WasmEdge WASI-NN plugin

NOTICE: We use an unmerged branch of WasmEdge (hydai/enable_cublas) to enable cuBLAS support.

git clone https://github.com/WasmEdge/WasmEdge.git -b hydai/enable_cublas
cd WasmEdge
cmake -Bbuild -GNinja -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_BUILD_TYPE=Release -DWASMEDGE_BUILD_AOT_RUNTIME=OFF -DWASMEDGE_BUILD_TOOLS=OFF -DWASMEDGE_PLUGIN_WASI_NN_BACKEND=GGML -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_BLAS=OFF -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_CUBLAS=ON
cmake --build build
cp build/plugins/wasi_nn/libwasmedgePluginWasiNN.so ~/.wasmedge/plugin/libwasmedgePluginWasiNN.so
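
A quick check that the rebuilt plugin is in place where WasmEdge looks for it:

ls -l ~/.wasmedge/plugin/libwasmedgePluginWasiNN.so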

Download models

Here we download two Llama 2 chat models (7B and 13B). The 70B model is too large (about 46 GB at this quantization) to fit into the 24 GB of memory on the A10G GPU.

curl -LO https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf
curl -LO https://huggingface.co/TheBloke/Llama-2-13b-Chat-GGUF/resolve/main/llama-2-13b-chat.Q5_K_M.gguf

Download WasmEdge WASI-NN example WASM

curl -LO https://github.com/second-state/WasmEdge-WASINN-examples/raw/master/wasmedge-ggml-llama-interactive/wasmedge-ggml-llama-interactive.wasm

Run WasmEdge WASI-NN example WASM

LLAMA_N_GL=99 LLAMA_LOG=1 \
    wasmedge --dir .:. \
    --nn-preload default:GGML:CPU:llama-2-7b-chat.Q5_K_M.gguf \
    wasmedge-ggml-llama-interactive.wasm default
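
The --nn-preload option maps the model file to the alias default, which the WASM program passes to WASI-NN. LLAMA_N_GL=99 appears to set the number of GPU layers (mirroring llama.cpp's -ngl flag) and LLAMA_LOG=1 enables the timing output shown below. To reproduce the 13B numbers, point the preload at the 13B model instead:

LLAMA_N_GL=99 LLAMA_LOG=1 \
    wasmedge --dir .:. \
    --nn-preload default:GGML:CPU:llama-2-13b-chat.Q5_K_M.gguf \
    wasmedge-ggml-llama-interactive.wasm default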

Performance

  • 7b model
llama_print_timings:        load time =  3325.66 ms
llama_print_timings:      sample time =     0.48 ms /    21 runs   (    0.02 ms per token, 44210.53 tokens per second)
llama_print_timings: prompt eval time =   106.76 ms /    49 tokens (    2.18 ms per token,   458.96 tokens per second)
llama_print_timings:        eval time =   273.70 ms /    20 runs   (   13.68 ms per token,    73.07 tokens per second)
llama_print_timings:       total time =  3600.52 ms
  • 13b model
llama_print_timings:        load time =  5607.37 ms
llama_print_timings:      sample time =     0.48 ms /    21 runs   (    0.02 ms per token, 43388.43 tokens per second)
llama_print_timings: prompt eval time =   165.15 ms /    49 tokens (    3.37 ms per token,   296.71 tokens per second)
llama_print_timings:        eval time =   477.82 ms /    20 runs   (   23.89 ms per token,    41.86 tokens per second)
llama_print_timings:       total time =  6085.66 ms