The following was tested on Google Cloud (GCP) using an a2-highgpu-1g
instance and the Rocky Linux 9 image.
It has 80GB of RAM, 12 CPU cores, and a single NVIDIA A100 40GB GPU attached.
I also recommend choosing a 500GB SSD boot disk; it's somewhat more than strictly required, but you may need the headroom.
NOTICE: Make sure you have a positive bank balance before trying this.
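If you prefer the CLI over the cloud console, an instance like the one above can be sketched with gcloud. The instance name and zone here are my own assumptions, so adjust them for your project:

```shell
# Hypothetical instance name and zone; adjust for your project.
# The a2-highgpu-1g machine type bundles one NVIDIA A100 40GB GPU.
gcloud compute instances create vicuna-demo \
  --zone=us-central1-a \
  --machine-type=a2-highgpu-1g \
  --image-family=rocky-linux-9 \
  --image-project=rocky-linux-cloud \
  --boot-disk-size=500GB \
  --boot-disk-type=pd-ssd \
  --maintenance-policy=TERMINATE
```

GPU instances require --maintenance-policy=TERMINATE because they cannot be live-migrated.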
Update the system:
sudo dnf update -y
Install my favorite editor:
sudo dnf install -y nano
Install some basic development tools:
sudo dnf groupinstall "Development Tools"
sudo dnf install python3-pip
Next you need to install drivers for your GPU. I am of course using an NVIDIA A100-SXM4-40GB,
but this should work for almost any recent NVIDIA datacenter GPU.
Enable the CRB repository and add the EL9-compatible EPEL and NVIDIA CUDA repositories:
sudo dnf config-manager --set-enabled crb
sudo dnf install \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm \
https://dl.fedoraproject.org/pub/epel/epel-next-release-latest-9.noarch.rpm
sudo dnf config-manager --add-repo \
http://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo
Install driver dependencies:
sudo dnf install kernel-headers-$(uname -r) kernel-devel-$(uname -r) tar bzip2 make automake gcc gcc-c++ \
pciutils elfutils-libelf-devel libglvnd-opengl libglvnd-glx libglvnd-devel acpid pkgconfig dkms
Install NVIDIA GPU driver:
sudo dnf module install nvidia-driver:latest-dkms
Now it's a good time to reboot the system:
sudo reboot
Check that the driver installation worked:
nvidia-smi
After a second or two, you should see something like this:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB Off| 00000000:00:04.0 Off | 0 |
| N/A 35C P0 51W / 400W| 0MiB / 40960MiB | 27% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
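For a quick scripted sanity check (useful later if you automate any of this), nvidia-smi can also print selected fields as CSV instead of the full table:

```shell
# Print GPU name, driver version, and total memory as CSV.
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
```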
Install Git LFS (large file support for git), which the Hugging Face model repositories require:
sudo dnf install git-lfs
git lfs install
Clone LLaMA-13b model weights:
git clone https://huggingface.co/huggyllama/llama-13b
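Cloning pulls tens of gigabytes through Git LFS, so before moving on it is worth confirming that the weight files actually downloaded rather than remaining as small LFS pointer files:

```shell
# List the files tracked by LFS and the total size on disk;
# real weight shards should be multiple gigabytes each.
cd llama-13b
git lfs ls-files
du -sh .
cd ..
```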
Create Vicuna-13b weights output directory:
mkdir vicuna-13b
Clone FastChat repository:
git clone https://github.com/lm-sys/FastChat.git && cd FastChat
Upgrade pip (to enable PEP 660 support):
pip3 install --upgrade pip
Install package dependencies:
pip3 install -e .
Apply the delta weights (this will download the delta repository from Hugging Face):
python3 -m fastchat.model.apply_delta \
--base-model-path ../llama-13b \
--target-model-path ../vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1
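The merge loads both the base and the delta weights into memory. If the process gets killed for lack of RAM, the FastChat README describes a --low-cpu-mem flag that splits large weight files and uses the disk as temporary storage; confirm the flag against your FastChat version before relying on it:

```shell
# Same merge, but with reduced peak RAM usage (per the FastChat docs).
python3 -m fastchat.model.apply_delta \
  --base-model-path ../llama-13b \
  --target-model-path ../vicuna-13b \
  --delta-path lmsys/vicuna-13b-delta-v1.1 \
  --low-cpu-mem
```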
Confirm the weights output:
ls -alh ../vicuna-13b/
Test the merged model in the terminal:
python3 -m fastchat.serve.cli --model-path ../vicuna-13b
Install tmux for easily running multiple processes:
sudo dnf install -y tmux
To run tmux, just type tmux in the shell. The first window is created automatically.
To create another window: ctrl + b, then c.
To switch windows: ctrl + b, then w, and choose a window with the arrow keys.
To detach: ctrl + b, then d.
To reattach to the latest session, type tmux attach (or the shorthand tmux at) in the shell.
Run each of the servers in a different tmux window so you can switch between
them, and so they keep running after you log out or disconnect.
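The window juggling above can also be scripted up front; the session and window names here are my own invention, adjust as you like:

```shell
# Create a detached session with one window per server (names are arbitrary).
tmux new-session -d -s fastchat -n controller
tmux new-window -t fastchat -n worker
tmux new-window -t fastchat -n webui
# Attach, then switch between windows with ctrl + b, then w.
tmux attach -t fastchat
```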
Start the controller server:
python3 -m fastchat.serve.controller
Start the worker server (can run multiple workers, different models):
python3 -m fastchat.serve.model_worker --model-path ../vicuna-13b/
Add the default web interface HTTP port (7860) to the firewall:
sudo firewall-cmd --add-port=7860/tcp
sudo firewall-cmd --add-port=7860/tcp --permanent
If you're using Google GCP, you will probably also need to allow ingress traffic on port 7860 in your VPC firewall!
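A minimal sketch of such a rule with gcloud; the rule name and the open-to-the-world source range are my own assumptions, so tighten them for anything beyond a quick demo:

```shell
# Hypothetical rule name; 0.0.0.0/0 opens the port to everyone.
gcloud compute firewall-rules create allow-gradio-7860 \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:7860 \
  --source-ranges=0.0.0.0/0
```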
Start the GUI web interface:
python3 -m fastchat.serve.gradio_web_server