@lxe
Last active February 27, 2024 21:19
How to get oobabooga/text-generation-webui running on Windows or Linux with LLaMa-30b 4bit mode via GPTQ-for-LLaMa on an RTX 3090 start to finish.

This guide also works well on Linux. Just skip the PowerShell-specific steps.

  1. Download prerequisites

  2. (Windows Only) Open the Conda Powershell.

    • Alternatively, open the regular PowerShell and activate the Conda environment:
      pwsh -ExecutionPolicy ByPass -NoExit -Command "& ~\miniconda3\shell\condabin\conda-hook.ps1 ; conda activate ~\miniconda3"
    • The GPTQ compilation sometimes fails if 'cl' is not on the PATH. If that happens, try the x64 Native Tools Command Prompt for VS 2019 instead, or load both the VS build tools and Conda in one shell like this:
      cmd /k '"C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Auxiliary\Build\vcvars64.bat" && pwsh -ExecutionPolicy ByPass -NoExit -Command "& ~\miniconda3\shell\condabin\conda-hook.ps1 ; conda activate ~\miniconda3"'
  3. You'll need the CUDA compiler and a torch build that matches its version in order to build the GPTQ extensions, which enable 4-bit prequantized models. Create a conda env and install python, cuda, and a torch build that matches the cuda version, as well as ninja for fast compilation:

    conda create -n tgwui
    conda activate tgwui
    conda install python=3.10

    Installing pytorch and cuda is the hardest part of machine learning. I've come up with this install line from the following sources:

    conda install cuda pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia/label/cuda-11.7.0
    python -c 'import torch; print(torch.cuda.is_available())'
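
A slightly more detailed check than the one-liner above can save a failed build later. This is only a sketch: `preflight_report` is a made-up helper name, and it only reports what it finds (the torch version, the CUDA version torch was built against, and whether `nvcc`, `ninja`, and on Windows `cl` are on the PATH). It does not try to fix anything, and `ninja` may legitimately be missing until step 5.

```python
import shutil
import sys


def preflight_report():
    """Collect facts about the build environment without failing if a piece is missing."""
    report = {
        "python": "%d.%d" % sys.version_info[:2],
        # Build tools are looked up on PATH; None means "not found".
        "nvcc": shutil.which("nvcc"),
        "ninja": shutil.which("ninja"),
        # 'cl' is the MSVC compiler, so it only matters on Windows.
        "cl": shutil.which("cl") if sys.platform == "win32" else "n/a (not Windows)",
    }
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment.
        report["torch"] = None
        report["torch_cuda_build"] = None
        report["cuda_available"] = False
    else:
        report["torch"] = torch.__version__
        # CUDA version torch was compiled against (None for CPU-only builds).
        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in preflight_report().items():
        print(f"{key:>18}: {value}")
```

If `torch_cuda_build` does not match the CUDA toolkit you installed (11.7 in the install line above), or `cuda_available` is False, the GPTQ build in step 5 is unlikely to work.
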
  4. Download text-generation-webui and GPTQ-for-LLaMa

    git clone https://github.com/oobabooga/text-generation-webui.git
    cd text-generation-webui
    mkdir repositories
    cd repositories
    git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git
    cd GPTQ-for-LLaMa
  5. Build and install gptq package and CUDA kernel (you should be in the GPTQ-for-LLaMa directory)

    pip install ninja
    python setup_cuda.py install
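
To confirm the build actually installed something, you can look for the compiled module without importing it. This sketch assumes the extension is installed under the module name `quant_cuda`; that name is an assumption about what `setup_cuda.py` registers, so check the script in your checkout if the result surprises you. `extension_installed` is a hypothetical helper, not part of GPTQ-for-LLaMa.

```python
import importlib.util


def extension_installed(module_name="quant_cuda"):
    """Return True if a top-level module with this name can be located (it is not imported)."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "quant_cuda" is an assumed module name; adjust it to match your setup_cuda.py.
    if extension_installed():
        print("quant_cuda found: the CUDA kernel appears to be installed")
    else:
        print("quant_cuda not found: re-run the build and check the compiler output")
```

Run it inside the same conda env you used for the build, otherwise it will look in the wrong site-packages.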
    
  6. Install the text-generation-webui dependencies

    cd ../..
    pip install -r requirements.txt
    
  7. Download the 13b model from huggingface

    python download-model.py decapoda-research/llama-13b-hf
    

    This will take some time. After it's done, rename the folder to llama-13b.

    The prequantized llama-13b model is available here. Download the llama-13b-4bit.pt file and place it in the models directory, alongside the llama-13b folder.
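
Before launching the server, it can help to confirm the layout described above: a `llama-13b` folder and a `llama-13b-4bit.pt` file side by side in `models`. The helper below is a hypothetical sketch (`missing_model_files` is not part of text-generation-webui) and only checks that the two paths exist, not that their contents are valid.

```python
from pathlib import Path


def missing_model_files(models_dir, name):
    """Return labels for the expected model paths that are not present."""
    models_dir = Path(models_dir)
    expected = {
        f"{name}/ (HF folder)": (models_dir / name).is_dir(),
        f"{name}-4bit.pt": (models_dir / f"{name}-4bit.pt").is_file(),
    }
    return [label for label, present in expected.items() if not present]


if __name__ == "__main__":
    # Run from the text-generation-webui directory.
    missing = missing_model_files("models", "llama-13b")
    if missing:
        print("Missing from models/:", ", ".join(missing))
    else:
        print("Model folder and 4-bit checkpoint are both in place")
```

The same check works for the 30b model later by passing `"llama-30b"` as the name.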

  8. Run the text-generation-webui with llama-13b to test it out

    python server.py --cai-chat --load-in-4bit --model llama-13b --no-stream
    
  9. Download the hf version 30b model from huggingface

    python download-model.py decapoda-research/llama-30b-hf
    

    You can download the pre-quantized 4 bit versions of the model here.

    Alternatively, you'll need to quantize it yourself using GPTQ-for-LLaMa (this will take a while):

    cd repositories/GPTQ-for-LLaMa
    pip install datasets
    HUGGING_FACE_HUB_TOKEN={your huggingface token} CUDA_VISIBLE_DEVICES=0 python llama.py ../../models/llama-30b-hf c4 --wbits 4 --save llama-30b-4bit.pt
    

    Place llama-30b-4bit.pt in the models directory, alongside the llama-30b folder.

  10. Run the text-generation-webui with llama-30b

    python server.py --cai-chat --load-in-4bit --model llama-30b --no-stream
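
As a rough back-of-envelope check on why 4-bit is what makes 30b fit on an RTX 3090 (24 GB of VRAM), here is the arithmetic for the weights alone. The parameter counts are the ones reported for the original LLaMA models. The estimate ignores the per-group scales and zero points that GPTQ stores, as well as activations and the cache, so real usage will be higher.

```python
GIB = 1024 ** 3

# Parameter counts as reported for the original LLaMA models (in billions).
PARAMS_BILLIONS = {"llama-13b": 13.0, "llama-30b": 32.5}


def weight_gib(params_billions, bits_per_weight):
    """Size of the raw weights alone, in GiB (no quantization metadata, no activations)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


if __name__ == "__main__":
    for name, billions in PARAMS_BILLIONS.items():
        fp16 = weight_gib(billions, 16)
        int4 = weight_gib(billions, 4)
        print(f"{name}: fp16 ~{fp16:.1f} GiB, 4-bit ~{int4:.1f} GiB")
```

For 30b this works out to roughly 60 GiB in fp16 versus roughly 15 GiB at 4 bits, which is why only the quantized model is a candidate for a single 24 GB card.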
    
@rayone

rayone commented May 12, 2023

Thanks, however there is no setup_cuda.py;
(base) PS D:\AI\oobabooga_windows\text-generation-webui\repositories\GPTQ-for-LLaMa> ls

Directory: D:\AI\oobabooga_windows\text-generation-webui\repositories\GPTQ-for-LLaMa

Mode   LastWriteTime     Length  Name
d----  2023-05-12 14:44          quant
d----  2023-05-12 14:44          utils
-a---  2023-05-12 14:44      14  .gitignore
-a---  2023-05-12 14:44      52  .style.yapf
-a---  2023-05-12 14:44    1146  convert_llama_weights_to_hf.py
-a---  2023-05-12 14:44    7695  gptq.py
-a---  2023-05-12 14:44   11558  LICENSE.txt
-a---  2023-05-12 14:44   13161  llama_inference_offload.py
-a---  2023-05-12 14:44    4414  llama_inference.py
-a---  2023-05-12 14:44   20094  llama.py
-a---  2023-05-12 14:44   16397  neox.py
-a---  2023-05-12 14:44   17943  opt.py
-a---  2023-05-12 14:44    9006  README.md
-a---  2023-05-12 14:44     181  requirements.txt

@poisenbery

For windows, python -c 'import torch; print(torch.cuda.is_available())' needs to be

python -c "import torch; print(torch.cuda.is_available())"

Also, for Windows, don't you need to use the oobabooga fork of GPTQ-for-LLaMa? https://github.com/oobabooga/GPTQ-for-LLaMa/

@MiningMama71

This is really great and easy to follow. I'm running into an issue when I finally try to test 13b (step 8), where I get:

server.py: error: unrecognized arguments: --cai-chat

I know this is something you specifically wrote, I've seen posts on that, but have no idea what I missed to get it added to my installation.

Ideas? Thank you!

@sprintcheng

May I know how much memory is needed on Linux? I always fail at the last step: the process gets killed, possibly because a memory limit was exceeded.

(textgen) [root@orlop1 text-generation-webui]# python server.py --chat --load-in-4bit --model llama-13b --listen-host http://hostname --listen-port 7860
/root/miniconda3/envs/textgen/lib/python3.11/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
/root/miniconda3/envs/textgen/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
2023-08-02 21:30:43 INFO:Loading llama-13b...
2023-08-02 21:30:43 WARNING:torch.cuda.is_available() returned False. This means that no GPU has been detected. Falling back to CPU mode.
Loading checkpoint shards: 59%|█████████████████████████████████████████████████████████████████████████▏ | 24/41 [02:47<02:14, 7.90s/it]Killed

Regards

@RamiroPrather

people, thank you very much
