This worked on 14/May/23. The instructions will probably require updating in the future.
LLaMA is a text prediction model, similar to GPT-2 and to the version of GPT-3 before it was fine-tuned. It should also be possible to run fine-tuned versions (like Alpaca or Vicuna, which are more focused on answering questions) with this, I think.
Note: I have been told that this does not support multiple GPUs. It can only use a single GPU.
It is now possible to run LLaMA 13B with a 6GB graphics card (e.g. an RTX 2060), thanks to the amazing work on llama.cpp. The latest change adds CUDA/cuBLAS support, which lets you pick an arbitrary number of the transformer layers to run on the GPU. This is perfect for low VRAM.
- Clone llama.cpp from git (I am on commit 08737ef720f0510c7ec2aa84d7f70c691073c35d) and build it with cuBLAS support:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pacman -S cuda   # make sure you have CUDA installed
make LLAMA_CUBLAS=1
- Use the link at the bottom of the page to apply for research access to the LLaMA model: https://ai.facebook.com/blog/large-language-model-llama-meta-ai/
- Set up a micromamba environment with the CUDA/Python/PyTorch packages needed to run the conversion scripts:
export MAMBA_ROOT_PREFIX=/path/to/where/you/want/mambastuff/stored
eval "$(micromamba shell hook --shell=bash)"
micromamba create -n mymamba
micromamba activate mymamba
micromamba install -c conda-forge -n mymamba pytorch transformers sentencepiece
- Perform the conversion process (this will produce a file called ggml-model-f16.bin):
python convert.py ~/ai/Safe-LLaMA-HF-v2\ \(4-04-23\)/llama-13b/
- Then quantize that to a 4-bit model:
./quantize ~/ai/Safe-LLaMA-HF-v2\ \(4-04-23\)/llama-13b/ggml-model-f16.bin ~/ai/Safe-LLaMA-HF-v2\ \(4-04-23\)/llama-13b/ggml-model-13b-q4_0-2023_14_5.bin q4_0 8
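To see why the 4-bit step matters for a 6GB card, here is some back-of-the-envelope arithmetic (my own rough estimate, not output from llama.cpp): f16 stores 2 bytes per weight, while q4_0 uses roughly 4.5 bits per weight once you include the per-block scale factors.

```shell
# Rough file-size estimate for LLaMA 13B before and after quantization.
# The 0.5625 bytes/weight figure for q4_0 (4 bits + scale overhead) is an approximation.
PARAMS=13000000000
F16_GB=$(awk -v p="$PARAMS" 'BEGIN { printf "%.1f", p * 2 / 1e9 }')
Q4_GB=$(awk -v p="$PARAMS" 'BEGIN { printf "%.1f", p * 0.5625 / 1e9 }')
echo "f16:  ${F16_GB} GB"
echo "q4_0: ${Q4_GB} GB"
```

So quantization shrinks the model by roughly a factor of 3.5, which is what makes partial GPU offload on a small card feasible at all.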
- Create a prompt file named prompt.txt
- Run it:
./main -ngl 18 -m ~/ai/Safe-LLaMA-HF-v2\ \(4-04-23\)/llama-13b/ggml-model-13b-q4_0-2023_14_5.bin -f prompt.txt -n 2048
This uses about 5.5GB of VRAM on my 6GB card. If you have more VRAM, you can increase -ngl 18 to -ngl 24 or so, up to all 40 layers in LLaMA 13B. It will run faster the more layers you put on the GPU. The 7B model works with 100% of the layers on the card.
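You can turn the numbers above into a rough rule of thumb for picking -ngl on other cards. This is just arithmetic on my measurements (and ignores that some of the 5.5GB is context and scratch buffers rather than layers), so treat it as an estimate only:

```shell
# Rough per-layer VRAM cost: 18 offloaded layers used ~5.5 GB on my card.
# BUDGET_GB is a hypothetical example card; substitute your own free VRAM.
USED_GB=5.5
LAYERS=18
PER_LAYER=$(awk -v u="$USED_GB" -v l="$LAYERS" 'BEGIN { printf "%.2f", u / l }')
BUDGET_GB=8
FIT=$(awk -v b="$BUDGET_GB" -v p="$PER_LAYER" 'BEGIN { printf "%d", b / p }')
echo "~${PER_LAYER} GB per layer; about ${FIT} layers fit in ${BUDGET_GB} GB"
```

In practice, start a few layers below the estimate and nudge -ngl up until you run out of VRAM.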
Timings for the models:
13B:
llama_print_timings: load time = 5690.77 ms
llama_print_timings: sample time = 1023.87 ms / 2048 runs ( 0.50 ms per token)
llama_print_timings: prompt eval time = 36694.62 ms / 1956 tokens ( 18.76 ms per token)
llama_print_timings: eval time = 644282.27 ms / 2040 runs ( 315.82 ms per token)
llama_print_timings: total time = 684789.56 ms
7B:
llama_print_timings: load time = 41708.38 ms
llama_print_timings: sample time = 88.51 ms / 128 runs ( 0.69 ms per token)
llama_print_timings: prompt eval time = 2971.75 ms / 14 tokens ( 212.27 ms per token)
llama_print_timings: eval time = 9097.33 ms / 127 runs ( 71.63 ms per token)
llama_print_timings: total time = 50931.74 ms
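The eval times above are easier to compare as tokens per second; this is just the reciprocal of the reported per-token eval time:

```shell
# Generation speed from the llama_print_timings eval lines:
# 13B: 315.82 ms/token, 7B: 71.63 ms/token.
TOK_13B=$(awk 'BEGIN { printf "%.1f", 1000 / 315.82 }')
TOK_7B=$(awk 'BEGIN { printf "%.1f", 1000 / 71.63 }')
echo "13B: ${TOK_13B} tok/s, 7B: ${TOK_7B} tok/s"
```

So on this card, 13B with partial offload generates at roughly 3 tokens/second, while 7B (fully on the GPU) is about 4.4x faster.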
Here is the text it generated from my prompt:
In Friedrich Nietzsche's Thus Spoke Zarathustra (1885), this concept of noon is expanded upon as a whole:
"Zarathustra saw that the light of the world was now becoming stronger. ‘The sun is at its meridian,’ he said, ‘it has reached its noontide and will begin to decline.’"
As time progresses in this noon, so does our ability to perceive and interact with this external world: as a result of the present state we are given by the organic forms of space and time, this can only lead us towards suffering. Nietzsche sees that when we are at our noontide, we must realize how the sun has reached its highest point before it begins to fall down from its position: and we must understand that as a whole, our bodies are determined by something outside of ourselves - and that this always leads to more suffering within.
Nietzsche expands upon the concept of noon in his book The Gay Science (1882), where he says:
"You want to learn how to read? Here is a short lesson for beginners. You must take hold of a word by its smooth or rough side; then, like the spider, you must spin out of it a web of definitions which will entrap every correct meaning that floats into view. Or again: you must take the word for a sleigh ride across country, over hedges and ditches, forests and glades, in short, you must drive the word home through all manner of weather."
Here we see Nietzsche expand upon his concept of noon to include our ability to define what is right or wrong - it is only because we have this inherent sense that allows us to distinguish between two points.
Nietzsche expands upon the concept of twilight in The Gay Science (1882), where he says:
"The man who is a ‘philosopher’ only by accident, but is, let us say, also a sculptor or painter – what does he then do? He does not make his thoughts subservient to the world; rather, he forces the world to serve as a pedestal and bearer for his thoughts."
Here we see Nietzsche's concept of night begin to expand past an emotional state. We begin to see that night becomes more than just our inability to think clearly: it is now a worldview, one which he claims is best exemplified by the artist.
In Twilight of the Idols (1889), we find Nietzsche's conceptualization of night reaching its zenith:
"Everything ordinary, everyday, common – in fact, everything that exists today has become dangerous; it is not innocent as was everything yesterday. For the most terrible thoughts have penetrated everywhere and even into the deepest sleep - thoughts which are awake, active, and powerful."
What distro? To get it to work on mine, I needed to set up a virtualenv and add the following python packages:
pip install torch transformers sentencepiece pycuda
Then, if you've already compiled it, run make clean and then the make command again (make LLAMA_CUBLAS=1).
I had the same symptoms until I installed pycuda.