Running LLaMA models locally on Apple Silicon M1/M2 chips using a nice Web UI

Disclaimer: I'm not a data scientist or an expert in LLaMA models or LLMs, so I won't cover technical details about LLaMA models, Vicuna, or the settings used in my tests. I just wanted to play with LLaMA models and share my setup and results with the community.
I also assume you have some basic knowledge of using a terminal and running Python scripts.
I won't cover the installation process of the tools and libraries used in this document either, but I will provide links to the documentation I used to make this work on my computer.
Finally, I'm not a native English speaker, so please excuse my English mistakes 🙃

Introduction

I wanted to try running a LLaMA model on my computer. Since I had absolutely no knowledge about this, I started by reading a lot of documentation and articles on the Internet.

I used llama.cpp with the 7B and 13B models. I was able to run the raw models in my terminal, but I ran into a lot of performance and memory usage issues.

I did more research and learned that untuned LLaMA models require a lot more effort than fine-tuned versions, especially on Apple M1/M2 chips. That's when I found information about things like Vicuna, LoRAs, and text-generation-webui.

After a lot of effort I was able to make it work on my computer, with good performance and a ChatGPT-like experience. Hurrah! 🎉

Purpose of this document

The purpose of this document is to help people who want to try LLaMA models on their Apple M1/M2 chips and to share my results with the community.

It is not a step-by-step tutorial, but I will provide links to the documentation I used to make this work, along with the tools and the library versions I used.

Model information and specs

I used the same initial prompt and the same settings for all my tests.
I wanted to compare performance across different model weights and different quantization methods (see the short note on quantization right after the list).

  • Vicuna 7B (q5_0)
  • Vicuna 7B (q5_1)
  • Vicuna 13B (q5_0)
  • Vicuna 13B (q5_1)
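
For context, q5_0 and q5_1 are 5-bit quantization formats for llama.cpp's GGML files; roughly speaking, q5_1 stores a little more information per weight block at the cost of a slightly larger file. I downloaded the models already quantized, but if you start from full-precision weights, llama.cpp's quantize tool produces these files with something like the following (the paths are placeholders, and the tool's name and exact usage differ between llama.cpp versions):

# placeholder paths -- adjust to wherever your converted f16 GGML file lives
./quantize ./models/ggml-vicuna-7b-f16.bin ./models/ggml-vicuna-7b-q5_0.bin q5_0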

Models used

Here are the links to the Hugging Face models I used:
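
However you pick your files, one generic way to pull a GGML model repository from Hugging Face is plain git with git-lfs (the repository name below is only a placeholder, not one I actually used):

git lfs install
git clone https://huggingface.co/someuser/vicuna-7b-ggml ./models/vicuna-7b-ggml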

Computer specs

Here are the specs of the computer I used to generate the results:

  • Apple MacBook Air (M2, 2022)
  • Apple Silicon M2 chip with 8-Core CPU, 8-Core GPU, 16GB Unified Memory and 512GB SSD
  • macOS Ventura 13.3.1

Tools and libraries used

Here are the tools and libraries I used to make this work (a rough sketch of the matching setup commands follows this list):

  • text-generation-webui (latest version)
  • Python 3.11.3
  • A Python virtual environment (didn't want to mess with Conda)
  • llama-cpp-python v0.1.41
  • Nightly versions of PyTorch, torchvision, and torchaudio

Optional: not used in this test but needed for many LLaMA fine-tunes available on Hugging Face or Torrent sources.
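
As a rough sketch of what my setup looked like (not a full tutorial; the versions and URLs are simply what worked for me at the time and may be outdated):

# clone the Web UI and create a Python 3.11 virtual environment
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
python3.11 -m venv venv
source venv/bin/activate

# Web UI dependencies
pip install -r requirements.txt

# llama.cpp bindings, pinned to the version I used
pip install llama-cpp-python==0.1.41

# nightly PyTorch builds (the index URL may change over time)
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu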

Initial prompt

Fun fact: I used ChatGPT with the GPT-4 model to generate this prompt 😁
I told it that I wanted to make a performance comparison of LLaMA models using a maximum of 2000 tokens and that I wanted a concise and challenging prompt.

Here's the prompt that was generated:

Imagine a futuristic scenario in which an international organization is working on a large-scale project to deploy swarms of nanobots for terraforming Mars. Discuss the potential risks and benefits of such an endeavor, considering aspects such as environmental impact, potential human colonization, technological advancements, and the effects on international cooperation and competition. Also, address the role of AI and large language models in assisting with decision-making, problem-solving, and communication during this project, while considering any ethical concerns that may arise in the process.

Nice one, ChatGPT! 👍

Model settings

I have really limited knowledge about what all these settings do. They might not be optimal, but they worked for me.

  • max_tokens: 2000
  • top_p: 0.85
  • top_k: 50
  • temperature: 0.7
  • repetition_penalty: 1.2
  • typical_p: 0.5
  • Web UI mode: chat (start the UI with the --chat parameter then select instruct in the UI)
  • Instruction template: Vicuna-v1

Good to know: you can create a text preset file in the presets folder. Here's mine:

do_sample=True
top_p=0.85
top_k=50
temperature=0.7
repetition_penalty=1.2
typical_p=0.5
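
For reference, here is a minimal sketch of roughly the same sampling settings applied directly through llama-cpp-python, outside the Web UI. The model path and prompt are placeholders, and I'm not certain typical_p was exposed in the version I used, so it is left out:

from llama_cpp import Llama

# placeholder path to a quantized Vicuna GGML file
llm = Llama(
    model_path="./models/vicuna-13b-q5_0.bin",
    n_ctx=2048,        # enough context for a ~2000-token generation
    n_threads=4,
    use_mlock=True,
)

output = llm(
    "Imagine a futuristic scenario in which an international organization ...",  # the prompt above
    max_tokens=2000,
    temperature=0.7,
    top_p=0.85,
    top_k=50,
    repeat_penalty=1.2,
)
print(output["choices"][0]["text"])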

Command and parameters used to start the UI

I learned that for my computer's specs, the best way to run this was to use 4 threads. It might be different for you, so you should look at the links I provided at the end of this document.
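
If you want to pick a thread count based on your own chip, macOS exposes the core counts through sysctl; on Apple Silicon the perflevel0 key should report the performance cores (I believe it needs macOS 12 or later):

sysctl -n hw.ncpu                    # total logical cores
sysctl -n hw.perflevel0.physicalcpu  # performance cores on Apple Silicon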

python server.py --threads 4 --mlock --chat

I'm not sure whether the --mlock parameter is still needed on Apple M1/M2 chips, since they have a unified memory architecture, but after reading some feedback on the Internet it looks like it improves performance.

Performance results

Model weight | Quantization method | Total time | Tokens/s | Used tokens | Seed
-------------|---------------------|------------|----------|-------------|-----------
7B           | q5_0                | 45.97s     | 6.20     | 285         | 1582921988
7B           | q5_1                | 57.05s     | 5.89     | 336         | 724237145
13B          | q5_0                | 222.42s    | 3.54     | 788         | 539278247
13B          | q5_1                | 133.39s    | 3.26     | 435         | 2042331722
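
(Tokens/s here is simply the used token count divided by the total time, e.g. 285 tokens / 45.97 s ≈ 6.20 tokens/s for the first run.)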

Conclusion

Well, as I said, I'm not an expert and I don't know whether these results are good or not, so feel free to run your own tests and share your results in the comments of this Gist.

The only thing I can say is that I'm really impressed by the performance of LLaMA models running locally on my computer. I'm also impressed by the quality of the results: the generated responses are really good, and I'm sure that with a little bit of fine-tuning they could be even better.

It was a really fun experience and I learned a lot.
Having my own ChatGPT-like model running locally on my computer is really cool, even if for now it is not at all comparable to the quality of the GPT-4 model.

I hope this Gist will help some people get LLaMA models working on their Apple M1/M2 chips. Feel free to ask questions and share your results in the comments of this Gist, or reach out to me on Twitter @Manoz.

Useful links

@mindonscreen commented:

Great post! Can't wait to try this out! 🥇

@Manoz (author) commented on Jul 22, 2023:

Oh, thanks! Hope it will help. Be careful though, it might not be up to date, as there is new stuff every week :)
