Running LLaMA models locally on Apple Silicon M1/M2 chips using a nice Web UI

Disclaimer: I'm not a data scientist or an expert in LLaMA models or LLMs, so I won't cover technical details about LLaMA models, Vicuna, or the settings used in my tests. I just wanted to play with LLaMA models and share my setup and results with the community.
I also assume you have some basic knowledge of using a terminal and running Python scripts.
I won't cover the installation process of the tools and libraries used in this document either, but I will provide links to the documentation I used to make this work on my computer.
Finally, I'm not a native English speaker, so please excuse my English mistakes 🙃

Introduction

I wanted to try running a LLaMA model on my computer. Since I had absolutely no knowledge about this, I started by reading a lot of documentation and articles on the Internet.

I used llama.cpp with the 7B and 13B models. I was able to run the raw models in my terminal, but I ran into a lot of performance and memory usage issues.

I did more research and learned that untuned LLaMA models require a lot more effort than fine-tuned versions, especially on Apple M1/M2 chips. That's when I found information about things like Vicuna, LoRAs, and text-generation-webui.

After a lot of effort I was able to make it work on my computer, with good performance and a ChatGPT-like experience. Hurrah! 🎉

Purpose of this document

The purpose of this document is to help people who want to try LLaMA models on their Apple M1/M2 chips and to share my results with the community.

It is not a step-by-step tutorial, but I will provide links to the documentation I used to make this work, along with the tools and the library versions I used.

Model information and specs

I used the same initial prompt and the same settings for all my tests.
I wanted to compare performance across different model weights and different quantization methods (see the short note on quantization right after the list).

  • Vicuna 7B (q5_0)
  • Vicuna 7B (q5_1)
  • Vicuna 13B (q5_0)
  • Vicuna 13B (q5_1)
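
For context, q5_0 and q5_1 are 5-bit quantization formats for llama.cpp's GGML files; roughly speaking, q5_1 stores a little more information per weight block at the cost of a slightly larger file. I downloaded the models already quantized, but if you start from full-precision weights, llama.cpp's quantize tool produces these files with something like the following (the paths are placeholders, and the tool's name and exact usage differ between llama.cpp versions):

# placeholder paths -- adjust to wherever your converted f16 GGML file lives
./quantize ./models/ggml-vicuna-7b-f16.bin ./models/ggml-vicuna-7b-q5_0.bin q5_0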

Models used

Here are the links to the Hugging Face models I used:
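
However you pick your files, one generic way to pull a GGML model repository from Hugging Face is plain git with git-lfs (the repository name below is only a placeholder, not one I actually used):

git lfs install
git clone https://huggingface.co/someuser/vicuna-7b-ggml ./models/vicuna-7b-ggml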

Computer specs

Here are the specs of the computer I used to generate the results:

  • Apple MacBook Air (M2, 2022)
  • Apple Silicon M2 chip with 8-Core CPU, 8-Core GPU, 16GB Unified Memory and 512GB SSD
  • macOS Ventura 13.3.1

Tools and libraries used

Here are the tools and libraries I used to make this work (a rough sketch of the matching setup commands follows this list):

  • text-generation-webui (latest version)
  • Python 3.11.3
  • A Python virtual environment (didn't want to mess with Conda)
  • llama-cpp-python v0.1.41
  • Nightly versions of PyTorch, torchvision, and torchaudio

Optional: not used in this test but needed for many LLaMA fine-tunes available on Hugging Face or Torrent sources.
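
As a rough sketch of what my setup looked like (not a full tutorial; the versions and URLs are simply what worked for me at the time and may be outdated):

# clone the Web UI and create a Python 3.11 virtual environment
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
python3.11 -m venv venv
source venv/bin/activate

# Web UI dependencies
pip install -r requirements.txt

# llama.cpp bindings, pinned to the version I used
pip install llama-cpp-python==0.1.41

# nightly PyTorch builds (the index URL may change over time)
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu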

Initial prompt

Fun fact: I used ChatGPT with the GPT-4 model to generate this prompt 😁
I told it that I wanted to make a performance comparison of LLaMA models using a maximum of 2000 tokens and that I wanted a concise and challenging prompt.

Here's the prompt that was generated:

Imagine a futuristic scenario in which an international organization is working on a large-scale project to deploy swarms of nanobots for terraforming Mars. Discuss the potential risks and benefits of such an endeavor, considering aspects such as environmental impact, potential human colonization, technological advancements, and the effects on international cooperation and competition. Also, address the role of AI and large language models in assisting with decision-making, problem-solving, and communication during this project, while considering any ethical concerns that may arise in the process.

Nice one, ChatGPT! 👍

Model settings

I have really limited knowledge about what all these settings do. They might not be optimal, but they worked for me.

  • max_tokens: 2000
  • top_p: 0.85
  • top_k: 50
  • temperature: 0.7
  • repetition_penalty: 1.2
  • typical_p: 0.5
  • Web UI mode: chat (start the UI with the --chat parameter then select instruct in the UI)
  • Instruction template: Vicuna-v1

Good to know: you can create a text preset file in the presets folder. Here's mine:

do_sample=True
top_p=0.85
top_k=50
temperature=0.7
repetition_penalty=1.2
typical_p=0.5
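
For reference, here is a minimal sketch of roughly the same sampling settings applied directly through llama-cpp-python, outside the Web UI. The model path and prompt are placeholders, and I'm not certain typical_p was exposed in the version I used, so it is left out:

from llama_cpp import Llama

# placeholder path to a quantized Vicuna GGML file
llm = Llama(
    model_path="./models/vicuna-13b-q5_0.bin",
    n_ctx=2048,        # enough context for a ~2000-token generation
    n_threads=4,
    use_mlock=True,
)

output = llm(
    "Imagine a futuristic scenario in which an international organization ...",  # the prompt above
    max_tokens=2000,
    temperature=0.7,
    top_p=0.85,
    top_k=50,
    repeat_penalty=1.2,
)
print(output["choices"][0]["text"])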

Command and parameters used to start the UI

I learned that for my computer's specs, the best way to run this was to use 4 threads. It might be different for you, so you should look at the links I provided at the end of this document.
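
If you want to pick a thread count based on your own chip, macOS exposes the core counts through sysctl; on Apple Silicon the perflevel0 key should report the performance cores (I believe it needs macOS 12 or later):

sysctl -n hw.ncpu                    # total logical cores
sysctl -n hw.perflevel0.physicalcpu  # performance cores on Apple Silicon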

python server.py --threads 4 --mlock --chat

I'm not sure whether the --mlock parameter is still needed on Apple M1/M2 chips, since they have a unified memory architecture, but after reading some feedback on the Internet it looks like it improves performance.

Performance results

Model weight | Quantization method | Total time | Tokens/s | Used tokens | Seed
-------------|---------------------|------------|----------|-------------|-----------
7B           | q5_0                | 45.97s     | 6.20     | 285         | 1582921988
7B           | q5_1                | 57.05s     | 5.89     | 336         | 724237145
13B          | q5_0                | 222.42s    | 3.54     | 788         | 539278247
13B          | q5_1                | 133.39s    | 3.26     | 435         | 2042331722
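
(Tokens/s here is simply the used token count divided by the total time, e.g. 285 tokens / 45.97 s ≈ 6.20 tokens/s for the first run.)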

Conclusion

Well, as I said, I'm not an expert and I don't know whether these results are good or not, so feel free to run your own tests and share your results in the comments of this Gist.

The only thing I can say is that I'm really impressed by the performance of LLaMA models running locally on my computer. I'm also impressed by the quality of the results: the generated responses are really good, and I'm sure that with a little bit of fine-tuning they could be even better.

It was a really fun experience and I learned a lot.
Having my own ChatGPT-like model running locally on my computer is really cool, even if for now it is not at all comparable to the quality of the GPT-4 model.

I hope this Gist will help some people get LLaMA models working on their Apple M1/M2 chips. Feel free to ask questions and share your results in the comments of this Gist, or reach out to me on Twitter @Manoz.

Useful links

@mindonscreen commented:

Great post! Can't wait to try this out! 🥇

@Manoz (author) commented on Jul 22, 2023:

Oh, thanks! Hope it will help. Be careful though, it might not be up to date, as there is new stuff every week :)
