Mistral 7B locally on OSX with llama.cpp

Clone and build llama.cpp

As of this writing (Nov 22nd, 2023), Metal is enabled by default and arm64 is detected correctly, so no special cmake flags are needed.

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/
mkdir build && cd build/ && cmake .. && make -j && cd ..

Download the model

Choose your version of Mistral 7B on Hugging Face. I went with OpenOrca's Q5_K_S.

cd models/
wget https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/resolve/main/mistral-7b-openorca.Q5_K_S.gguf
cd ..
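If you prefer not to use wget, a recent huggingface_hub CLI can fetch the same file; this assumes you have huggingface-cli installed (e.g. via pip) and are still inside the models/ directory:

pip install huggingface_hub
huggingface-cli download TheBloke/Mistral-7B-OpenOrca-GGUF mistral-7b-openorca.Q5_K_S.gguf --local-dir .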

For a ranking of different models, see the Local LLM Comparison.

Run inference

To load the model in interactive mode (-i) with a basic config suited to an M2 Mac running macOS with 16 GB of RAM (10 threads, all 32 layers offloaded to the GPU):

./build/bin/main -t 10 -ngl 32 \
  -m "/path/to/mistral-7b-openorca.Q5_K_S.gguf" \
  --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 \
  -i -ins

With the config above, I get a little over 90 tokens per second.
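For a quick non-interactive smoke test, you can instead pass a prompt directly and cap the number of generated tokens; the prompt and token count below are just illustrative:

./build/bin/main -t 10 -ngl 32 \
  -m "/path/to/mistral-7b-openorca.Q5_K_S.gguf" \
  -p "Explain what the GGUF model format is in one paragraph." \
  -n 256 --temp 0.7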

Refer to Mistral on Hugging Face and Mistral OpenOrca for general documentation on the model, and to ./build/bin/main -h for help with llama.cpp flags.
