Mistral 7B locally on OSX with llama.cpp

Clone and build llama.cpp

As of this writing (Nov 22nd, 2023), Metal is enabled by default and arm64 is detected correctly, so no special cmake flags are needed.

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/
mkdir build && cd build/ && cmake .. && make -j && cd ..

Download the model

Choose your version of Mistral 7B on Hugging Face. I went with OpenOrca's Q5_K_S.

cd models/
wget https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/resolve/main/mistral-7b-openorca.Q5_K_S.gguf
cd ..
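If you prefer not to use wget, a recent huggingface_hub CLI can fetch the same file; this assumes you have huggingface-cli installed (e.g. via pip) and are still inside the models/ directory:

pip install huggingface_hub
huggingface-cli download TheBloke/Mistral-7B-OpenOrca-GGUF mistral-7b-openorca.Q5_K_S.gguf --local-dir .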

For a ranking of different models, see the Local LLM Comparison.

Run inference

To load the model in interactive mode (-i) with a basic config suited to an M2 Mac running macOS with 16 GB of RAM (10 threads, all 32 layers offloaded to the GPU):

./build/bin/main -t 10 -ngl 32 \
  -m "/path/to/mistral-7b-openorca.Q5_K_S.gguf" \
  --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 \
  -i -ins

With the config above, I get a little over 90 tokens per second.
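For a quick non-interactive smoke test, you can instead pass a prompt directly and cap the number of generated tokens; the prompt and token count below are just illustrative:

./build/bin/main -t 10 -ngl 32 \
  -m "/path/to/mistral-7b-openorca.Q5_K_S.gguf" \
  -p "Explain what the GGUF model format is in one paragraph." \
  -n 256 --temp 0.7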

Refer to Mistral on Hugging Face and Mistral OpenOrca for general documentation on the model, and to ./build/bin/main -h for help with llama.cpp flags.
