Follow this guide https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md
Download a quantized model in GGUF format. Pick a recommended quant, e.g. codellama-7b.Q5_K_M.gguf (Q5_K_M is a good quality/size trade-off)
Check how many GPU cores you have https://www.reddit.com/r/macbook/comments/o3k9a1/comment/h2c9jmu/
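On Apple Silicon you can also query the GPU core count locally instead of looking it up. A sketch (macOS-only; assumes `system_profiler SPDisplaysDataType` prints a "Total Number of Cores" line in its Graphics/Displays section, which recent macOS versions do):

```python
import re
import subprocess

def parse_gpu_cores(report: str):
    # system_profiler prints a line like "Total Number of Cores: 10"
    # for the GPU on Apple Silicon Macs.
    match = re.search(r"Total Number of Cores:\s*(\d+)", report)
    return int(match.group(1)) if match else None

def detect_gpu_cores():
    # macOS only; shells out to system_profiler (not run at import time).
    report = subprocess.run(
        ["system_profiler", "SPDisplaysDataType"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_cores(report)
```

Call `detect_gpu_cores()` on the Mac itself; it returns `None` if the label isn't found.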
Run the server ($MODEL is the path to the downloaded .gguf file; --n_gpu_layers sets how many layers are offloaded to the Metal GPU, and -1 offloads all of them)
python -m llama_cpp.server --model $MODEL --n_gpu_layers 14
View the interactive API docs at http://localhost:8000/docs
Chat (write the request body to chat.json; finish the cat input with Ctrl-D)
>>> cat > chat.json
{
  "prompt": "USER: Tell me something interesting.\nASSISTANT:",
  "stop": ["USER:"]
}
>>> curl -X POST -H "Content-Type: application/json" -d @chat.json http://localhost:8000/v1/completions
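The same request can be sent from Python with just the standard library. A minimal sketch, assuming the server above is running on localhost:8000 and returns OpenAI-style completion JSON (a "choices" list whose items carry a "text" field):

```python
import json
import urllib.request

SERVER = "http://localhost:8000/v1/completions"  # default llama_cpp.server address

def build_payload(user_message: str) -> dict:
    # Mirror chat.json: a single-turn prompt plus a stop sequence so the
    # model does not keep generating past the next "USER:" turn.
    return {
        "prompt": f"USER: {user_message}\nASSISTANT:",
        "stop": ["USER:"],
    }

def complete(user_message: str) -> str:
    req = urllib.request.Request(
        SERVER,
        data=json.dumps(build_payload(user_message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# Usage (requires the server to be up):
# print(complete("Tell me something interesting."))
```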