@jumping
Forked from adrienbrault/llama2-mac-gpu.sh
Created September 12, 2023 14:32
Run Llama-2-13B-chat locally on your M1/M2 Mac with GPU inference. Uses 10GB RAM. UPDATE: see https://twitter.com/simonw/status/1691495807319674880?s=20
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build it
make clean
LLAMA_METAL=1 make
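# Optional check (not in the original gist): confirm the binary built before continuing
./main --help | head -n 5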
# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
wget "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}"
# Run
echo "Prompt: " \
&& read PROMPT \
&& ./main \
--threads 8 \
--n-gpu-layers 1 \
--model "${MODEL}" \
--color \
--ctx-size 2048 \
--temp 0.7 \
--repeat_penalty 1.1 \
--n-predict -1 \
--prompt "[INST] ${PROMPT} [/INST]"