@linux-china
Last active July 23, 2023 14:30
Run a Llama-2-13B-chat RESTful server locally on your M1/M2/Intel Mac with GPU inference.
### Llama 2 Chat
POST http://127.0.0.1:8080/completion
Content-Type: application/json

{
  "prompt": "What is Java Language?",
  "temperature": 0.7
}
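Llama 2 chat models were trained on an `[INST]`-style prompt template (per Meta's published chat format), so wrapping a raw question in it usually produces better answers than sending the question alone. A minimal sketch for building such a prompt; the system message here is just an illustrative placeholder:

```shell
# Build a Llama 2 chat-style prompt string for the "prompt" field of /completion.
QUESTION="What is Java Language?"
PROMPT="[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

${QUESTION} [/INST]"

# Print the prompt that would be sent to the server.
echo "$PROMPT"
```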
### Llama 2 tokenize
POST http://127.0.0.1:8080/tokenize
Content-Type: application/json

{
  "content": "What is Java Language?"
}
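The tokenize request above can also be issued from the command line with curl (this assumes the server from the steps below is already running on port 8080):

```shell
# Ask the local llama.cpp server to tokenize a string;
# the JSON response carries the token ids for the given content.
curl -X POST --location "http://127.0.0.1:8080/tokenize" \
  -H "Content-Type: application/json" \
  -d '{
        "content": "What is Java Language?"
      }'
```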
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build it. If your Mac has an Intel CPU, remove `LLAMA_METAL=1`
LLAMA_METAL=1 make
# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
# export MODEL=llama-2-70b-chat.ggmlv3.q4_0.bin
wget "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}"
# Run the server. -t 8: CPU threads; -ngl 1: offload layers to the GPU (Metal). Drop -ngl on Intel CPUs
./server -t 8 -ngl 1 -m "${MODEL}"
# curl to test API
curl -X POST --location "http://127.0.0.1:8080/completion" \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "What is Java Language?",
        "temperature": 0.7
      }'
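To pull just the generated text out of the server's JSON reply, pipe it through `jq`. The response below is a trimmed sample of the shape `/completion` returns (real responses carry more fields, and field names may vary across llama.cpp versions):

```shell
# A trimmed sample of a /completion response, stood in for a live server reply.
RESPONSE='{"content":" Java is a general-purpose programming language.","tokens_predicted":10}'

# Extract only the generated text; with a running server this would be:
#   curl -s ... http://127.0.0.1:8080/completion -d '...' | jq -r '.content'
echo "$RESPONSE" | jq -r '.content'
```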