@hughpearse
Last active December 5, 2023 15:14

Run local LLM OpenAI Compatible Inference Server (llama.cpp)

Here I describe how to quickly set up a local, OpenAI-compatible LLM inference server.

Install Docker (RHEL 7)

Docker is a prerequisite. Complete the following steps to set it up:

dc-user@devcloud$ sudo yum update
dc-user@devcloud$ sudo yum install -y yum-utils
dc-user@devcloud$ sudo yum-config-manager --add-repo http://yum.oracle.com/public-yum-ol7.repo
dc-user@devcloud$ sudo yum-config-manager --enable *addons
dc-user@devcloud$ sudo yum update
dc-user@devcloud$ sudo yum install docker-engine
dc-user@devcloud$ sudo systemctl enable --now docker
dc-user@devcloud$ sudo groupadd docker
dc-user@devcloud$ sudo usermod -aG docker ${USER}
dc-user@devcloud$ sudo systemctl reboot
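After the reboot, a quick sanity check (a minimal sketch, not part of the original setup) can confirm that your user is in the docker group and that the daemon is reachable. Both lines print a status either way, so they are safe to run even if something above did not take effect:

```shell
# Check whether the current user's groups include "docker"
id -nG | grep -qw docker && echo "docker group: ok" || echo "docker group: missing"
# Check whether the Docker daemon responds
docker info >/dev/null 2>&1 && echo "daemon: reachable" || echo "daemon: unreachable"
```

If either line reports a failure, re-run the corresponding step above before continuing.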

Download a language model

Select a quantized GGUF model from Hugging Face. The example below uses TheBloke's 2-bit (Q2_K) quantization of Llama 2 7B Chat.

dc-user@devcloud$ mkdir models
dc-user@devcloud$ curl -L https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q2_K.gguf --output ./models/llama-2-7b-chat.Q2_K.gguf
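A failed or redirected download often leaves an HTML error page where the model file should be. A hypothetical helper (`check_gguf` is my own name, not part of llama.cpp) can verify the download by checking the first four bytes for the "GGUF" magic:

```shell
# Print "valid GGUF" if the file starts with the GGUF magic bytes,
# otherwise print "not a GGUF file".
check_gguf() {
  if [ "$(head -c 4 "$1")" = "GGUF" ]; then
    echo "valid GGUF"
  else
    echo "not a GGUF file"
  fi
}
# Example: check_gguf ./models/llama-2-7b-chat.Q2_K.gguf
```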

Launch the inference server

This launches both the llama.cpp server (port 8080) and the OpenAI-compatible API wrapper script (port 8081).

dc-user@devcloud$ docker run -p 8080:8080 -p 8081:8081 \
    -v $PWD/models:/models \
    --entrypoint "/bin/bash" \
    ghcr.io/ggerganov/llama.cpp:full-c5b49360d0d9e49f32e05a9116e90bd0b39a282d \
    -c "python3 -m pip install Flask==3.0.0 requests==2.31.0 urllib3==2.1.0; /app/.devops/tools.sh --server -m /models/llama-2-7b-chat.Q2_K.gguf -c 2048 -ngl 43 -mg 1 --host 0.0.0.0 --port 8080 & /app/examples/server/api_like_OAI.py --host 0.0.0.0"
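The server takes a moment to load the model, so requests sent immediately may fail. A hypothetical helper (`wait_for_server` is my own name, not part of llama.cpp) can poll until the server answers or a timeout expires:

```shell
# Poll a URL once per second until it responds or the given number of
# tries is exhausted; prints "server up" or "timed out".
wait_for_server() {
  url="$1"; tries="${2:-30}"
  for i in $(seq "$tries"); do
    if curl -sf "$url" >/dev/null 2>&1; then
      echo "server up"; return 0
    fi
    sleep 1
  done
  echo "timed out"; return 1
}
# Example: poll the llama.cpp server's web root on port 8080
# wait_for_server http://localhost:8080/ 60
```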

Test it

dc-user@devcloud$ curl --request POST --url http://localhost:8081/v1/completions --header "Content-Type: application/json" --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
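For anything beyond a one-off test, the request body is easier to edit if built separately first. This is the same payload as the one-liner above (same endpoint, same fields), just assembled in a heredoc:

```shell
# Build the JSON request body in a variable, then POST it with curl.
PAYLOAD=$(cat <<'EOF'
{
  "prompt": "Building a website can be done in 10 simple steps:",
  "n_predict": 128
}
EOF
)
echo "$PAYLOAD"
# Then send it:
# curl --request POST --url http://localhost:8081/v1/completions \
#   --header "Content-Type: application/json" --data "$PAYLOAD"
```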

