A brief summary of how to run a llamafile on a GCP instance and connect to it from, e.g., gptel.el for convenient access to an LLM. The server costs about $1.20 per hour. We'll also show how to shut the server down automatically once no requests have been made for an hour.
Essentially, I'm presenting:

- A systemd service that starts our LLM on port 8081.
- A systemd service that monitors port 8081 and pipes requests to a file.
- A systemd service that watches that file and shuts down the instance once its last modification is more than an hour old.
- A small configuration snippet to connect to it from Emacs.
- Get a GCP instance with a GPU with 40 GB of VRAM, i.e. one of the A100 instances. You need about 70 GB of disk space, but not a lot of RAM or CPU. I recommend getting a spot instance, which cuts the price roughly to a third. The final price comes in at about $1.20 per hour.
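For concreteness, creating such an instance might look roughly like this; the instance name, zone, and image are placeholders, and `a2-highgpu-1g` is the machine type with a single 40 GB A100:

```sh
# Sketch: a spot A100 instance. --maintenance-policy=TERMINATE is required
# for GPU instances; pick an image with NVIDIA drivers or install them yourself.
gcloud compute instances create llm-server \
  --zone=us-central1-a \
  --machine-type=a2-highgpu-1g \
  --provisioning-model=SPOT \
  --maintenance-policy=TERMINATE \
  --boot-disk-size=70GB \
  --image-family=debian-12 \
  --image-project=debian-cloud
```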
- Obtain a llamafile, e.g. this one, downloading it with `wget`. Make sure you can run it (i.e. `chmod +x` it).
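For example (the URL is a placeholder for whichever llamafile you picked; the flags below reappear in the service file later on):

```sh
wget -O model.llamafile https://example.com/path/to/model.llamafile
chmod +x model.llamafile
# Smoke test: serve on port 8081, no browser, all layers offloaded to the GPU.
./model.llamafile --server --nobrowser --port 8081 -ngl 999
```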
- Copy over the service files and scripts (sketched below), and reload systemd with `sudo systemctl daemon-reload`.
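The service files themselves are not reproduced in this summary, so here is a rough sketch of what they might look like; unit names, file paths, and flags are assumptions, so adjust them to your setup. First, the service that launches the llamafile:

```ini
# /etc/systemd/system/launch_llamafile.service -- sketch; adjust the path to your llamafile
[Unit]
Description=Serve a llamafile on port 8081
After=network.target

[Service]
# --nobrowser: don't open a browser; -ngl 999: offload all layers to the GPU
ExecStart=/home/me/model.llamafile --server --nobrowser --port 8081 -ngl 999
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

One way to implement the "pipe requests to a file" service is a line-buffered tcpdump on the loopback device: every request to port 8081 appends to the file, so its modification time always reflects the last request:

```ini
# /etc/systemd/system/monitor_llamafile.service -- sketch
[Unit]
Description=Record activity on port 8081 for idle detection
After=network.target

[Service]
# Touch the file at startup so a fresh boot with no requests still times out.
ExecStartPre=/bin/touch /var/run/llm_last_request
ExecStart=/bin/sh -c 'tcpdump -l -i lo port 8081 > /var/run/llm_last_request'
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

And a small script plus unit that powers the machine off once the file is stale (a stopped instance is billed only for its disk):

```sh
#!/bin/sh
# /usr/local/bin/idle_shutdown.sh -- sketch: poll the activity file's
# modification time and power off once it is over 3600 s old.
FILE=/var/run/llm_last_request
while sleep 60; do
    [ -f "$FILE" ] || continue
    age=$(( $(date +%s) - $(stat -c %Y "$FILE") ))
    if [ "$age" -gt 3600 ]; then
        systemctl poweroff
    fi
done
```

```ini
# /etc/systemd/system/idle_shutdown.service -- sketch
[Unit]
Description=Power off after 1h without requests

[Service]
ExecStart=/usr/local/bin/idle_shutdown.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
```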
- Then enable them all with `systemctl enable --now launch_llamafile.service` etc.
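With the unit names from the sketches above, that amounts to:

```sh
sudo systemctl enable --now launch_llamafile.service
sudo systemctl enable --now monitor_llamafile.service
sudo systemctl enable --now idle_shutdown.service
```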
- Forward port `8081` via SSH, i.e. `ssh -NL 8081:localhost:8081 instance.zone`.
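If you created the instance with gcloud, the same tunnel can be opened through it (instance name and zone as in the sketch above; flags after `--` are passed through to ssh):

```sh
gcloud compute ssh llm-server --zone=us-central1-a -- -NL 8081:localhost:8081
```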
- Configure your local client, e.g. in Emacs:

```elisp
(use-package! gptel
  :config
  (let ((backend (gptel-make-openai  ;Not a typo, same API as OpenAI
                  "llama-cpp"        ;Any name
                  :stream t          ;Stream responses
                  :protocol "http"
                  :host "localhost:8081" ;Llama.cpp server location, typically localhost:8080 for llamafile
                  :key nil           ;No key needed
                  :models '("test")))) ;Any name, doesn't matter for llama.cpp
    (setq-default gptel-backend backend
                  gptel-model "test")))
```
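With the tunnel running, `M-x gptel` then opens a chat buffer backed by the llamafile, and `gptel-send` submits a request from any buffer.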