Run your own LLM on a GCP instance.

Brief summary of how to run a llamafile on a GCP instance and connect to it via e.g. gptel.el for convenient access to an LLM. The server costs about $1.20 per hour. We'll also show how to shut the server down automatically once no requests have been made for 1 hour.

Essentially, I'm presenting:

  • A systemd service that starts our LLM on port 8081.
  • A systemd service that monitors port 8081 and pipes requests to a file.
  • A systemd service that monitors the file and shuts down the instance when the file was last modified more than 1 hour ago.
  • A small configuration snippet to connect to it from Emacs.

Installation

  • Get a GCP instance with a GPU with 40GB of VRAM, i.e. one of the A100 instances. You need about 70GB of disk space, but not a lot of RAM or CPU. I recommend getting a spot instance, which reduces the price by roughly 3x. The final price comes in at about $1.20 per hour.
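
    For example, a spot instance with a single A100 40GB can be created roughly like this (a sketch; the instance name, zone, and disk size are placeholders, and you still need an image with NVIDIA drivers/CUDA set up):

    gcloud compute instances create llm-server \
        --zone=us-central1-a \
        --machine-type=a2-highgpu-1g \
        --maintenance-policy=TERMINATE \
        --provisioning-model=SPOT \
        --boot-disk-size=100GB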

  • Obtain a llamafile with wget, e.g. the Dolphin 2.5 Mixtral 8x7B one referenced in the service file below. Make sure you can run it (i.e. chmod +x it).
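
    For example (the download URL is a placeholder; substitute the link for whichever llamafile you picked. The filename matches the one used in the service file below):

    wget -O dolphin-2.5-mixtral-8x7b.Q5_K_M.llamafile '<llamafile-download-url>'
    chmod +x dolphin-2.5-mixtral-8x7b.Q5_K_M.llamafile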

  • Copy over the service files and scripts below, then reload systemd with sudo systemctl daemon-reload.
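
    Assuming the unit files and scripts are saved locally under the names used in the FILE headers below:

    sudo cp file_monitor.service launch_llamafile.service port_monitor.service /etc/systemd/system/
    sudo cp monitor_file.sh monitor_ports.sh /usr/local/bin/
    sudo systemctl daemon-reload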

  • Then enable them all with systemctl enable --now launch_llamafile.service etc.
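
    For example:

    sudo systemctl enable --now launch_llamafile.service port_monitor.service file_monitor.service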

  • Forward port 8081 via ssh, e.g. ssh -NL 8081:localhost:8081 instance.zone.

  • Configure your local client, e.g. in Emacs:

(use-package! gptel
  :config
  (let ((backend (gptel-make-openai                    ;Not a typo, same API as OpenAI
                  "llama-cpp"                          ;Any name
                  :stream t                            ;Stream responses
                  :protocol "http"
                  :host "localhost:8081"               ;Llama.cpp server location, typically localhost:8080 for Llamafile
                  :key nil                             ;No key needed
                  :models '("test"))))                   ;Any names, doesn't matter for Llama

    (setq-default gptel-backend backend
                  gptel-model   "test")))
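
With the ssh tunnel from the previous step running, M-x gptel opens a dedicated chat buffer against this backend, and gptel-send can be used from any other buffer.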
### FILE: /etc/systemd/system/file_monitor.service
[Unit]
Description=Monitor a file and shut down if it has not been modified for more than 1 hour
[Service]
Type=simple
ExecStart=/bin/bash /usr/local/bin/monitor_file.sh "/tmp/requests.log"
Restart=always
RestartSec=60
[Install]
WantedBy=multi-user.target
### FILE: /etc/systemd/system/launch_llamafile.service
[Unit]
Description=Dolphin llamafile server.
[Service]
ExecStart=/bin/bash /home/romeo/dolphin-2.5-mixtral-8x7b.Q5_K_M.llamafile --port 8081 -ngl 35
User=romeo
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
### FILE: /usr/local/bin/monitor_file.sh
#!/bin/bash
# Shut the instance down if $1 has not been modified for more than an hour.
# file_monitor.service re-runs this check every 60 seconds (Restart=always).
FILE=$1
# Create the file if it doesn't exist yet; don't touch an existing one,
# since that would reset its modification time and defeat the check.
[ -e "$FILE" ] || touch "$FILE"
MODTIME=$(date -r "$FILE" +%s)   # last modification time, epoch seconds
NOW=$(date +%s)
DIFF=$((NOW - MODTIME))
if [ "$DIFF" -gt 3600 ]; then
    echo "Shutting down due to inactivity of ${FILE}"
    shutdown now
fi
### FILE: /usr/local/bin/monitor_ports.sh
#!/bin/bash
# -l makes tcpdump line-buffered, so the log's mtime is updated as soon as a request arrives.
tcpdump -i lo -l 'dst port 8081' >> /tmp/requests.log
### FILE: /etc/systemd/system/port_monitor.service
[Unit]
Description=Monitor port 8081.
[Service]
Type=simple
ExecStart=/bin/bash /usr/local/bin/monitor_ports.sh
Restart=on-failure
[Install]
WantedBy=multi-user.target
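
To sanity-check the setup, run a request against the server on the instance and confirm that the log's modification time moves (the /health endpoint is served by the llama.cpp server bundled in the llamafile; if your build lacks it, any chat request through gptel works just as well):

curl localhost:8081/health
stat -c %y /tmp/requests.log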