
@maxim-saplin
Last active February 16, 2025 20:19

Install on WSL2

  1. Installed Ubuntu-24.04 from the Microsoft Store -> typed "WSL" and selected the most recent Ubuntu

  2. Windows Terminal -> wsl -d Ubuntu-24.04 (using -d to specify the distro name; I have multiple distros)

  3. Installing SGLang

Copying and pasting from the docs (https://docs.sglang.ai/start/install.html) didn't quite work; this did:

sudo apt install python3-pip
pip3 install sgl-kernel --force-reinstall --no-deps --break-system-packages
pip3 install "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python --break-system-packages

  4. HF CLI Login to Download Models from the Hub

sudo apt install pipx
pipx install huggingface_hub[cli] # installs the huggingface-cli command

Close Terminal and reconnect

huggingface-cli login # follow instructions, create and copy new token from HF web site

Testing Speed

Prompt: "Tell me about Mars!" (no system message; Gemma fails when one is provided). Model: Gemma 2 9B, queried through the OpenAI-compatible endpoint. Tokens per second measured with the TPS counter in the https://github.com/maxim-saplin/py_chat_ui UI; single user, no parallel generations.
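The TPS number is just completion tokens divided by wall-clock generation time; the token count itself comes from the client UI's counter. A minimal sketch of the arithmetic:

```python
# Tokens-per-second: completion tokens over wall-clock generation time.
# The token count comes from the API/UI; this is just the bare arithmetic.
def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return completion_tokens / elapsed_s

# e.g. 380 tokens generated in 8.0 s -> 47.5 tok/s
print(tokens_per_second(380, 8.0))
```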

SGLang

  • Safetensors, Orig, bf16, python3 -m sglang.launch_server --model-path google/gemma-2-9b-it --host 0.0.0.0
    • 4-5 minutes to load, VRAM 22.1GB
    • 47.1, 47.52, 47.91, 46.89
  • Safetensors, Quant, fp8, python3 -m sglang.launch_server --model-path google/gemma-2-9b-it --quantization fp8 --host 0.0.0.0
    • 4-5 minutes to load, VRAM 22.5GB
    • 71.43, 70.39, 71.21
  • Safetensors, Quant, w8a8_int8, python3 -m sglang.launch_server --model-path google/gemma-2-9b-it --quantization w8a8_int8 --host 0.0.0.0
    • FAILED
  • GGUF, Quant Q8, python3 -m sglang.launch_server --model-path /mnt/g/models/lmstudio-community/gemma-2-9b-it-GGUF/gemma-2-9b-it-Q8_0.gguf --host 0.0.0.0
    • FAILED
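Once a launch succeeds, the server exposes an OpenAI-compatible API (port 30000 is SGLang's default). A stdlib-only query sketch; the `ask` helper, base URL, and port are illustrative, adjust to your setup:

```python
# Sketch of querying the SGLang OpenAI-compatible endpoint.
# Assumptions: server on localhost:30000 (SGLang default), model name
# matches the one passed to --model-path. Uses only the standard library.
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "google/gemma-2-9b-it") -> dict:
    # Deliberately no system message: Gemma fails when one is provided.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt: str, base_url: str = "http://localhost:30000/v1") -> str:
    payload = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```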

LM Studio

v0.3.9 (Windows), CUDA llama.cpp LM Runtime v1.15.3

  • GGUF, q8, 8k ctx, Flash Attention
    • Load time ~10 seconds, VRAM 14GB
    • 64.86, 64.69, 55.99, 59.57, 65.59, 64.98
  • GGUF, f16, 8k ctx, Flash Attention
    • Load time ~20 seconds, VRAM 22.1GB
    • 42.75, 44.22, 44.98
  • VRAM consumption at idle: ~1.2-1.4GB
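Averaging the raw runs recorded above makes the comparison easier to eyeball:

```python
# Mean TPS over the raw runs listed above (no new measurements).
from statistics import mean

runs = {
    "SGLang bf16": [47.1, 47.52, 47.91, 46.89],
    "SGLang fp8": [71.43, 70.39, 71.21],
    "LM Studio GGUF q8": [64.86, 64.69, 55.99, 59.57, 65.59, 64.98],
    "LM Studio GGUF f16": [42.75, 44.22, 44.98],
}
for name, tps in runs.items():
    print(f"{name}: {mean(tps):.1f} tok/s avg")
```

Roughly: fp8 under SGLang is ~1.5x the bf16 speed, and llama.cpp's q8 sits between the two while loading far faster (~10s vs 4-5 minutes) and using less VRAM (14GB vs ~22GB).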