- Installed Ubuntu-24.04 from the Microsoft Store: typed "WSL" into the store search and selected the most recent Ubuntu
- Windows Terminal -> wsl -d Ubuntu-24.04 (using -d to specify the distro name, I have multiple distros)
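Before installing anything it's worth confirming the GPU is visible inside WSL (WSL2 passes the Windows NVIDIA driver through, so this should work with no extra driver install in the distro):
nvidia-smi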
- Installing SGLang
Copying and pasting from the docs (https://docs.sglang.ai/start/install.html) didn't quite work; this sequence did:
sudo apt install python3-pip
pip3 install sgl-kernel --force-reinstall --no-deps --break-system-packages
pip3 install "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python --break-system-packages
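A quick sanity check that the install took (assuming the package exposes __version__, which recent releases do):
python3 -c "import sglang; print(sglang.__version__)"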
- HF CLI Login to Download Models from the Hub
sudo apt install pipx
pipx install "huggingface_hub[cli]"  # to get the huggingface-cli command on PATH
Close the terminal and reconnect so the PATH change takes effect
huggingface-cli login # follow the instructions: create and copy a new token from the HF web site
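Gemma is a gated model, so its license also has to be accepted on the HF model page before the download will go through. To confirm the login took:
huggingface-cli whoami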
Prompt "Tell me about Mars!" (no system message, Gemma fails when one is provided), Gemma 2 9B, through OpenAI Endpoint, Token per Second (measuring via https://github.com/maxim-saplin/py_chat_ui TPS counter in the UI), single user, no parallel generations
- Safetensors, original, bf16
python3 -m sglang.launch_server --model-path google/gemma-2-9b-it --host 0.0.0.0
- 4-5 minutes to load, VRAM 22.1GB
- 47.1, 47.52, 47.91, 46.89 TPS
- Safetensors, quantized, fp8
python3 -m sglang.launch_server --model-path google/gemma-2-9b-it --quantization fp8 --host 0.0.0.0
- 4-5 minutes to load, VRAM 22.5GB
- 71.43, 70.39, 71.21 TPS
- Safetensors, quantized, w8a8_int8
python3 -m sglang.launch_server --model-path google/gemma-2-9b-it --quantization w8a8_int8 --host 0.0.0.0
- FAILED
- GGUF, quantized, Q8_0
python3 -m sglang.launch_server --model-path /mnt/g/models/lmstudio-community/gemma-2-9b-it-GGUF/gemma-2-9b-it-Q8_0.gguf --host 0.0.0.0
- FAILED
LM Studio v0.3.9 (Windows), CUDA llama.cpp LM Runtime v1.15.3
- GGUF, q8, 8k ctx, Flash Attention
- Load time ~10 seconds, VRAM 14GB
- 64.86, 64.69, 55.99, 59.57, 65.59, 64.98 TPS
- GGUF, f16, 8k ctx, Flash Attention
- Load time ~20 seconds, VRAM 22.1GB
- 42.75, 44.22, 44.98 TPS
- VRAM consumption at idle: ~1.2-1.4GB
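LM Studio's local server speaks the same OpenAI-compatible API (default port 1234 once the server is started), so the curl timing probe above works here too; only the base URL and model name change:
time curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-2-9b-it", "messages": [{"role": "user", "content": "Tell me about Mars!"}]}' \
  | python3 -c "import sys, json; print(json.load(sys.stdin)['usage'])"
# "model" must match the identifier LM Studio shows for the loaded model; gemma-2-9b-it is a guess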