- Installed Ubuntu-24.04 from the Microsoft Store: typed "WSL" into the store search and selected the most recent Ubuntu
- Windows Terminal -> wsl -d Ubuntu-24.04 (using -d to specify the distro name, I have multiple distros)
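Before installing anything it's worth confirming the GPU is visible inside WSL (WSL2 passes the Windows NVIDIA driver through, so this should work with no extra driver install in the distro):
nvidia-smi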
- Installing SGLang
Copying and pasting from the docs (https://docs.sglang.ai/start/install.html) didn't quite work; this sequence did:
sudo apt install python3-pip
pip3 install sgl-kernel --force-reinstall --no-deps --break-system-packages
pip3 install "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python --break-system-packages
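A quick sanity check that the install took (assuming the package exposes __version__, which recent releases do):
python3 -c "import sglang; print(sglang.__version__)"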
- HF CLI Login to Download Models from the Hub
sudo apt install pipx
pipx install "huggingface_hub[cli]"  # to get the huggingface-cli command on PATH
Close the terminal and reconnect so the PATH change takes effect
huggingface-cli login # follow the instructions: create and copy a new token from the HF web site
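Gemma is a gated model, so its license also has to be accepted on the HF model page before the download will go through. To confirm the login took:
huggingface-cli whoami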
Prompt "Tell me about Mars!" (no system message, Gemma fails when one is provided), Gemma 2 9B, through OpenAI Endpoint, Token per Second (measuring via https://github.com/maxim-saplin/py_chat_ui TPS counter in the UI), single user, no parallel generations
- Safetensors, original, bf16
python3 -m sglang.launch_server --model-path google/gemma-2-9b-it --host 0.0.0.0
- 4-5 minutes to load, VRAM 22.1GB
- 47.1, 47.52, 47.91, 46.89 TPS
- Safetensors, quantized, fp8
python3 -m sglang.launch_server --model-path google/gemma-2-9b-it --quantization fp8 --host 0.0.0.0
- 4-5 minutes to load, VRAM 22.5GB
- 71.43, 70.39, 71.21 TPS
- Safetensors, quantized, w8a8_int8
python3 -m sglang.launch_server --model-path google/gemma-2-9b-it --quantization w8a8_int8 --host 0.0.0.0
- FAILED
- GGUF, quantized, Q8_0
python3 -m sglang.launch_server --model-path /mnt/g/models/lmstudio-community/gemma-2-9b-it-GGUF/gemma-2-9b-it-Q8_0.gguf --host 0.0.0.0
- FAILED
LM Studio v0.3.9 (Windows), CUDA llama.cpp LM Runtime v1.15.3
- GGUF, q8, 8k ctx, Flash Attention
- Load time ~10 seconds, VRAM 14GB
- 64.86, 64.69, 55.99, 59.57, 65.59, 64.98 TPS
- GGUF, f16, 8k ctx, Flash Attention
- Load time ~20 seconds, VRAM 22.1GB
- 42.75, 44.22, 44.98 TPS
- VRAM consumption at idle: ~1.2-1.4GB
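LM Studio's local server speaks the same OpenAI-compatible API (default port 1234 once the server is started), so the curl timing probe above works here too; only the base URL and model name change:
time curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-2-9b-it", "messages": [{"role": "user", "content": "Tell me about Mars!"}]}' \
  | python3 -c "import sys, json; print(json.load(sys.stdin)['usage'])"
# "model" must match the identifier LM Studio shows for the loaded model; gemma-2-9b-it is a guess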