- Model: Qwen/Qwen2.5-7B-Instruct-AWQ (4-bit AWQ quantisation)
- Hardware: Kaggle T4 GPU (15GB VRAM)
- vLLM: 0.7.1 | LMDeploy: TurbomindEngine
- Measurement: `time.perf_counter()` around full generation; vLLM run in offline `LLMEngine` mode (scheduling latency captured, not true streaming TTFT)
- Batch size: 1 | Runs per prompt: 1 (first-run warmup included)
- SGLang: attempted, but the `sgl_kernel` wheel on Kaggle is compiled for SM100 (H100); T4 is SM75, so it is incompatible at import time
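The measurement approach above can be sketched as a small timing harness: `time.perf_counter()` wraps the entire generation call, so engine scheduling and queueing time are included in the reported latency (this is not a streaming TTFT). The `fake_generate` stub below is a hypothetical stand-in for an engine call such as vLLM's `llm.generate(...)`; the harness itself is engine-agnostic.

```python
import time

def time_generation(generate_fn, prompt):
    """Time one full generation with time.perf_counter().

    The timer wraps the whole call, so scheduling/queueing latency
    inside the engine is captured too, matching the methodology
    described above (not a true streaming time-to-first-token).
    """
    start = time.perf_counter()
    output = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return output, elapsed

# Hypothetical stand-in for an engine call like vLLM's llm.generate(...).
def fake_generate(prompt):
    time.sleep(0.01)  # simulate decode time
    return prompt.upper()

out, secs = time_generation(fake_generate, "hello")
print(f"{out!r} took {secs:.3f}s")
```

Because batch size is 1 and each prompt is run once, the first measured run also pays any warmup cost (CUDA graph capture, kernel compilation), as noted above.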