Devansh Jindal (devansh21640) / Results.md
Created April 9, 2026 13:38
Challenge: LLM Inference Benchmark

# Inference Benchmark: vLLM vs LMDeploy on Kaggle T4 (AWQ 7B)

## Setup

- Model: Qwen/Qwen2.5-7B-Instruct-AWQ (4-bit AWQ quantisation)
- Hardware: Kaggle T4 GPU (15 GB VRAM)
- Engines: vLLM 0.7.1 | LMDeploy (TurboMind engine)
- Measurement: `time.perf_counter()` around full generation; vLLM was run in offline `LLMEngine` mode, so scheduling latency is captured but not true streaming TTFT
- Batch size: 1 | Runs per prompt: 1 (first-run warmup included)
- SGLang: attempted, but the `sgl_kernel` wheel available on Kaggle is compiled for SM100; the T4 is SM75, so the package fails at import time with an architecture mismatch
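The measurement method above can be sketched as a minimal harness. This is an illustrative reconstruction, not the actual benchmark script: `generate_fn` stands in for the real engine call (e.g. vLLM's `llm.generate(...)` or an LMDeploy pipeline), and the function name is my own.

```python
import time

def time_generation(generate_fn, prompt, runs=1):
    """Wrap a full generation call in time.perf_counter(), as in the setup.

    With runs=1 the single measured run includes warmup (model/kernel
    compilation, cache allocation), matching "first-run warmup included".
    Returns the last output and its wall-clock latency in seconds.
    """
    out, elapsed = None, 0.0
    for _ in range(runs):
        t0 = time.perf_counter()
        out = generate_fn(prompt)
        elapsed = time.perf_counter() - t0
    return out, elapsed

# Example with a trivial stand-in "engine":
out, dt = time_generation(lambda p: p.upper(), "hello")
```

Note that because vLLM is driven offline here, `elapsed` bundles scheduling and decode time together; per-token streaming latency (TTFT) would need a streaming API instead.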
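The SGLang failure comes down to CUDA compute capability: a kernel wheel built only for a newer architecture cannot load on older hardware. A deliberately simplified sketch of that check (real CUDA loading also considers forward-compatible PTX embedded in the binary; the function below is an assumption, not SGLang's actual logic):

```python
def kernel_wheel_loadable(device_cc: int, wheel_cc: int) -> bool:
    """Simplified compatibility check between a GPU's compute capability
    and the architecture a prebuilt kernel wheel targets.

    A cubin-only wheel compiled for a newer architecture than the device
    cannot run on it, which is why an SM100-targeted sgl_kernel wheel
    fails on an SM75 T4.
    """
    return device_cc >= wheel_cc

# T4 is SM75; the Kaggle sgl_kernel wheel targets SM100.
print(kernel_wheel_loadable(75, 100))  # False: fails at import time
```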